Connections guide for CDC connectors
This quick guide discusses Decodable’s CDC connectors for relational databases. It looks at their common behavior, explains when and how to make changes, and addresses typical issues you may face when deploying Decodable connections based on these connectors.
Common behavior
When activating a CDC source connection, the stream mappings and stream schema are bound to the connection at activation time. The configured tables and the corresponding stream fields are captured, and that configuration applies until the next activation of the connection.
Schema changes in the upstream database tables - for example, new columns being added or column data types being changed - won't be picked up automatically. This means that columns added upstream to tables tracked by the connection are ignored until they're manually added to the corresponding Decodable stream and the connection is restarted.
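As an illustration, consider a hypothetical Postgres table mapped by a connection (the table and column names here are made up for this guide, not part of Decodable's API). Only the columns present at activation time are captured; a column added later stays invisible to the connection until the stream and connection are updated:

```sql
-- Hypothetical upstream table; its columns are bound to the stream at activation time.
CREATE TABLE orders (
    id         BIGINT PRIMARY KEY,
    customer   TEXT NOT NULL,
    quantity   INT,
    order_date TEXT,
    amount     NUMERIC(10, 2)
);

-- A column added while the connection is running is NOT picked up automatically.
-- It only shows up downstream after the field is added to the mapped Decodable
-- stream and the connection is restarted.
ALTER TABLE orders ADD COLUMN currency TEXT;
```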
Connection failures
A connection will fail when it's unable to successfully deserialize an input record based on the specified stream mappings and schema. This typically happens if a backward incompatible change is made to the tables or columns bound to the connection, for example:
- a NOT NULL column is dropped
- a column's data type is changed from String to Date
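As a sketch, using the hypothetical orders table from above, these Postgres DDL statements are examples of backward incompatible changes that typically lead to such failures:

```sql
-- Dropping a NOT NULL column that is mapped to a stream field:
ALTER TABLE orders DROP COLUMN customer;

-- Changing a column's data type from a string to a date, while the stream
-- schema (and any consumers) still expect a string:
ALTER TABLE orders ALTER COLUMN order_date TYPE DATE USING order_date::DATE;
```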
Updating connections
Any change to a connection’s configuration requires a restart of the connection. While some updates are state compatible, others require resetting the state.
Read Connection starting state and offsets to learn more.
Supported modifications without resetting the state
The following modifications can be made to a connection without resetting its state:
- Adding new stream mappings: For specific CDC source connectors - currently MySQL and PostgreSQL - newly added stream mappings will snapshot the newly mapped tables before consuming the corresponding change events. This ensures that all the existing records in the newly added tables are ingested first. It's possible to opt out of this snapshot behavior for new tables by setting the scan.startup.mode property of the connection to latest-offset (it defaults to initial).
- Removing stream mappings: When a stream mapping is removed, the connection will stop processing the change events from the previously mapped table. Any other table-to-stream mappings are unaffected by this change.
- Backward compatible column type changes: Removing a previously existing constraint, for example NOT NULL, or widening a column's data type, for example changing INT to BIGINT, as shown in the example below.
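For illustration, still using the hypothetical orders table from earlier, these Postgres statements are backward compatible in the sense described above:

```sql
-- Removing a previously existing constraint (the column itself stays in place):
ALTER TABLE orders ALTER COLUMN customer DROP NOT NULL;

-- Widening a column's data type, for example from INT to BIGINT:
ALTER TABLE orders ALTER COLUMN quantity TYPE BIGINT;
```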
Operations requiring resetting the state
The following modifications can't be made to a connection without resetting its state:
- Ingesting a new column: Resetting the state of the connection is the safest way to ingest a newly added column for all records. Without resetting the state, some records may have missing values for the newly added column.
- Backward incompatible column type changes: Any backward incompatible data type change has ripple effects on the downstream jobs. The safest approach is to reset the state to make sure all the records are reprocessed with the correct data type.
Reprocessing guide
In an end-to-end data flow, intermediate state is held in several places. These include:
- connections
- streams
- (custom or SQL) pipelines, if present
- external sink systems written to
When resetting the state for a specific connection, it’s important that the derived or propagated state downstream of the connection in question is cleared as well.
Option 1: Reprocess with existing resources
With the following steps, you can reprocess the data "in place" using your existing Decodable resources:

- Stop the source connection and all the pipelines or connections downstream.
- Clear all streams downstream of the originating source connection.
- Truncate the external resources that the sink connections at the end of the data flow are writing to (see the example after these steps).
- Restart the source connection and all the pipelines or connections downstream with the start options "discard state" and "earliest offset".
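For the truncation step, if a sink connection writes to a relational database, the operation might look like this minimal sketch (the table name is an assumption for illustration):

```sql
-- Remove all previously written rows from the hypothetical sink table so the
-- reprocessed data starts from a clean slate:
TRUNCATE TABLE orders_enriched;
```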
The Lineage View is helpful for getting visibility into more complex end-to-end data flows and for identifying all the involved downstream resources.
Option 2: Create a parallel data flow
Alternatively, you can create a new, parallel data flow while keeping the original one running. Once the new data flow has finished reprocessing, you can switch over to it.

To switch from the original to the new data flow:
- Stop all the connections and pipelines in the original data flow.
- Delete or rename the Postgres tables in the original data flow.
- Stop the Postgres sink connection in the new data flow.
- Rename the Postgres tables in the new data flow to their original names (see the rename example after these steps).
- Update the stream mappings of the Postgres sink in the new data flow to refer to the updated table names.
- Restart the Postgres sink.
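Assuming hypothetical table names - orders_enriched for the original data flow and orders_enriched_v2 for the parallel one - the rename step of the switchover might look like this sketch:

```sql
-- Move the original table out of the way (or drop it if it's no longer needed):
ALTER TABLE orders_enriched RENAME TO orders_enriched_old;

-- Give the table written by the new data flow the original name, so consumers
-- keep querying the same table name after the switchover:
ALTER TABLE orders_enriched_v2 RENAME TO orders_enriched;
```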
Common issues
Missing values for added column in some rows
This happens when a column is added to the upstream table while the connection is running. The connection doesn't automatically pick up the new column until it's restarted with the new field added to the mapped stream. Even then, the connection won't backfill the new column for any records processed prior to the restart.

Mitigation: To solve this issue, reset the state and restart the connection to reprocess the data.
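To check whether a sink table is affected, a query along these lines can surface rows that were written before the new column was mapped (the table and column names are assumptions for illustration):

```sql
-- Rows processed before the new "currency" column was added to the stream
-- carry no value for it in the sink table:
SELECT count(*) AS rows_missing_currency
FROM orders_enriched
WHERE currency IS NULL;
```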
Records deleted upstream aren’t deleted downstream
If a source connection’s state is reset and restarted without first truncating downstream resources, then record deletions can be missed.

Resetting a connection’s state will cause it to snapshot all existing records in a table, and pick up all changes from that point in time onward. The connection loses context about previous records that are no longer in the table, and won’t issue new deletes for them.
Mitigation: You must manually truncate downstream resources prior to a restart, in order to ensure a clean state.
The same behavioral issue may apply if a stream mapping is first removed from the connection, then re-added. This will cause the connection to re-snapshot the table and treat it as new - possibly resulting in missed deletions. Downstream truncation is required when reprocessing a single table in this manner.
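If you suspect deletions were missed, an anti-join between the sink table and the upstream table can surface rows that no longer exist upstream. This is only a sketch: it assumes both tables are reachable from the same database (for example via a copy or a foreign data wrapper), and the names are made up for illustration:

```sql
-- Rows still present in the sink table although they were deleted from the
-- upstream "orders" table:
SELECT s.id
FROM orders_enriched AS s
LEFT JOIN orders AS o ON o.id = s.id
WHERE o.id IS NULL;
```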
Downstream job failure due to deserialization errors
This is either because a schema change wasn’t propagated, or because a schema change is incompatible with data already in the streams.
Schema change not propagated
If you make a schema change, Decodable resources won't pick it up automatically. This may lead to deserialization failures.

Mitigation: To fix this, update the stream schema and restart the Decodable resource writing into it.
Schema change incompatible with existing data
If you make a backwards-incompatible schema change, you must clear streams with that schema. Otherwise, consumers of those streams will attempt to deserialize the old data with the new schema.

Mitigation: To prevent this, clear the stream before making any such changes. If this has already happened, the safest option is to reprocess the data.