About Change Data Capture and Change Streams

Making sure that your data remains up-to-date is one of the challenges typically associated with traditional database systems. One way to ensure that your data remains fresh, valuable, and relevant is through Change Data Capture (CDC).

Change Data Capture, or CDC for short, is a method that detects changes in a database and makes those changes available for further processing. Using Decodable, you can keep track of changes made to data, process those changes via pipelines, and publish them in a downstream destination. You can also trigger behavior in a downstream destination based on changes to the original database, such as a deleted account name or an updated inventory count. This downstream destination could be another database, a search index, a cache, or some other consumer system.

When you choose to use Decodable for your CDC use cases, you are also able to take advantage of the unique flexibility that Decodable offers. You can build workflows that are completely decoupled from the CDC source. For example, you can create multiple pipelines that process and transform data from the same CDC source and send that transformed data to different destinations. Or, you can use Decodable to join CDC data from multiple CDC sources and send that joined data to a downstream destination like ElasticSearch or to a data warehouse to maintain a materialized view.

In addition, Decodable guarantees that the data from CDC sources will be delivered to Decodable with exactly-once semantics. Data from Decodable to a CDC-compatible destination is delivered with either at-least-once or exactly-once semantics, ensuring that no records are missed or lost. For the exact delivery guarantee, see the documentation for the specific destination.

Overview

The following diagram shows a high level overview of how Decodable fits in with change data capture use cases.

cdcworkflow diagram
  1. Decodable uses the log-based change data capture method. Databases contain database logs, also known as transaction logs, that store all changes made to a database. These database changes are stored in their exact order of application which enables databases to be recovered when they crash.

  2. A Decodable connector reads the database changes — Create, Update, and Delete — from the database logs. Decodable connectors leverage Debezium, an open source change data capture platform, to read and interpret the database changes. Any connector that accepts Debezium-formatted records or any connector with the word "CDC" in its name is able to read, parse, and send records representing database change logs to a stream.

  3. The pipeline processes those records in the stream in real-time. Alternatively, if you don’t want to do any additional processing to your data, you can send records directly from a stream into an external downstream system via a connector.

  4. The connector sends the data to a downstream external system.

About Change Streams

A change stream is a type of stream comprising of change records, which are records containing both the modifications made to the database (inserts, updates, and deletes) as well as what the data looked like before and after modification.

Change streams provide a flexible, scalable way to stream data changes to other destinations. Common use cases include:

  • Replicate data changes to a data warehouse.

  • Perform real-time data enrichment by joining a stream with an external table.

  • Build an audit log so that you have a persistent trail of all changes to a database.

Change Record schema

Decodable processes CDC records by using connectors that support Debezium, an open source platform for handling change data capture records. Debezium provides a unified format schema for change records. Each change record has the following physical schema.

Field Description

op

Indicates what modification was performed in the database. One of the following: c: Create (insert). Insert a new record. u: Update. Update an existing record. d: Delete. Delete an existing record.

This field is also used by Decodable to know what modification to do in a downstream destination.

before

The old values of the row’s columns.

after

The new values of the row’s columns.

+

For change streams, you must specify a field in the change stream records' logical schema to use as the primary key. A primary key is a field that contains a value that can be used to unique identify each row in a table. Then, Decodable can use the primary key in order to interpret which modification, based on the op field, to perform. You must designate a primary key in order for Decodable to insert, update, or delete data from a downstream database table.

Physical Schema

In addition to logical schema, users can query the physical schema of a stream. The physical schema of an Append stream is the same as the the logical schema. The physical schema of a Change stream contains the Debezium schema wrapper as change streams are stored in the Debezium format internally. Physical schema is especially useful when to_append() is used to allow Process a change stream as an append stream.

For example if a change stream has two fields:

schema
   0  id              INTEGER PRIMARY KEY
   1  value           STRING

The physical schema of the stream would be

schema
   0  OP              STRING
   1  before          ROW<id: INTEGER, value: STRING>
   2  after           ROW<id: INTEGER, value: STRING>

Change Record examples

{"op":"c","before":null,"after":{"id":1,"user":"alice","status":"NEW","email":"alice@example.com"},"ts_ms":0}
{"op":"c","before":null,"after":{"id":2,"user":"alice","status":"NEW","email":"alice@example.com"},"ts_ms":1}
{"op":"c","before":null,"after":{"id":3,"user":"bob","status":"NEW","email":"bob@example.com"},"ts_ms":2}
{"op":"u","before":{"id":2,"user":"alice","status":"NEW"},"after" {"id":2,"user":"alice","status":"SHIPPED","email":"alice@example.com"},"ts_ms":3}
{"op":"u","before":{"id":2,"user":"alice","status":"SHIPPED"},"after" {"id":2,"user":"alice","status":"DELIVERED","email":"alice@example.com"},"ts_ms":4}
{"op":"d","before":{"id":2,"user":"alice","status":"DELIVERED"},"after": null,"ts_ms":4}
{"op":"c","before": null, "after": {"id":2,"user":"alice","status":"DELIVERED","email":"alice.li@example.com"}, "ts_ms":4}

Why does my data contain nulls?

Change records contain nulls for specific table modifications.

  • Records representing a create or insert modification ("op":"c") will have a null "before" value, since the record did not exist in the table before.

  • Records representing a delete modification ("op":"d") will have a null "after" value, since the record no longer exists in the table.

  • Records representing an update modification ("op":"u") will usually not contain any nulls, because they are updating an existing record. The exception is if there was a specific field in the record that was null before or after the modification.

Which connectors can handle change streams?

The following connectors can either produce or send change records in or from change streams.

Source connectors

The following data sources can produce change records which can then be sent to change streams.

See the relevant documentation for each connector for more information. For data sources not explicitly covered by the above connectors, you must use a pipeline to convert your append stream into a change stream. You can then send the data from that change stream into a compatible destination, ensuring that your data remains up-to-date across different systems. See Process an append stream as a change stream.

Sink connectors

Decodable can easily send change records to the following destinations. The connectors for these destinations are able to interpret the change records and publish the modifications described downstream.

See the relevant documentation for each connector for more information. If you want to send change records to a destination not on this list, you must use a pipeline to convert your change stream into an append stream. By doing so, your change records are sent to a destination without altering any of existing records stored in that destination. This is useful if you want to store a full history of changes somewhere. See Process a change stream as an append stream for more information.

Viewing change records

You can preview both the change records as well as the updated table by going into the detailed view of a change stream.

  1. From the Streams page, select the name of the change stream that you’d like to view.

  2. Select Run Preview to send a sample of records from the incoming data. These sample records are displayed in the Preview panel.

  3. Select the Table dropdown to toggle between the Table and Change. When you preview a change stream, you are able to see both the Debezium-formatted change records as well as the changed table.

As an example, let’s assume that you have the following Debezium-formatted change records. These records represent the status of an order made at your store.

{"before":null,"after":{"order_id":2,"order_status":"DELIVERED","order_update_ts":"2022-11-17 19:27:54","product_id":3,"product_name”:”Monopoly”,”price”:50,”user_id":2,"user_name”:”Michael”,”zip_code”:”94105},”op":"c"}

{"before":{"order_id":2,"order_status":"SHIPPED","order_update_ts":"2022-11-16 23:25:37","product_id":3,"product_name”:”Monopoly”,”price”:50,”user_id":2,"user_name”:”Michael,”zip_code":"94105"},"after":null,"op":"d"}

When you toggle to the Table view, you’ll see a preview of the changed table. Notice that the table only contains one entry representing the latest status for this specific order. However, you can also view the historical values for a given row in the table by selecting >.

cdc workflow example