Schema migrations

You can update a stream’s schema, or change the partition key or primary key used in the stream, from the Streams page in Decodable Web or with the Decodable CLI. Updating a schema is commonly known as a schema migration, because you migrate the schema from one structure to another. Regardless of which tool you use to update the stream, there are three general workflows to choose from, depending on whether you want to prioritize convenience, correctness, or latency.

Updating an actively used stream affects the pipelines and connections that are attached to it. Depending on the change that you make, the attached connections and pipelines can become incompatible with the stream and break.

The following table describes the methods that you can use to update a stream as part of a schema migration. Each method has its own trade-offs, so make sure to review the "When to use" column.

Methods When to use

Method 1: Clear the records in the stream and discard the state of the affected connections and pipelines.

- You are sending relatively low amounts of data.

- You care more about convenience than correctness or latency.

Method 2: Recreate all of the affected Decodable resources.

- You care more about correctness than convenience or latency.

- You are sending relatively low amounts of data.

- You want to change a stream’s primary key.

Method 3: Use a blue-green deployment strategy (best practice).

- You care equally about correctness and latency.

- You are sending medium to large amounts of data.

- You want to change a stream’s primary key.

- You can tolerate the resource cost of temporarily having two environments live.


Method 1

To update a stream when convenience is the priority, do the following steps.

  1. Stop any connections or pipelines that are attached to the stream that you want to update.

  2. Clear any records that are either in the stream that you want to update or in a stream that is connected to it. In Decodable Web, select the dropdown menu (…) and then select Clear.

     image::streams/schema_migrations_clear.webp[This image shows how to navigate to the Clear button to clear a stream.]

  3. Make the desired updates to the stream.

  4. Restart any connections and pipelines that you stopped in Step 1. If you are using the Decodable CLI, make sure the --force flag is set. If you are using Decodable Web, then make sure Discard State is selected.
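The order of the four steps above can be sketched as a small script. This is a minimal sketch, not Decodable's API: the `StreamOps` interface, its method names, and the `RecordingOps` dry-run class are hypothetical stand-ins that you would replace with the equivalent Decodable CLI commands or Decodable Web actions. The `discard_state=True` argument stands in for the --force flag or the Discard State option from Step 4.

```python
from typing import Protocol, Sequence


class StreamOps(Protocol):
    """Hypothetical interface; not part of any Decodable library."""

    def stop(self, resource_id: str) -> None: ...
    def clear_stream(self, stream_id: str) -> None: ...
    def update_stream(self, stream_id: str, changes: dict) -> None: ...
    def start(self, resource_id: str, discard_state: bool) -> None: ...


class RecordingOps:
    """Dry-run stand-in that only records the calls it receives."""

    def __init__(self) -> None:
        self.calls: list[tuple] = []

    def stop(self, resource_id: str) -> None:
        self.calls.append(("stop", resource_id))

    def clear_stream(self, stream_id: str) -> None:
        self.calls.append(("clear", stream_id))

    def update_stream(self, stream_id: str, changes: dict) -> None:
        self.calls.append(("update", stream_id, changes))

    def start(self, resource_id: str, discard_state: bool) -> None:
        self.calls.append(("start", resource_id, discard_state))


def migrate_in_place(
    ops: StreamOps,
    stream_id: str,
    connected_stream_ids: Sequence[str],
    attached_ids: Sequence[str],
    changes: dict,
) -> None:
    # Step 1: stop every connection and pipeline attached to the stream.
    for resource_id in attached_ids:
        ops.stop(resource_id)
    # Step 2: clear the stream itself and any stream connected to it.
    for sid in [stream_id, *connected_stream_ids]:
        ops.clear_stream(sid)
    # Step 3: apply the schema or key changes.
    ops.update_stream(stream_id, changes)
    # Step 4: restart what was stopped, discarding the old state.
    for resource_id in attached_ids:
        ops.start(resource_id, discard_state=True)


if __name__ == "__main__":
    ops = RecordingOps()
    # All identifiers below are made up for illustration.
    migrate_in_place(
        ops, "orders", ["orders_enriched"], ["conn-1", "pipe-1"], {"add_field": "currency"}
    )
    for call in ops.calls:
        print(call)
```

Running the script against `RecordingOps` prints the planned calls in order, which lets you review the sequence before wiring in real operations.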

Method 2

The second workflow, which prioritizes correctness, involves recreating the attached connections, pipelines, and streams from scratch. The best practice is to use this workflow if you want to change a stream’s partition key. Do the following steps.

  1. Stop any connections or pipelines that are attached to the stream that you want to update.

  2. Clone all of the resources.

    1. Clone the streams with a new name. Make the changes that you wanted to make in the new stream.

    2. Clone the connections. Make sure that the cloned connections are attached to the newly cloned stream.

    3. Clone the pipelines. You must edit the cloned pipelines so that their SQL references the newly cloned stream name.

  3. Delete the older connection(s), pipeline(s), and stream(s). This is an optional clean-up step.
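Step 2.3 requires editing the SQL of each cloned pipeline by hand or with a script. The following is a minimal sketch of that rename, assuming the stream name appears in the SQL as a plain or backtick-quoted identifier. `retarget_pipeline_sql` is a hypothetical helper and the stream names are made up. It matches whole identifiers only, so a longer name such as orders_enriched is left alone. It does not parse SQL, so it would also rewrite a column or string literal spelled the same as the stream; review the result before saving the pipeline.

```python
import re


def retarget_pipeline_sql(sql: str, old_stream: str, new_stream: str) -> str:
    """Replace whole-identifier occurrences of old_stream with new_stream."""
    # The lookarounds reject matches that are part of a longer identifier.
    pattern = re.compile(r"(?<!\w)" + re.escape(old_stream) + r"(?!\w)")
    # A function replacement keeps new_stream literal (no backslash escapes).
    return pattern.sub(lambda _match: new_stream, sql)


if __name__ == "__main__":
    original = "INSERT INTO orders_enriched SELECT * FROM orders WHERE orders.id > 0"
    print(retarget_pipeline_sql(original, "orders", "orders_v2"))
```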

Method 3

The third workflow, which prioritizes both correctness and latency, uses a blue-green deployment strategy to perform schema migrations. This workflow involves not only recreating the attached connections, pipelines, and streams from scratch, but also creating a new resource in the downstream destination that the recreated Decodable resources connect to. This means that you will temporarily have two environments:

  • A blue environment, where the data flowing through the connected resources uses the old schema.

  • A green environment, where the data flowing through the connected resources uses the new schema.

The best practice is to use this workflow if you are sending medium to large amounts of data, because it minimizes risk and downtime. However, be aware that two environments are temporarily live at the same time with this strategy, which doubles the resources and cost. Do the following steps.

  1. Set up the green environment. This is the environment where data with the updated schema will flow through.

    1. Create a new resource in the destination that you want to send data to. For example, if you want to perform a schema migration and your end-to-end workflow uses MySQL as a source and Elasticsearch as a destination, then you will need to either create a new Elasticsearch instance or index to send the updated data to.

    2. Clone the connections, streams, and pipelines from the blue environment. Make the changes that you want to the cloned stream, and update the cloned connections and pipelines so that they reference the cloned stream and are compatible with its new schema.

    3. Start the connections and pipelines in the green environment.

    4. At this point, you should have a complete end-to-end green environment where the data flowing through contains the new schema.

  2. Wait until the new green environment has caught up with the old blue environment. You can determine this in one of the following ways.

    1. The most reliable way is to check in your downstream system whether the data produced by the green environment equals the data in the old blue environment. For append-only workloads, this can be as simple as comparing the counts of records, entries, or events downstream.

    2. Alternatively, check the backlog of unprocessed records in Decodable. To get this metric, navigate to the Connections page, select the sink connection associated with the green environment, and view the Total unconsumed records metric. If you are monitoring the _metrics stream, this metric corresponds to records_lag_total.

  3. When the green environment has caught up with the blue environment, switch any of your services or tasks that are running in the downstream blue environment to the green environment. For example, if you are using Elasticsearch as the destination, make sure that any production searches referencing the old blue environment are now referencing the new green environment.

  4. Shut down the blue environment by deactivating the connected resources in it.
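The catch-up check in Step 2 can be automated with a polling loop. The following is a minimal sketch, assuming an append-only workload. The three callables are hypothetical hooks that you implement yourself, for example a count query against each downstream resource and a read of the sink connection's Total unconsumed records metric (records_lag_total). The `>=` comparison and the polling parameters are assumptions for illustration, not Decodable defaults.

```python
import time
from typing import Callable


def wait_until_caught_up(
    count_blue: Callable[[], int],
    count_green: Callable[[], int],
    unconsumed_records: Callable[[], int],
    poll_seconds: float = 30.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until green has no backlog and at least as many records as blue.

    Returns True once both conditions hold, or False after max_polls attempts.
    """
    for attempt in range(max_polls):
        lag = unconsumed_records()  # backlog on the green sink connection
        blue = count_blue()         # record count in the blue destination
        green = count_green()       # record count in the green destination
        if lag == 0 and green >= blue:
            return True
        if attempt < max_polls - 1:
            sleep(poll_seconds)
    return False


if __name__ == "__main__":
    # Canned values stand in for real downstream queries.
    green_counts = iter([5, 8, 10])
    lags = iter([3, 0, 0])
    done = wait_until_caught_up(
        count_blue=lambda: 10,
        count_green=lambda: next(green_counts),
        unconsumed_records=lambda: next(lags),
        poll_seconds=0.0,
        max_polls=5,
    )
    print("caught up:", done)
```

Because the blue environment keeps receiving data while you check, treat a True result as the signal to start the switch in Step 3 rather than as proof that the two destinations are identical.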

With this migration strategy, you update the schema of your streaming workload with minimal disruption to your production workloads. The next time that you need to perform a schema migration, you can make the schema changes in the blue environment and then make the blue environment the new production environment. With the blue-green schema migration model, you alternate between the blue and green environments whenever you need to make a change to your streaming workflow.