A pipeline is a streaming SQL query that processes data from one or more input streams and writes the results to an output stream. Pipelines must be activated to begin processing. Decodable ensures that your pipeline is fault-tolerant and delivers data exactly once by handling pipeline state management, consistent checkpointing, and recovery behind the scenes.
Decodable uses SQL to process data, which should feel familiar to anyone who has used relational database systems. The primary differences you'll notice are that:
- You activate a pipeline to start it, and deactivate a pipeline to stop it.
- All pipeline queries specify a source and a sink.
- Certain operations - notably JOINs and aggregations - must include windows.
Unlike relational databases, all pipelines write their results into an output (or sink). As a result, all pipelines are a single statement in the form INSERT INTO <sink> SELECT ... FROM <source>, where sink and source are streams you've defined.
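As a minimal sketch of this statement form, assume two hypothetical streams, http_events (the source) and http_errors (the sink). The stream names and fields below are illustrative assumptions, not taken from any real account:

```sql
-- Hypothetical streams: http_events (source) and http_errors (sink).
-- Field names are assumptions for illustration only.
INSERT INTO http_errors
SELECT
  request_id,
  status_code,
  path
FROM http_events
WHERE status_code >= 500;
```

In a pipeline like this, the columns produced by the SELECT would need to be compatible with the schema defined for the sink stream.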
Check out the SQL Reference to help build your pipeline.
Pipelines use the same types as streams. See Streams - Data Types for more information.
Check out the Functions Reference for details of functions you can use in your pipeline SQL.
Pipelines process data exactly once. This means that, even in the case of temporary failures or pipeline restarts, every record will appear to be processed once and only once. For example, if you have a COUNT() in your pipeline, it will never under- or over-count the records in its input stream, nor will it generate duplicates in its output stream. That said, some connectors may not guarantee exactly-once delivery, in which case duplicate records may be introduced as data is moved on or off the Decodable platform via a connection. Check the connector documentation for more information.
Under the hood, Decodable uses a distributed checkpointing system that takes consistent snapshots of the pipeline's state at regular intervals. This includes which data has already been processed, intermediate calculation results, and many other details. When a pipeline recovers from a failure or restart, it loads the last successful snapshot, rewinds the stream to when the snapshot was taken, and begins processing again.
Certain kinds of pipelines are very sensitive to how time is handled. First, there are two notions of time to consider:
- Processing time is the "wall clock" time when the system processes the event.
- Event time is the timestamp present in the data itself. Typically, this is a field named timestamp or similar. It represents the time that the described event occurred or when it originated from an upstream system.
We recommend that pipelines use event time rather than processing time when performing JOINs or aggregations over periods of time: we usually intend to group events based on when they happened rather than when they are processed.
Event time is specified on the input stream using the watermark flag. See the description of the stream watermark for details.
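The following sketch shows what an event-time aggregation might look like. It assumes Flink-style windowing table-valued functions are accepted (check the SQL Reference for the exact syntax supported). The streams http_events and requests_per_minute, and the watermarked field event_time, are hypothetical names:

```sql
-- Hypothetical streams and fields. event_time is assumed to be the
-- field declared as the watermark on the http_events stream.
INSERT INTO requests_per_minute
SELECT
  window_start,
  window_end,
  COUNT(*) AS request_count
FROM TABLE(
  TUMBLE(TABLE http_events, DESCRIPTOR(event_time), INTERVAL '1' MINUTE)
)
GROUP BY window_start, window_end;
```

Because the window is defined over event_time, each output record counts the events that occurred within a given one-minute interval, regardless of when the pipeline happened to process them.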