Apache Pinot sink connector

Features

Connector name

pinot

Delivery guarantee

Exactly once (but Pinot operates as at least once)

Supported task sizes

S, M, L

Multiplex capability

A single instance of this connector can write to a single Pinot table

Supported stream types

The connector pushes a finished Pinot segment directly to Pinot at each internal Decodable checkpoint. Note that this connector uses a checkpoint interval of 5 minutes, rather than the 10 seconds that most connectors and all pipelines use by default. The longer interval supports Pinot’s need for larger segments produced less frequently.

This option is best for proofs of concept and certain moderate-throughput use cases. It’s simpler because you get near-instant results in Pinot without managing an intermediate streaming provider, which is the alternative approach described later on this page. For sustained use, it might require some additional Pinot table configuration.

It may still be valuable to configure Pinot and your Pinot Table with a rollup task.
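As a rough illustration of why a rollup task can matter, the following sketch counts how many segments accumulate in a day at a given checkpoint interval. It assumes exactly one segment is pushed per checkpoint, which is a simplification for illustration; the real count may differ with task size and parallelism. The function name is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def segments_per_day(checkpoint_interval_seconds: int) -> int:
    """Checkpoints per day, assuming one pushed segment per checkpoint."""
    return SECONDS_PER_DAY // checkpoint_interval_seconds


# This connector checkpoints every 5 minutes.
pinot_sink_segments = segments_per_day(5 * 60)
# Most connectors and all pipelines checkpoint every 10 seconds.
default_interval_segments = segments_per_day(10)

print(pinot_sink_segments)        # 288
print(default_interval_segments)  # 8640
```

Even at the 5-minute interval, a table gains hundreds of small segments per day under this assumption, which is the kind of growth a rollup task is meant to keep in check.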

Configuration properties

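Property: url
Description: Pinot Controller endpoint URL
Required: Yes

Property: auth.basic.username
Description: Username for authentication to the Pinot Controller at url
Required: You must specify either auth.basic.username and auth.basic.password, or auth.basic.token

Property: auth.basic.password
Description: Password to use with auth.basic.username
Required: See auth.basic.username

Property: auth.basic.token
Description: Token with which to connect to the Pinot Controller API
Required: See auth.basic.username

Property: table.name
Description: Name of the Pinot table, without the table type (such as OFFLINE)
Required: Yes

Property: table.type
Description: Type of the Pinot table. Typically OFFLINE; may be REALTIME in some advanced cases
Required: Yes
Default: OFFLINE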
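The rules above can be sketched as a small pre-flight check. This is not part of Decodable or Pinot: the function name, the example URL, and the example credentials are hypothetical. The sketch also assumes that a missing table.type falls back to OFFLINE, since that is described as the default.

```python
from typing import List, Mapping

REQUIRED_PROPERTIES = ("url", "table.name")
ALLOWED_TABLE_TYPES = ("OFFLINE", "REALTIME")


def validate_pinot_sink_properties(props: Mapping[str, str]) -> List[str]:
    """Return a list of problems; an empty list means the properties look consistent."""
    problems = [
        f"missing required property: {name}"
        for name in REQUIRED_PROPERTIES
        if not props.get(name)
    ]

    # Either username and password together, or a token, must be present.
    has_basic = bool(props.get("auth.basic.username")) and bool(props.get("auth.basic.password"))
    has_token = bool(props.get("auth.basic.token"))
    if not (has_basic or has_token):
        problems.append(
            "specify auth.basic.username and auth.basic.password, or auth.basic.token"
        )

    # Assumption: treat a missing table.type as the documented default, OFFLINE.
    table_type = props.get("table.type", "OFFLINE")
    if table_type not in ALLOWED_TABLE_TYPES:
        problems.append(f"unexpected table.type: {table_type}")

    return problems


example = {
    "url": "https://pinot-controller.example.com:9000",  # hypothetical endpoint
    "auth.basic.username": "decodable",                  # hypothetical credentials
    "auth.basic.password": "change-me",
    "table.name": "orders",
    "table.type": "OFFLINE",
}
print(validate_pinot_sink_properties(example))  # []
```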

Prerequisites

The Pinot instance should have a Controller API endpoint that is accessible from the Decodable network. Connectivity options include AWS PrivateLink, SSH tunnels, and allowing connections from the Decodable published IP addresses. The endpoint should be configured for authentication with a username and password.

The Pinot table should be of type OFFLINE, with a corresponding Pinot Schema that matches the schema of the connection that you create in Decodable.

Optional Pinot rollup task

To maintain Pinot query performance over time, you may need to configure the table with a MergeRollupTask. Whether this is needed depends on your use case and Pinot provider.

We recommend daily rollup as a first step, but other rollup settings may be appropriate for your use case.

Note that the Pinot provider must be configured to support this table configuration. Otherwise the configuration is ignored and no rollup occurs.
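As a minimal sketch of what a daily rollup could look like, the snippet below adds a MergeRollupTask entry to a Pinot table config held as a Python dictionary. The helper name and the example table are hypothetical. The task keys (taskTypeConfigsMap and the 1day.* settings) follow the form used in the Pinot documentation, but verify them against the Pinot version your provider runs.

```python
import copy
import json


def with_daily_rollup(table_config: dict) -> dict:
    """Return a copy of table_config with a daily MergeRollupTask added."""
    updated = copy.deepcopy(table_config)  # leave the caller's config untouched
    task_map = updated.setdefault("task", {}).setdefault("taskTypeConfigsMap", {})
    task_map["MergeRollupTask"] = {
        # "rollup" aggregates rows; "concat" merges segments without aggregating.
        "1day.mergeType": "rollup",
        "1day.bucketTimePeriod": "1d",  # merge data into one-day buckets
        "1day.bufferTimePeriod": "1d",  # only touch data older than one day
    }
    return updated


# Hypothetical OFFLINE table config, reduced to the fields relevant here.
base_config = {"tableName": "orders", "tableType": "OFFLINE"}
print(json.dumps(with_daily_rollup(base_config), indent=2))
```

The merge type is a judgment call in this sketch: rollup changes the stored rows by aggregating them, so concat may be the safer starting point if queries need row-level detail.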

For more details on rollup tasks, see the Pinot documentation.

Ingesting data from Decodable to Pinot via an intermediary streaming technology

Instead of using the Decodable connector for Pinot, you may opt to use a streaming technology supported by both Decodable and Pinot, such as Apache Kafka.

This option gives you more control for high-throughput workloads, including how and when Pinot creates segments from incoming data. This is a more typical operating mode for Pinot, and may be better supported by Pinot. However, it requires you to configure and manage the streaming service and its topics and partitions.

Example: Streaming data from Decodable to Pinot via Apache Kafka

Pinot has out-of-the-box real-time ingestion support for Kafka. This lets you consume data from streams and push it directly into the database, a process known as stream ingestion. Stream ingestion makes it possible to query data within seconds of publication, and it supports checkpoints to prevent data loss.

  1. Create a Kafka sink connection for your Decodable stream

  2. Set up stream ingestion for Pinot as follows:

    1. Create schema configuration. Schema defines the fields along with their data types. The schema also defines whether fields serve as dimensions, metrics, or timestamp.

    2. Create table configuration. The real-time table configuration consists of the following fields:

      • tableName, the name of the table where the data should flow.

      • tableType, the internal type for the table. Should always be set to REALTIME for real-time ingestion.

      • tableIndexConfig, defines which column to use for indexing along with the type of index. It has the following required fields:

        • loadMode, specifies how the segments should be loaded. Should be heap or mmap

        • streamConfig, specifies the data source along with the necessary configurations to start consuming the real-time data. The streamConfig can be thought of as the equivalent to the job spec for batch ingestion.

    3. Upload table and schema spec. Once the table and schema configurations are created, they can be uploaded to the Pinot cluster. As soon as the configurations are uploaded, Pinot will start ingesting available records from the topic.
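The steps above can be sketched as follows. The table, topic, broker, column names, and controller URL are all hypothetical. The config keys and Kafka plugin class names follow the Pinot Kafka guide but differ between Pinot versions, so treat them as assumptions to verify. Note that Pinot table configs spell the stream key streamConfigs. The sketch builds the upload requests for the controller's /schemas and /tables endpoints without sending them.

```python
import json
import urllib.request

# Hypothetical names used only for illustration.
TABLE = "orders"
TOPIC = "decodable-orders"          # topic written by the Decodable Kafka sink
BROKERS = "kafka.example.com:9092"

# Step 1: schema configuration (fields, data types, and field roles).
schema = {
    "schemaName": TABLE,
    "dimensionFieldSpecs": [{"name": "customer_id", "dataType": "STRING"}],
    "metricFieldSpecs": [{"name": "amount", "dataType": "DOUBLE"}],
    "dateTimeFieldSpecs": [{
        "name": "order_ts",
        "dataType": "LONG",
        "format": "1:MILLISECONDS:EPOCH",
        "granularity": "1:MILLISECONDS",
    }],
}

# Step 2: real-time table configuration.
table_config = {
    "tableName": TABLE,
    "tableType": "REALTIME",
    "segmentsConfig": {
        "schemaName": TABLE,
        "timeColumnName": "order_ts",
        "replicasPerPartition": "1",
    },
    "tenants": {},
    "tableIndexConfig": {
        "loadMode": "MMAP",
        "streamConfigs": {
            "streamType": "kafka",
            "stream.kafka.topic.name": TOPIC,
            "stream.kafka.broker.list": BROKERS,
            "stream.kafka.consumer.type": "lowlevel",
            "stream.kafka.consumer.factory.class.name":
                "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
            "stream.kafka.decoder.class.name":
                "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
            "stream.kafka.consumer.prop.auto.offset.reset": "smallest",
        },
    },
    "metadata": {},
}


def build_upload_requests(controller_url: str):
    """Step 3: build (but do not send) POST requests for the schema and table."""
    def post(path: str, body: dict) -> urllib.request.Request:
        return urllib.request.Request(
            controller_url.rstrip("/") + path,
            data=json.dumps(body).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
    # Schema first, so that the table config can reference it.
    return [post("/schemas", schema), post("/tables", table_config)]


upload_requests = build_upload_requests("http://pinot-controller.example.com:9000")
for request in upload_requests:
    print(request.get_method(), request.full_url)
```

To actually upload, each request could be passed to urllib.request.urlopen, with whatever authentication headers your Pinot provider requires added first.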

For more detailed information, see the Apache Kafka guide in the Pinot documentation.

Connector starting state and offsets

A new sink connection starts reading from the Latest point in the source Decodable streams. This means that only data written to the streams after the connection has started is sent to the external system. When you start the connection, you can override this to Earliest if you want to send all the existing data on the source streams to the target system, along with all new data that arrives on the streams.

When you restart a sink connection it will continue to read data from the point it most recently stored in the checkpoint before the connection stopped. You can also opt to discard the connection’s state and restart it afresh from Earliest or Latest as described above.
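The behavior described above can be summarized as a small decision function. This is an illustrative model only, not Decodable code: the function name, its parameters, and the "Checkpoint" return value are hypothetical. It assumes that an override applies only when there is no stored state or when the state is discarded.

```python
from typing import Optional


def resolve_start_position(
    has_checkpoint: bool,
    discard_state: bool = False,
    override: Optional[str] = None,
) -> str:
    """Model where a sink connection begins reading from its source streams."""
    if override not in (None, "Earliest", "Latest"):
        raise ValueError(f"unsupported starting position: {override}")
    # A restarted connection resumes from its checkpoint unless state is discarded.
    if has_checkpoint and not discard_state:
        return "Checkpoint"
    # New connection, or state discarded: Latest unless overridden.
    return override or "Latest"


print(resolve_start_position(has_checkpoint=False))                        # Latest
print(resolve_start_position(has_checkpoint=False, override="Earliest"))   # Earliest
print(resolve_start_position(has_checkpoint=True))                         # Checkpoint
print(resolve_start_position(True, discard_state=True, override="Earliest"))  # Earliest
```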

Learn more about starting state here.