Apache Pinot sink connector

Apache Pinot is a real-time distributed OLAP data store, purpose-built to provide ultra low-latency analytics, even at extremely high throughput. It can ingest directly from streaming data sources - such as Apache Kafka and Amazon Kinesis - and make the events available for querying instantly. It can also ingest from batch data sources such as Hadoop HDFS, Amazon S3, Azure ADLS, and Google Cloud Storage.

At the heart of the system is a columnar store, with several smart indexing and pre-aggregation techniques for low latency. This makes Pinot an excellent fit for user-facing real-time analytics. At the same time, Pinot is also a great choice for other analytical use cases, such as internal dashboards, anomaly detection, and ad-hoc data exploration.

Features

Delivery guarantee

Exactly once (note that Pinot itself operates with at-least-once semantics)

Getting started

There are two ways of ingesting data to Pinot from Decodable:

  • The Decodable Pinot Connector, which pushes a finished Pinot segment directly to Pinot at each internal Decodable checkpoint.

    • This option is best for proofs of concept and certain moderate-throughput use cases. It is simpler in that you can get nearly immediate results in Pinot without dealing with an intermediate streaming provider.

    • For sustained use it might require some additional Pinot table configuration.

  • Indirectly via an intermediary streaming technology supported by both Decodable and Pinot, such as Amazon Kinesis, Kafka, or Apache Pulsar.

    • This option gives you more control at high throughput, including how and when Pinot creates segments from incoming data. This is a more typical operating mode for Pinot, and may be better supported by Pinot. However, it requires you to configure and manage the streaming service and its topics and partitions.

Option 1: The Decodable Pinot connector

The Decodable Pinot Connector pushes one Pinot Segment per Decodable checkpoint.

Note that this Connector uses a checkpoint interval of 5 minutes, rather than the 10 seconds used by default for most Connectors and all Pipelines. This supports Pinot’s need for larger segment sizes at longer intervals. It may still be valuable to configure Pinot and your Pinot Table with a rollup task; see below.

Prerequisites

The following directions assume you have:

  • A Decodable Account.

  • A Pinot instance.

  • A Pinot Table of type OFFLINE, with a corresponding Pinot Schema that matches the schema of the Connection you’ll create here.
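
If you are setting up a test environment, the following sketch shows one way to create a matching OFFLINE Table and Schema through the Pinot Controller REST API. The controller URL, table name, and field names are assumptions for illustration; adapt them to the schema of the Connection you’ll create.

```python
# Minimal sketch, assuming a Pinot Controller reachable at localhost:9000.
# The schema, table, and field names below are placeholders.
import requests

CONTROLLER = "http://localhost:9000"  # assumed Pinot Controller endpoint

schema = {
    "schemaName": "decodable_events",
    "dimensionFieldSpecs": [{"name": "user_id", "dataType": "STRING"}],
    "metricFieldSpecs": [{"name": "amount", "dataType": "DOUBLE"}],
    "dateTimeFieldSpecs": [{
        "name": "event_time",
        "dataType": "LONG",
        "format": "1:MILLISECONDS:EPOCH",
        "granularity": "1:MILLISECONDS",
    }],
}

table_config = {
    "tableName": "decodable_events",
    "tableType": "OFFLINE",  # the Decodable connector pushes segments to an OFFLINE table
    "segmentsConfig": {
        "timeColumnName": "event_time",
        "schemaName": "decodable_events",
        "replication": "1",
    },
    "tenants": {},
    "tableIndexConfig": {"loadMode": "MMAP"},
    "metadata": {},
}

# Upload the schema first, then the table config, via the controller REST API.
requests.post(f"{CONTROLLER}/schemas", json=schema).raise_for_status()
requests.post(f"{CONTROLLER}/tables", json=table_config).raise_for_status()
```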

Optional Pinot rollup task

To maintain Pinot query performance over time, this Table may need to be configured with a MergeRollupTask. Whether this is required depends on your use case and Pinot provider.

We recommend daily rollup as a first step, but other rollup settings may be appropriate for your use case.

Note that your Pinot provider must be configured to support this table configuration; otherwise it is ignored and no rollup occurs.

See the Apache Pinot documentation for details on configuring a MergeRollupTask.
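
As an illustration, the sketch below adds a daily rollup to an existing OFFLINE table by attaching a MergeRollupTask section to its table config through the controller REST API. The controller URL, table name, and rollup periods are assumptions; consult the Pinot documentation for the full set of MergeRollupTask options, and note that your cluster must run the minion task framework for the task to execute.

```python
# Hedged sketch: attach a daily MergeRollupTask to an existing OFFLINE table.
# Assumes the placeholder table from the prerequisites sketch above.
import requests

CONTROLLER = "http://localhost:9000"  # assumed Pinot Controller endpoint
TABLE = "decodable_events"            # assumed table name

# Fetch the current OFFLINE table config, attach the rollup task, and push it back.
table_config = requests.get(f"{CONTROLLER}/tables/{TABLE}").json()["OFFLINE"]
table_config["task"] = {
    "taskTypeConfigsMap": {
        "MergeRollupTask": {
            "daily.mergeType": "rollup",     # aggregate rows that share dimension values
            "daily.bucketTimePeriod": "1d",  # roll segments up into one-day buckets
            "daily.bufferTimePeriod": "1d",  # leave the most recent day untouched
        }
    }
}
requests.put(f"{CONTROLLER}/tables/{TABLE}", json=table_config).raise_for_status()
```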

Steps

If you want to use the Decodable CLI or API to create the connection, refer to the Property column below for the underlying property names. The connector name is pinot.
Property              Disposition  Description
url                   required     Pinot Controller endpoint URL
table.name            required     Name of the Pinot table, without the type suffix (OFFLINE)
table.type            optional     Typically OFFLINE, the default. May be REALTIME in some advanced cases.
auth.basic.username   required     Username for authentication to the Pinot Controller at url
auth.basic.password   required     Password to use with the username above
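
For reference, the properties might be collected as follows when scripting connection creation with the Decodable CLI or API. The values are placeholders; see the Decodable CLI documentation for the exact create command.

```python
# Placeholder property values for a Pinot sink connection (connector name: pinot).
# These mirror the property table above; substitute your own endpoint and credentials.
pinot_connection_properties = {
    "url": "http://pinot-controller.example.com:9000",  # Pinot Controller endpoint URL
    "table.name": "decodable_events",                    # without the type suffix
    "table.type": "OFFLINE",                             # optional; OFFLINE is the default
    "auth.basic.username": "pinot_user",
    "auth.basic.password": "********",
}
```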

Option 2: Indirect streaming through an external service

Sending a Decodable data stream to Pinot is accomplished in two stages: first, create a sink connector to a streaming system that Pinot can ingest from; then, add that system as a data source in your Pinot configuration. Decodable and Pinot mutually support several technologies, including the following:

  • Amazon Kinesis

  • Kafka

  • Pulsar

Example: Use Kafka as a sink

This example demonstrates using Kafka as the sink from Decodable and the source for Pinot. Sign in to Decodable Web and follow the configuration steps provided for the Apache Kafka sink connector to create a sink connector. For examples of using the command line tools or scripting, see the How To guides.

Create a Kafka data source in Pinot

Pinot has out-of-the-box real-time ingestion support for Kafka. Pinot lets users consume data from streams and push it directly into the database, in a process known as stream ingestion. Stream ingestion makes it possible to query data within seconds of publication, and it supports checkpointing to prevent data loss. Setting up stream ingestion involves the following steps (a sketch covering all three follows the list):

  1. Create the schema configuration. The schema defines the fields along with their data types, and whether each field serves as a dimension, metric, or timestamp.

  2. Create table configuration. The real-time table configuration consists of the following fields:

    • tableName, the name of the table where the data should flow.

    • tableType, the internal type for the table. Should always be set to REALTIME for real-time ingestion.

    • tableIndexConfig, defines which column to use for indexing along with the type of index. It has the following required fields:

      • loadMode, specifies how the segments should be loaded. Should be heap or mmap.

      • streamConfig, specifies the data source along with the necessary configurations to start consuming the real-time data. The streamConfig can be thought of as the equivalent of the job spec for batch ingestion.

  3. Upload table and schema spec. Once the table and schema configurations are created, they can be uploaded to the Pinot cluster. As soon as the configurations are uploaded, Pinot will start ingesting available records from the topic.
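
The sketch below walks through those three steps for a Kafka topic written by the Decodable sink connector. The controller URL, broker list, topic, and field names are assumptions; the streamConfigs keys shown are common Pinot Kafka settings, but check the Pinot documentation for the options your version supports.

```python
# Hedged sketch of the three stream-ingestion steps: schema, REALTIME table config
# with Kafka stream settings, and upload to the Pinot controller.
# Controller URL, broker list, topic, and field names are placeholders.
import requests

CONTROLLER = "http://localhost:9000"  # assumed Pinot Controller endpoint

# 1. Schema: fields, their data types, and their roles (dimension / metric / timestamp).
schema = {
    "schemaName": "decodable_events",
    "dimensionFieldSpecs": [{"name": "user_id", "dataType": "STRING"}],
    "metricFieldSpecs": [{"name": "amount", "dataType": "DOUBLE"}],
    "dateTimeFieldSpecs": [{
        "name": "event_time",
        "dataType": "LONG",
        "format": "1:MILLISECONDS:EPOCH",
        "granularity": "1:MILLISECONDS",
    }],
}

# 2. REALTIME table config; the Kafka source is described under streamConfigs
#    (the JSON key is plural) inside tableIndexConfig.
table_config = {
    "tableName": "decodable_events",
    "tableType": "REALTIME",
    "segmentsConfig": {
        "timeColumnName": "event_time",
        "schemaName": "decodable_events",
        "replicasPerPartition": "1",
    },
    "tenants": {},
    "tableIndexConfig": {
        "loadMode": "MMAP",
        "streamConfigs": {
            "streamType": "kafka",
            "stream.kafka.topic.name": "decodable-output",  # topic written by the Decodable Kafka sink
            "stream.kafka.broker.list": "kafka:9092",
            "stream.kafka.consumer.type": "lowlevel",
            "stream.kafka.consumer.factory.class.name":
                "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
            "stream.kafka.decoder.class.name":
                "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
        },
    },
    "metadata": {},
}

# 3. Upload both specs; Pinot starts consuming from the topic once they are accepted.
requests.post(f"{CONTROLLER}/schemas", json=schema).raise_for_status()
requests.post(f"{CONTROLLER}/tables", json=table_config).raise_for_status()
```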

For more detailed information, see the Apache Kafka guide in the Apache Pinot documentation.