Apache Hudi sink connector

Apache Hudi is a next-generation streaming data lake platform. Hudi brings core warehouse and database functionality directly to a data lake, providing tables, transactions, efficient upserts/deletes, advanced indexes, streaming ingestion services, data clustering/compaction optimizations, and concurrency control, all while keeping your data in open-source file formats.

Hudi can be used on any cloud storage platform. Its performance optimizations make analytical workloads faster with any of the popular query engines, including Apache Spark, Apache Flink, Presto, Trino, and Apache Hive.

If you are interested in a direct Decodable Connector for Hudi, contact support@decodable.co or join our Slack community and let us know!

Getting started

Sending a Decodable data stream to Hudi happens in two stages: first, create a sink connector to a system that Hudi supports as a data source, and then add that system to your Hudi configuration. Decodable and Hudi both support several technologies, including Apache Kafka.

Configure as a sink

This example uses Kafka as the sink for Decodable and the source for Hudi. Sign in to Decodable Web and follow the configuration steps in the Apache Kafka sink connector topic to create a sink connector. For examples of using the command line tools or scripting, see the How To guides.
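Before pointing Hudi at the topic, it can help to confirm that the Decodable sink is actually delivering records. The following minimal sketch reads from the topic with the standard Kafka Java client; the broker address localhost:9092, the topic name decodable-output, and the use of string deserialization are all assumptions to replace with the values from your connector configuration.

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class VerifyDecodableSink {
        public static void main(String[] args) {
            Properties props = new Properties();
            // Hypothetical broker address; use your cluster's bootstrap servers.
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "hudi-sink-verify");
            props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                // Hypothetical topic name; use the topic your Decodable sink writes to.
                consumer.subscribe(List.of("decodable-output"));
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(10));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d key=%s value=%s%n",
                            record.offset(), record.key(), record.value());
                }
            }
        }
    }

If records print, the Decodable side of the pipeline is working and you can proceed to configuring Hudi.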

Create Kafka data source

There are multiple ways of ingesting data streams into Hudi, including HoodieStreamer and Kafka Connect. For example, these are the steps for using Kafka Connect:

  1. Create the environment

  2. Set up the schema registry

  3. Create the Hudi control topic used to coordinate transactions

  4. Create the Hudi topic for the sink and insert data into it

  5. Run the sink connector worker

  6. Add the Hudi sink to the Connect worker, as shown in the sketch after this list

  7. Run async compaction and clustering, if scheduled

  8. Query the data via Hive
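For step 6, the Hudi sink is typically registered by POSTing a connector definition to the Kafka Connect REST API, which listens on port 8083 by default. The sketch below does that with Java's built-in HTTP client. The connector class name comes from Hudi's hudi-kafka-connect module, but the topic, table name, base path, and other property values here are placeholder assumptions; consult Hudi's Kafka Connect documentation for the authoritative set of configuration properties.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class RegisterHudiSink {
        public static void main(String[] args) throws Exception {
            // Connector definition as JSON. Topic, table name, base path, and
            // property values are placeholders; verify the supported properties
            // against the Hudi Kafka Connect docs for your Hudi version.
            String body = """
                {
                  "name": "hudi-sink",
                  "config": {
                    "connector.class": "org.apache.hudi.connect.HudiSinkConnector",
                    "tasks.max": "1",
                    "topics": "hudi-test-topic",
                    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
                    "value.converter": "org.apache.kafka.connect.storage.StringConverter",
                    "value.converter.schemas.enable": "false",
                    "hoodie.table.name": "hudi_test",
                    "hoodie.base.path": "file:///tmp/hudi/hudi_test",
                    "hoodie.kafka.commit.interval.secs": "60"
                  }
                }
                """;

            HttpClient client = HttpClient.newHttpClient();
            // Kafka Connect's REST API listens on port 8083 by default.
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://localhost:8083/connectors"))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(body))
                    .build();

            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode() + " " + response.body());
        }
    }

A 201 response indicates the connector was created; a GET request to the same /connectors endpoint lists the connectors currently registered with the worker.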

For more detailed information, see Hudi’s Kafka Connect documentation.