Apache Hudi is a next-generation streaming data lake platform that brings core warehouse and database functionality directly to a data lake. Hudi provides tables, transactions, efficient upserts and deletes, advanced indexes, streaming ingestion services, data clustering and compaction optimizations, and concurrency control, all while keeping your data in open source file formats.
Apache Hudi can be used on any cloud storage platform. Hudi's performance optimizations make analytical workloads faster with any of the popular query engines, including Apache Spark, Flink, Presto, Trino, and Hive.
Sending a Decodable data stream to Hudi is accomplished in two stages: first, create a sink connector to a data source that Hudi supports; then, add that data source to your Hudi configuration. Decodable and Hudi both support several technologies, including Apache Kafka.
This example uses Kafka as the sink from Decodable and the source for Hudi. Sign in to the Decodable Web Console and follow the configuration steps provided for the Kafka Connector to create a sink connector. For examples of using the command line tools or scripting, see the How To guides.
There are multiple ways to ingest data streams into Hudi, including DeltaStreamer and Kafka Connect. For example, the steps for using Kafka Connect are:
- Create the environment
- Set up the schema registry
- Create the Hudi Control Topic for coordination of the transactions
- Create the Hudi Topic for the Sink and insert data into the topic
- Run the Sink connector worker
- Add the Hudi Sink to the Connector
- Run async compaction and clustering if scheduled
- Query via Hive
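As a rough illustration of the "Add the Hudi Sink to the Connector" step, the sketch below builds a connector definition and submits it to the Kafka Connect REST API (`POST /connectors`). Everything specific here is an assumption rather than something this page prescribes: the helper names `build_hudi_sink_config` and `register_connector` are hypothetical, the topic name, base path, field names, hosts, and ports are placeholders, and the `hoodie.*` property names follow the sample sink configuration in Hudi's Kafka Connect documentation as best recalled, so check them against the Hudi version you run.

```python
import json
import urllib.request


def build_hudi_sink_config(topic, base_path, record_key, partition_field,
                           bootstrap_servers="localhost:9092",
                           schema_registry_url="http://localhost:8081"):
    """Build a Kafka Connect payload for a Hudi sink.

    Hypothetical helper: all defaults are placeholders, and the hoodie.*
    keys are assumptions to verify against your Hudi release.
    """
    return {
        "name": "hudi-sink",
        "config": {
            "bootstrap.servers": bootstrap_servers,
            "connector.class": "org.apache.hudi.connect.HoodieSinkConnector",
            "tasks.max": "1",
            "key.converter": "org.apache.kafka.connect.storage.StringConverter",
            "value.converter": "org.apache.kafka.connect.storage.StringConverter",
            # The Kafka topic that the Decodable sink connector writes to.
            "topics": topic,
            "hoodie.table.name": topic,
            "hoodie.table.type": "MERGE_ON_READ",
            "hoodie.base.path": base_path,
            "hoodie.datasource.write.recordkey.field": record_key,
            "hoodie.datasource.write.partitionpath.field": partition_field,
            # Schema registry settings; key names may differ between versions.
            "hoodie.schemaprovider.class":
                "org.apache.hudi.schema.SchemaRegistryProvider",
            "hoodie.deltastreamer.schemaprovider.registry.url":
                f"{schema_registry_url}/subjects/{topic}/versions/latest",
        },
    }


def register_connector(payload, connect_url="http://localhost:8083"):
    """POST the connector definition to a running Kafka Connect worker."""
    request = urllib.request.Request(
        f"{connect_url}/connectors",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Print the payload for review; nothing is sent to a worker here.
    example = build_hudi_sink_config(
        topic="hudi-test-topic",
        base_path="file:///tmp/hoodie/hudi-test-topic",
        record_key="volume",
        partition_field="date",
    )
    print(json.dumps(example, indent=2))
```

Once the Connect worker from the earlier step is running, calling `register_connector(example)` would submit the definition. Keeping payload construction separate from the HTTP call makes it easy to review or version the configuration before anything is sent.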
For more detailed information, please refer to Hudi's Kafka Connect documentation.