Delta Lake sink connector

Delta Lake is an open source project that enables building a Lakehouse architecture on top of data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing on top of existing data lakes, such as Amazon S3, Azure Data Lake Storage (ADLS), Google Cloud Storage (GCS), and HDFS. Specifically, Delta Lake offers:

  • ACID transactions on Spark

  • Scalable metadata handling

  • Streaming and batch unification

  • Schema enforcement

  • Upserts and deletes

Features

Delivery guarantee: exactly once

The Delta Lake connector streams data in Delta Lake format to an S3 bucket in your AWS account. To use it, configure an AWS IAM Role as described below, with specific permissions to write to the bucket.

For more detailed information about configuring Delta Lake, see the Delta Lake Quickstart guide and related documentation.

Prerequisites

To keep your data secure, you, AWS, and Decodable each play a part in ensuring that only Delta Lake connections in your Decodable account can write data to your S3 bucket.

How?

AWS IAM provides a mechanism called ExternalId, which you and Decodable use as described here. It ensures that Decodable accesses your bucket only on behalf of your Decodable account:

  • You’ll create and configure an IAM Role with two Policies:

    • A Trust Policy allowing access from Decodable’s AWS account, but only with an ExternalId matching your (unique) Decodable account name.

    • A Permissions Policy with the needed permissions on your bucket.

  • You’ll provide us the ARN of this Role via your Decodable Delta Lake connection’s s3.role-arn property.

  • Our servers will assume that Role using an ExternalId value matching only your Decodable account name, never any other. We’ll use that Role to talk to your bucket.

Note that the values here aren’t treated as secret (by us, AWS, or you): not ExternalId (your account name), not the Role ARN, not the bucket name.

Specifically, the IAM Role you reference in s3.role-arn must:

  • have an AssumeRole Trust Policy that:

    • names Decodable’s AWS account ID (671293015970) as Principal.

    • has a Condition requiring sts:ExternalId to equal your Decodable account name.

  • have a Permissions Policy allowing the needed operations on the bucket ARN (not the Role ARN) and the S3 key (path).
    The Policy Actions are:

    • s3:GetObject

    • s3:PutObject

    • s3:DeleteObject

    • s3:ListBucket

    • s3:PutObjectAcl

For a full discussion from AWS of the security problem this solves, and the AWS-recommended solution using ExternalId, see: AWS Identity and Access Management • The confused deputy problem.
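
The effect of the ExternalId condition can be illustrated with a toy model. The sketch below is not how AWS evaluates policies internally; it only mimics the two checks described above (the caller must be the trusted account, and the supplied ExternalId must match). The function name would_allow_assume_role, the variable example_policy, and the account names are hypothetical and chosen for illustration.

```python
def would_allow_assume_role(trust_policy, caller_account_id, external_id):
    """Toy check: does an Allow statement trust this caller with this ExternalId?

    Simplified illustration only; real IAM evaluation covers many more cases.
    """
    for stmt in trust_policy.get("Statement", []):
        if stmt.get("Effect") != "Allow" or stmt.get("Action") != "sts:AssumeRole":
            continue
        # The caller must be the account named as Principal.
        principal = stmt.get("Principal", {}).get("AWS", "")
        if principal != f"arn:aws:iam::{caller_account_id}:root":
            continue
        # The supplied ExternalId must be one of the allowed values.
        allowed = stmt.get("Condition", {}).get("StringEquals", {}).get("sts:ExternalId")
        if allowed is None:
            continue  # this toy model insists on an ExternalId condition
        allowed_ids = [allowed] if isinstance(allowed, str) else list(allowed)
        if external_id in allowed_ids:
            return True
    return False


example_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::671293015970:root"},
        "Action": "sts:AssumeRole",
        "Condition": {"StringEquals": {"sts:ExternalId": "my-decodable-account"}},
    }],
}

# Same trusted AWS account, but only the matching account name is accepted.
print(would_allow_assume_role(example_policy, "671293015970", "my-decodable-account"))  # True
print(would_allow_assume_role(example_policy, "671293015970", "someone-else"))          # False
```

The second call shows the point of the mechanism: a request coming from the same trusted AWS account is still refused when it carries a different account name as ExternalId.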

Example trust policy

Here’s an example IAM Trust Policy. Replace my-decodable-account with your Decodable account name. Note that 671293015970 is Decodable’s AWS account ID and must match exactly.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::671293015970:root"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "my-decodable-account"
        }
      }
    }
  ]
}

To allow several Decodable accounts (say, in different AWS Regions) to write to the same bucket, use an array of account names for the ExternalId value:

{ "sts:ExternalId": ["my-acct-1", "my-acct-2"] }
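
If you generate the Trust Policy from a script, a small helper can produce both the single-account and multi-account forms. This is a minimal sketch using only the Python standard library; the function name build_trust_policy is hypothetical, and the output still has to be attached to your IAM Role by whatever tooling you use.

```python
import json

DECODABLE_AWS_ACCOUNT_ID = "671293015970"  # Decodable's AWS account ID, as given on this page


def build_trust_policy(account_names):
    """Return a Trust Policy dict for one or more Decodable account names."""
    # Accept a single name as a plain string as well as a list of names.
    names = [account_names] if isinstance(account_names, str) else list(account_names)
    if not names:
        raise ValueError("at least one Decodable account name is required")
    # One name becomes a string, several names become an array.
    external_id = names[0] if len(names) == 1 else names
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": f"arn:aws:iam::{DECODABLE_AWS_ACCOUNT_ID}:root"},
            "Action": "sts:AssumeRole",
            "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
        }],
    }


print(json.dumps(build_trust_policy(["my-acct-1", "my-acct-2"]), indent=2))
```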

Here’s an example IAM Permissions Policy. Replace your-bucket (twice) and /some/dir appropriately. The path (here: /some/dir) can be empty to write S3 objects at the bucket root, but the trailing /* is required. Because s3:ListBucket is a bucket-level action, it is granted on the bucket ARN itself rather than on the object path.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject", "s3:PutObjectAcl"],
      "Resource": "arn:aws:s3:::your-bucket/some/dir/*"
    },
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::your-bucket"
    }
  ]
}
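
The path handling in the Permissions Policy (optional directory, mandatory trailing /*) is easy to get wrong by hand, so here is a minimal sketch that builds the policy from a bucket name and an optional path. The function name build_permissions_policy is hypothetical. As an assumption of this sketch, s3:ListBucket is placed in its own statement on the bucket ARN, since S3 treats it as a bucket-level action, while the object actions are granted on the object path.

```python
import json


def build_permissions_policy(bucket, path=""):
    """Return a Permissions Policy dict for the given bucket and optional path."""
    prefix = path.strip("/")  # "/some/dir" -> "some/dir"; "" stays ""
    # The trailing /* is always present, even when the path is empty.
    if prefix:
        object_arn = f"arn:aws:s3:::{bucket}/{prefix}/*"
    else:
        object_arn = f"arn:aws:s3:::{bucket}/*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:PutObject", "s3:GetObject",
                           "s3:DeleteObject", "s3:PutObjectAcl"],
                "Resource": object_arn,
            },
            {
                # Bucket-level action, so the resource is the bucket itself.
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
            },
        ],
    }


print(json.dumps(build_permissions_policy("your-bucket", "/some/dir"), indent=2))
```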

Connector properties

To create the connection with the Decodable CLI or API, use the underlying property names shown in the Property column. The connector name is delta-lake.

Property      Disposition   Description
table-path    required      Path of the table in your S3 bucket, using the s3a scheme. Example: s3a://my-bucket/table_name
s3.role-arn   required      AWS ARN of the IAM Role configured as described above. Example: arn:aws:iam::111222333444:role/decodable-delta-access
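
Before creating the connection, it can help to sanity-check the two property values locally. The sketch below is a hypothetical helper, not part of the Decodable CLI or API; the function name check_delta_lake_properties is made up for illustration, and the ARN pattern assumes the standard aws partition with a 12-digit account ID.

```python
import re

# Assumption: standard "aws" partition and a 12-digit account ID.
ROLE_ARN_RE = re.compile(r"^arn:aws:iam::\d{12}:role/.+$")


def check_delta_lake_properties(props):
    """Return a list of problems found in the connection properties (empty if none)."""
    problems = []
    # Both properties are listed as required.
    for key in ("table-path", "s3.role-arn"):
        if not props.get(key):
            problems.append(f"missing required property: {key}")
    table_path = props.get("table-path", "")
    if table_path and not table_path.startswith("s3a://"):
        problems.append("table-path should use the s3a scheme, e.g. s3a://my-bucket/table_name")
    role_arn = props.get("s3.role-arn", "")
    if role_arn and not ROLE_ARN_RE.match(role_arn):
        problems.append("s3.role-arn does not look like an IAM Role ARN")
    return problems


print(check_delta_lake_properties({
    "table-path": "s3a://my-bucket/table_name",
    "s3.role-arn": "arn:aws:iam::111222333444:role/decodable-delta-access",
}))  # []
```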

Supported data types

The Delta Lake connector supports the data types listed here.