Use the Amazon S3 Connector to send data from Decodable to an Amazon S3 bucket. This updated version of the Amazon S3 Connector allows you to configure a file rollover policy and has enhanced security over the original Amazon S3 Connector.

Overview

Connector name: s3-v2
Type: sink
Delivery guarantee: exactly once

Create a connection to Amazon S3 using the Amazon S3 Connector

Prerequisites

Before you can create the Amazon S3 connection, you must have an Identity and Access Management (IAM) role with the following policies. See the Setting up an IAM User section for more information.

  • A Trust Policy that allows access from Decodable’s AWS account. The ExternalId must match your Decodable account name.
  • A Permissions Policy with read and write permissions for the destination bucket.

Setting up an IAM User

The following is an example of the Trust Policy. Replace <MY_DECODABLE_ACCOUNT_NAME> with your Decodable account name. In the example, 671293015970 is Decodable’s AWS account ID and must not be changed.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::671293015970:root"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "<MY_DECODABLE_ACCOUNT_NAME>"
        }
      }
    }
  ]
}

Note: To allow several Decodable accounts (for example, in different AWS Regions) to write to the same bucket, use an array of account names for the ExternalId value:

{ "sts:ExternalId": ["my-acct-1", "my-acct-2"] }

See "The confused deputy problem" in the AWS Identity and Access Management documentation for more information about why the ExternalId value is required.
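When several Decodable accounts share one role, it can be convenient to generate the trust policy rather than edit it by hand. The following Python sketch builds the same JSON document shown above. The helper name build_trust_policy and the sample account names are illustrative assumptions and are not part of any Decodable or AWS tooling.

```python
import json

# Decodable's AWS account ID, taken from the trust policy example above.
DECODABLE_AWS_ACCOUNT_ID = "671293015970"


def build_trust_policy(account_names):
    """Return a trust policy dict for one or more Decodable account names.

    Hypothetical helper, for illustration only.
    """
    if not account_names:
        raise ValueError("at least one Decodable account name is required")
    # A single name is written as a string, several names as an array.
    if len(account_names) == 1:
        external_id = account_names[0]
    else:
        external_id = list(account_names)
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "AWS": f"arn:aws:iam::{DECODABLE_AWS_ACCOUNT_ID}:root"
                },
                "Action": "sts:AssumeRole",
                "Condition": {
                    "StringEquals": {"sts:ExternalId": external_id}
                },
            }
        ],
    }


if __name__ == "__main__":
    # Sample account names; replace them with your own.
    print(json.dumps(build_trust_policy(["my-acct-1", "my-acct-2"]), indent=2))
```

The printed JSON can be pasted into the trust relationship of the IAM role.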

The IAM role must also have read and write permissions on the destination S3 bucket. The following policy lists the required permissions; replace <YOUR_BUCKET> and /some/dir with your own values. To send data to the root level of the bucket, leave the path blank but keep the trailing /*, which gives arn:aws:s3:::<YOUR_BUCKET>/*.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::<YOUR_BUCKET>/some/dir/*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::<YOUR_BUCKET>"
    }
  ]
}
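The root-level case is easy to get wrong when editing the Resource value by hand. As a minimal sketch, the Python function below produces the permissions policy for either a directory or the bucket root. The function name build_permissions_policy and the bucket name in the usage line are assumptions made for illustration.

```python
import json


def build_permissions_policy(bucket, directory=""):
    """Return the permissions policy dict for a bucket and optional directory.

    Hypothetical helper, for illustration only.
    """
    # Strip slashes so "some/dir", "/some/dir/" and "" are all handled.
    prefix = directory.strip("/")
    if prefix:
        object_resource = f"arn:aws:s3:::{bucket}/{prefix}/*"
    else:
        # Root level: blank path, trailing /* kept.
        object_resource = f"arn:aws:s3:::{bucket}/*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
                "Resource": object_resource,
            },
            {
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
        ],
    }


if __name__ == "__main__":
    # "my-bucket" is a placeholder bucket name.
    print(json.dumps(build_permissions_policy("my-bucket", "/some/dir"), indent=2))
```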

Steps

  1. If you have an existing connection that uses the older version of the Amazon S3 Connector and want to switch to the newest version, complete the following substeps first. Otherwise, skip to step 2.
    1. Stop the existing Amazon S3 connection.
    2. Stop any pipelines that are using it.
  2. From the Connections page, select the Amazon S3 Connector and complete the following fields.
Field | Description
AWS Region | The AWS Region that your S3 bucket is located in, for example us-west-2. If not specified, defaults to your Decodable account region.
Path | The file path to the bucket or directory that you want to send data to. For example, s3://bucket/directory.
IAM Role ARN | The AWS ARN of the IAM role. For example, arn:aws:iam::111222333444:role/decodable-s3-access.
Partition Template | The field names that you want to use to partition your data. For example, to partition your data by the datetime field, enter datetime. See the S3 Object Key Partitioning section for more information.
Value Format | The format for data in the Amazon S3 destination. You can select one of the following: JSON, Parquet, or Raw.
Rolling Policy: File Size | The maximum file size in Amazon S3. If a file reaches this size while Decodable is streaming data to it, the file closes and a new file with the same object name prefix is created.
Rolling Policy: Interval | The maximum amount of time that a file in an S3 bucket can stay open. If a file has been open for this length of time while Decodable is streaming data to it, the file closes and a new file with the same object name prefix is created.
Timestamp Format | The format to use when encoding timestamps as JSON strings. Only applicable when the Value Format is JSON.
Compression | The compression algorithm to use for serialization. Only applicable when the Value Format is Parquet.
  3. Select which stream contains the records that you’d like to send to Amazon S3. Then, select Next.
  4. Give the newly created connection a Name and Description and select Save.
  5. If you are replacing an existing Amazon S3 connection, then restart any pipelines that were processing data for the previous connection.

Properties

The following properties are supported by the S3 connector.

Property | Required? | Description
path | required | The file path to the bucket or directory that you want to send data to. For example, s3://bucket/directory.
role-arn | required | The AWS ARN of the IAM role.
region | optional | The AWS Region that the S3 bucket is in. Defaults to the Decodable account region.
format | required | The format for data in the Amazon S3 destination. You can select one of the following: JSON, Parquet, Raw.
partition-cols | optional | The field names that you want to use to partition your data. For example, to partition your data by the datetime field, enter datetime. See the S3 Object Key Partitioning section for more information.
sink.rolling-policy.file-size | optional | The maximum file size in Amazon S3. If a file reaches this size while Decodable is streaming data to it, the file closes and a new file with the same object name prefix is created.
sink.rolling-policy.rollover-interval | optional | The maximum amount of time that a file in an S3 bucket can stay open. If a file has been open for this length of time while Decodable is streaming data to it, the file closes and a new file with the same object name prefix is created.
auto-compaction | optional | When enabled, compacts many small files into fewer large files. Defaults to false.
compaction.file-size | optional | The target file size. This is the maximum size that a compacted file can be.
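Before creating a connection, it can help to check a set of properties against the table above. The Python sketch below is one way to do that locally. The function check_properties is a hypothetical helper, the lowercase value raw is assumed by analogy with format=json and format=parquet, and the s3:// prefix check is inferred from the documented example rather than from a stated rule.

```python
# Property names are taken from the table above.
REQUIRED = ("path", "role-arn", "format")
OPTIONAL = (
    "region",
    "partition-cols",
    "sink.rolling-policy.file-size",
    "sink.rolling-policy.rollover-interval",
    "auto-compaction",
    "compaction.file-size",
)
# "raw" in lowercase is an assumption; the table lists JSON, Parquet, Raw.
FORMATS = ("json", "parquet", "raw")


def check_properties(props):
    """Return a list of problems; an empty list means none were found.

    Hypothetical helper, for illustration only.
    """
    problems = []
    for name in REQUIRED:
        if not props.get(name):
            problems.append(f"missing required property: {name}")
    fmt = str(props.get("format", "")).lower()
    if fmt and fmt not in FORMATS:
        problems.append(f"unsupported format: {fmt}")
    path = props.get("path", "")
    if path and not path.startswith("s3://"):
        problems.append("path should look like s3://bucket/directory")
    for name in props:
        if name in REQUIRED or name in OPTIONAL:
            continue
        # Format-specific properties use the format name as a prefix.
        prefix = name.split(".", 1)[0]
        if prefix in ("json", "parquet"):
            if prefix != fmt:
                problems.append(f"{name} only applies when format={prefix}")
        else:
            problems.append(f"unknown property: {name}")
    return problems


if __name__ == "__main__":
    example = {
        "path": "s3://bucket/directory",
        "role-arn": "arn:aws:iam::111222333444:role/decodable-s3-access",
        "format": "json",
        "partition-cols": "datetime",
    }
    print(check_properties(example))
```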

JSON Format Properties

The following properties are only applicable when format=json.

Property | Required? | Description
json.timestamp-format.standard | optional | The timestamp format for TIMESTAMP and TIMESTAMP_LTZ types. Options are SQL and ISO-8601. Defaults to ISO-8601. SQL writes timestamps in "yyyy-MM-dd HH:mm:ss.SSS" format, e.g. "2020-12-30 12:13:14.123". ISO-8601 writes timestamps in "yyyy-MM-ddTHH:mm:ss.SSS" format, e.g. "2020-12-30T12:13:14.123".
json.encode.decimal-as-plain-number | optional | Must be true or false. Defaults to false. When true, numbers are always encoded without scientific notation. For example, a number encoded as 2.7E-8 by default is encoded as 0.000000027.
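To make the two JSON options concrete, the following Python sketch reproduces the documented output shapes using only the standard library. It illustrates the formats and is not the connector's own encoder; the function names format_timestamp and format_decimal are assumptions.

```python
from datetime import datetime
from decimal import Decimal


def format_timestamp(ts, standard="ISO-8601"):
    """Render a datetime the way each timestamp standard is documented."""
    if standard == "SQL":
        # "yyyy-MM-dd HH:mm:ss.SSS": a space between date and time.
        return ts.isoformat(sep=" ", timespec="milliseconds")
    if standard == "ISO-8601":
        # "yyyy-MM-ddTHH:mm:ss.SSS": a "T" between date and time.
        return ts.isoformat(timespec="milliseconds")
    raise ValueError(f"unknown timestamp standard: {standard}")


def format_decimal(value, plain=False):
    """Render a Decimal with or without scientific notation."""
    return format(value, "f") if plain else str(value)


if __name__ == "__main__":
    ts = datetime(2020, 12, 30, 12, 13, 14, 123000)
    print(format_timestamp(ts, "SQL"))
    print(format_timestamp(ts, "ISO-8601"))
    print(format_decimal(Decimal("2.7E-8")))
    print(format_decimal(Decimal("2.7E-8"), plain=True))
```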

Parquet Format Properties

The following properties are only applicable when format=parquet.

Property | Description
parquet.compression | Options are SNAPPY, GZIP and LZO. Defaults to no compression.

Other parquet options are available. See ParquetOutputFormat for more information.

S3 Object Key Partitioning

You can partition the S3 Object key paths by value using the partition-cols connection property. When you specify a field to partition with, that field will be used as a prefix to organize the data in the S3 bucket.

For example, a common pattern is to partition by date (e.g. 2023-01-01) and then by hour of the day. Given a schema with datetime=DATE and hour=INTEGER fields, setting the partition-cols property to datetime,hour produces bucket entries like the following:

s3://my-path/datetime=2023-01-01/hour=0/
s3://my-path/datetime=2023-01-01/hour=1/
...
s3://my-path/datetime=2023-01-01/hour=23/
s3://my-path/datetime=2023-01-02/hour=0/

Note: The fields that are used as partition columns will be removed from the resulting payload in the file in Amazon S3. If you are using any query systems downstream that are relying on those fields, you will need to configure them to read the value from the file path instead.
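The partitioning behavior described in this section can be sketched in a few lines of Python: the partition columns become key=value path segments, and the same columns are dropped from the payload that is written to the file. The function partition_record and the sample record are assumptions made for illustration, not the connector's implementation.

```python
def partition_record(record, partition_cols):
    """Split a record into its partition prefix and the remaining payload.

    Hypothetical helper, for illustration only.
    """
    # Each partition column contributes one "<name>=<value>" path segment,
    # in the order the columns are listed.
    prefix = "/".join(f"{col}={record[col]}" for col in partition_cols)
    # Partition columns are removed from the payload written to the file.
    payload = {k: v for k, v in record.items() if k not in partition_cols}
    return prefix, payload


if __name__ == "__main__":
    record = {"datetime": "2023-01-01", "hour": 0, "user_id": "abc"}
    prefix, payload = partition_record(record, ["datetime", "hour"])
    print(f"s3://my-path/{prefix}/")
    print(payload)
```

A downstream reader that needs the datetime or hour value would parse it back out of the prefix.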

S3 Object Key Formation

The following is an example of what your object keys look like in Amazon S3. Let’s assume that your Amazon S3 Connection has the following configuration:

  • S3 bucket or path: my-awesome-bucket
  • Format: JSON
  • Partition template or partition-cols: datetime
  • sink.rolling-policy.rollover-interval: 5 minutes
  • Compression is not set.

When you start the connection, the Amazon S3 Connector opens a file with a name like part-123e4567-e89b-12d3-a456-426614174000-0.json in the datetime=2023-01-25 subfolder of the my-awesome-bucket S3 bucket and starts streaming data to that file. After 5 minutes have elapsed, the part-123e4567-e89b-12d3-a456-426614174000-0.json file is closed and a new file named part-123e4567-e89b-12d3-a456-426614174000-1.json is opened. The Amazon S3 Connector then sends data to this newly opened file instead.

In summary, the S3 object parts are joined as: <path>/<partition-col>=<value>/part-<unique-id>-<N>
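The rollover behavior can be approximated with a small Python simulation. It is a simplified sketch that assumes a file rolls over as soon as the interval or the size limit is reached when the next record arrives; the real connector may close files at different moments. The function assign_object_keys, its parameters, and the .json suffix (taken from the JSON example above) are illustrative assumptions.

```python
import uuid


def assign_object_keys(events, path, partition, rollover_seconds, max_bytes,
                       unique_id=None):
    """Return the object key that each event would be written to.

    events is a list of (arrival_time_in_seconds, size_in_bytes) tuples
    in arrival order. Hypothetical helper, for illustration only.
    """
    unique_id = unique_id or str(uuid.uuid4())
    keys = []
    part = 0          # the <N> counter in part-<unique-id>-<N>
    opened_at = None  # arrival time of the first event in the open file
    written = 0       # bytes written to the open file so far
    for arrival, size in events:
        if opened_at is None:
            opened_at = arrival
        elif arrival - opened_at >= rollover_seconds or written >= max_bytes:
            # Close the current file and open the next part.
            part += 1
            opened_at = arrival
            written = 0
        written += size
        keys.append(f"{path}/{partition}/part-{unique_id}-{part}.json")
    return keys


if __name__ == "__main__":
    # Four records over a little more than five minutes, 5-minute rollover.
    events = [(0, 10), (120, 10), (300, 10), (310, 10)]
    for key in assign_object_keys(events, "s3://my-awesome-bucket",
                                  "datetime=2023-01-25", 300, 1000):
        print(key)
```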