Amazon S3 source connector

Features

Connector name

s3-v2

Delivery guarantee

Exactly once

Supported task sizes

S, M, L, LM2, LM4 (without SQS)

M, L, LM2, LM4 (with SQS)

Multiplex capability

A single instance of this connector can read from all files in a single S3 bucket

Supported stream types

Append Stream

For better performance, enable an Amazon SQS (Simple Queue Service) queue so that the connector can receive notifications about when new files are written to the S3 bucket.

There is no defined order of ingestion for the files read from S3.

Configuration properties

Property Description Required Default

Property	Description	Required	Default
`region`	The AWS region that your S3 bucket is located in. For example, `us-west-2`.	—	If not specified, defaults to your Decodable account region.
`path`	The file path to the bucket or directory that you want to read data from. For example, `s3://bucket/directory`.	Yes
`role-arn`	The AWS ARN of the IAM role that has permissions to access the S3 bucket. For example, `arn:aws:iam::111222333444:role/decodable-s3-access`.	Yes
`format`	The format of data to read from S3. Must be one of the following: `json` `parquet` `avro`	Yes
`partition-cols`	The field names that partition the data. For example, if you want to partition your data based on the `datetime` field, then enter `datetime`.	—
JSON-specific configuration
`json.timestamp-format.standard`	Specify the timestamp format for `TIMESTAMP` and `TIMESTAMP_LTZ` types. `SQL` will use a `yyyy-MM-dd HH:mm:ss.SSS` format, e.g "2020-12-30 12:13:14.123" `ISO-8601` will parse input `TIMESTAMP` in `yyyy-MM-ddTHH:mm:ss.SSS` format, e.g "2020-12-30T12:13:14.123"	—	`SQL`
Reading data from S3
`source.monitor-interval`	(without SQS) How often, in seconds, to scan the S3 bucket for new files. (with SQS) How often, in seconds, events are polled from the queue.	—	`10`
`source.sqs-url`	The SQS queue URL.	Only applicable if you are connecting to an SQS-enabled S3 bucket.
`source.scan-on-startup`	Whether to scan the specified bucket for existing files, before relying on SQS for future file notifications. `true`: the connection scans the S3 bucket and ingests any historical data it contains, as well as all new data that arrives. `false`: only new data received after the SQS queue begins to receive file notifications is ingested.	true

region

The AWS region that your S3 bucket is located in. For example, us-west-2.

—

If not specified, defaults to your Decodable account region.

path

The file path to the bucket or directory that you want to read data from.

For example, s3://bucket/directory.

Yes

role-arn

The AWS ARN of the IAM role that has permissions to access the S3 bucket.

For example, arn:aws:iam::111222333444:role/decodable-s3-access.

Yes

format

The format of data to read from S3. Must be one of the following:

json
parquet
avro

Yes

partition-cols

The field names that partition the data.

For example, if you want to partition your data based on the datetime field, then enter datetime.

—

JSON-specific configuration

json.timestamp-format.standard

Specify the timestamp format for TIMESTAMP and TIMESTAMP_LTZ types.

SQL will use a yyyy-MM-dd HH:mm:ss.SSS format, e.g "2020-12-30 12:13:14.123"
ISO-8601 will parse input TIMESTAMP in yyyy-MM-ddTHH:mm:ss.SSS format, e.g "2020-12-30T12:13:14.123"

—

SQL

Reading data from S3

source.monitor-interval

(without SQS) How often, in seconds, to scan the S3 bucket for new files.
(with SQS) How often, in seconds, events are polled from the queue.

—

10

source.sqs-url

The SQS queue URL.

Only applicable if you are connecting to an SQS-enabled S3 bucket.

source.scan-on-startup

Whether to scan the specified bucket for existing files, before relying on SQS for future file notifications.

true: the connection scans the S3 bucket and ingests any historical data it contains, as well as all new data that arrives.
false: only new data received after the SQS queue begins to receive file notifications is ingested.

true

Prerequisites

Access to your AWS resources

Decodable interacts with resources in AWS on your behalf. To do this you need an IAM role configured with a trust policy that allows access from Decodable’s AWS account, and a permission policy as detailed below.

For more details on how this works, how to configure the trust policy, and example steps to follow see here.

To use this connector you must associate a permissions policy with the IAM role. This policy must have the following permissions:

Read access on the S3 bucket path from which you’re reading data.
```
s3:GetObject
```
If you want to read data directly at the root level of the bucket, then leave the path blank with the trailing /* included.
List access on the bucket from which you’re reading data
```
s3:ListBucket
```

Sample Permission Policy

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::my_bucket"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::my_bucket/some/dir/*"
    }

  ]
}

Detecting bucket changes with SQS

By default, the Amazon S3 connector monitors files in the specified bucket by scanning the entire bucket at an interval set by the user. However, for S3 buckets with a large number of files, this approach can be costly and inefficient. An alternative solution is to configure the Amazon S3 connector to detect and pick up new files using Amazon SQS. This queue must be configured to receive event notifications for new files from the S3 bucket.

When this option is enabled, the connector does an initial full bucket scan of the S3 bucket after it’s started for the first time or every time state is discarded on startup. Following this initial scan, the connector exclusively discovers new files to read from the file notifications arriving in the SQS queue. This option is recommended for long-lived and high-scale workloads.

During connection creation, you can disable the initial full bucket scan by setting the source.scan-on-startup property to false.

The following prerequisites apply if you want to get data from an SQS-enabled S3 bucket:

You must have a standard queue type with the following access policy.

{
  "Version": "2012-10-17",
  "Id": "example-ID",
  "Statement": [
    {
      "Sid": "example-statement-ID",
      "Effect": "Allow",
      "Principal": {
        "Service": "s3.amazonaws.com"
      },
      "Action": "SQS:SendMessage",
      "Resource": "arn:aws:sqs:<YOUR_SQS_QUEUE_ARN_HERE>",
      "Condition": {
        "ArnLike": {
          "aws:SourceArn": "arn:aws:s3:*:*:*"
        }
      }
    }
  ]
}

Update your IAM role to include a permissions policy for SQS.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Statement1",
      "Effect": "Allow",
      "Action": [
        "sqs:ReceiveMessage",
        "sqs:DeleteMessage"
      ],
      "Resource": [
        "arn:aws:sqs:<YOUR_SQS_QUEUE_ARN_HERE>"
      ]
    }
  ]
}

Configure your S3 bucket to send event notification messages to the SQS queue you configured in Step 1 whenever new files arrive that match your path prefix.
1. When configuring event notifications, make sure that you select the check box for All object create events under Event Types. For the best performance, don’t select any other event types. See Enabling and configuring event notifications using the Amazon S3 console in the AWS documentation for information on how to enable notification messages.
  
  Once activated, this connection consumes and deletes messages from your queue during file processing. It’s not recommended for other applications to share this same SQS queue.

Connector starting state and offsets

When you create a connection it will by default read the entire contents of the S3 bucket. If you are using SQS you can customize this behavior with the configuration parameter source.scan-on-startup.

Learn more about starting state here.