AWS S3 (Legacy)
Note: You are viewing documentation for the legacy version of the AWS S3 Connector. See AWS S3 for the latest version.
Amazon Simple Storage Service (Amazon S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance. Customers of all sizes and industries can use Amazon S3 to store and protect any amount of data for a range of use cases, such as data lakes, websites, mobile applications, backup and restore, archive, enterprise applications, IoT devices, and big data analytics. Amazon S3 provides management features so that you can optimize, organize, and configure access to your data to meet your specific business, organizational, and compliance requirements.
- Storage and access management
- Storage logging and monitoring
- Analytics and insights
- Strong consistency
Getting Started
Connections come in two flavors: source and sink. Source connections read from an external system and write to a Decodable stream, while sink connections read from a stream and write to an external system.
Amazon S3 connectors can only be used in the sink role.
Configure As A Sink
To create and configure a connector for S3, sign in to the Decodable Web Console, navigate to the Connections tab, click on New Connection, and follow the steps below. For examples of using the command line tools or scripting, see the How To guides.
- The connector type defaults to sink, since that is the only option for S3 connectors.
- Specify the AWS region of your S3 bucket, for example us-west-2. If not specified, it defaults to your Decodable Account region.
- Specify the name of your S3 bucket.
- Optionally provide your S3 object key (path) prefix. Amazon S3 has a flat structure instead of a hierarchy like you would see in a file system. However, for the sake of organizational simplicity, you can use the folder concept as a means of grouping objects.
- Specify the AWS ARN of the IAM role. For example, arn:aws:iam::111222333444:role/decodable-s3-access.
- Optionally provide a partition template. See below for details.
- Select a data format used to deserialize and serialize the keys and values, which can be one of the following:
  - JSON: the JSON format reads and writes JSON data based on a JSON schema. When using JSON, you must also provide:
    - the format to be used when encoding timestamps as JSON strings, which can be either SQL or ISO-8601
    - whether or not to encode decimals as plain numbers
  - Parquet: Apache Parquet is a columnar storage format that is optimized for fast retrieval of data. When using Parquet, you must also provide:
    - the compression algorithm to use for serialization, which can be one of the following: none, SNAPPY, GZIP, or LZO
  - Raw: the Raw format reads and writes raw (byte-based) values as a single column.
For more detailed information about Amazon S3, see the S3 Getting Started guide and related documentation.
Reference
Connector name | s3 |
Type | sink |
Delivery guarantee | exactly once |
The S3 connector streams data to an S3 bucket in your AWS account. To use it, configure an AWS IAM Role as described below, with specific permissions to write to the bucket.
Properties
The following properties are supported by the S3 connector.
Property | Disposition | Description |
---|---|---|
region | optional | Region of S3 bucket. Defaults to Decodable Account region. Example: us-west-2 |
bucket | required | Name of S3 bucket. Example: my-bucket |
directory | optional | S3 key (path) prefix, used as a directory regardless of leading or trailing slashes (/). |
format | required | Must be json, parquet, or raw. |
role-arn | required | AWS ARN of the IAM Role configured as described below. Example: arn:aws:iam::111222333444:role/decodable-s3-access |
partition-template | optional | For value-based S3 object key partitioning. See below for details. |
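As a rough illustration of the table above, the following Python sketch checks a set of connector properties for the required keys and an allowed format. The function name check_s3_properties and the dict-based representation are assumptions made for this example; they are not part of Decodable's tooling.

```python
# Hypothetical helper: validates a dict of S3 connector properties
# against the "required" rows of the properties table.
ALLOWED_FORMATS = {"json", "parquet", "raw"}
REQUIRED = ("bucket", "format", "role-arn")


def check_s3_properties(props):
    """Return a list of problems found in a dict of connector properties."""
    problems = [
        f"missing required property: {name}"
        for name in REQUIRED
        if not props.get(name)
    ]
    fmt = props.get("format")
    if fmt and fmt not in ALLOWED_FORMATS:
        problems.append(f"format must be json, parquet, or raw (got {fmt!r})")
    return problems


example = {
    "region": "us-west-2",  # optional
    "bucket": "my-bucket",
    "directory": "some/dir",  # optional
    "format": "json",
    "role-arn": "arn:aws:iam::111222333444:role/decodable-s3-access",
}
print(check_s3_properties(example))
```

An empty list means every required property is present and the format is one of the three supported values.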
JSON Format Properties
When using format=json, the following properties are optionally allowed:
Property | Disposition | Description |
---|---|---|
json.timestamp-format.standard | optional | Specify the timestamp format for TIMESTAMP and TIMESTAMP_LTZ types. Defaults to ISO-8601. |
json.encode.decimal-as-plain-number | optional | Must be true or false, defaults to false. When true, always encode numbers without scientific notation. For example, a number encoded 2.7E-8 by default would be encoded 0.000000027. |
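To make the effect of json.encode.decimal-as-plain-number concrete, here is a small Python sketch that reproduces the two renderings described in the table. It only mimics the described behavior with Python's decimal module; the helper name encode_decimal is hypothetical and this is not the connector's actual encoder.

```python
from decimal import Decimal


def encode_decimal(value: Decimal, plain: bool = False) -> str:
    """Render a decimal with or without scientific notation.

    plain=False stands in for the default setting,
    plain=True for json.encode.decimal-as-plain-number=true.
    """
    return format(value, "f") if plain else str(value)


small = Decimal("2.7E-8")
print(encode_decimal(small))              # scientific notation
print(encode_decimal(small, plain=True))  # plain digits
```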
Parquet Format Properties
When using format=parquet, the following properties are optionally allowed:
Property | Description |
---|---|
parquet.compression | Options are SNAPPY, GZIP, and LZO. Defaults to no compression. |
Other parquet options are also available. Refer to ParquetOutputFormat for more information.
S3 Object Key Formation
The S3 object parts are joined as: <directory>/<partition-key>/<object-name>.<format>
The computed object-name includes a (wallclock) timestamp in milliseconds, followed by a random string.
In the final computed S3 object key, any series of contiguous slashes, such as ///, is reduced to a single slash /.
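The key formation rules can be sketched in Python as follows. This is an approximation under stated assumptions: the function names are hypothetical, the "-" separator between the timestamp and the random string is a guess (the documentation only says one follows the other), and stripping a leading slash from the final key is assumed from the statements that start and end slashes do not matter.

```python
import re
import secrets
import time


def make_object_name() -> str:
    """Wallclock timestamp in milliseconds followed by a random string.

    The "-" separator and the 8-character hex suffix are assumptions.
    """
    return f"{int(time.time() * 1000)}-{secrets.token_hex(4)}"


def build_object_key(directory: str, partition_key: str,
                     object_name: str, fmt: str) -> str:
    """Join the parts and reduce runs of slashes to a single slash."""
    joined = f"{directory}/{partition_key}/{object_name}.{fmt}"
    collapsed = re.sub(r"/{2,}", "/", joined)
    # Assumption: the final key carries no leading slash.
    return collapsed.lstrip("/")


print(build_object_key("/some/dir/", "year=2022/month=01/day=02",
                       make_object_name(), "json"))
```

Because empty parts simply produce adjacent slashes, an empty directory or partition key needs no special handling: the slash-collapsing step absorbs them.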
IAM Role, Permissions, and Security
To keep your data secure, you, AWS, and Decodable work together to ensure that only S3 connections in your Decodable Account can write data to your S3 bucket.
How?
AWS IAM provides a special mechanism, called ExternalId, that you and Decodable will use as described here. It ensures that access from Decodable to your bucket happens only for your Decodable Account. Like this:
- You'll create and configure an IAM Role with two Policies:
  - A Trust Policy allowing access from Decodable's AWS account, but only with an ExternalId matching your (unique) Decodable account name.
  - A Permissions Policy with the needed permissions on your bucket.
- You'll provide us the ARN of this Role via your Decodable S3 connection's role-arn property.
- Our servers will assume that Role using an ExternalId value matching only your Decodable Account name, never any other. We'll use that to talk to your bucket.
Note that the values here are not treated as secret (by us, AWS, or you): not ExternalId (your account name), not the Role ARN, not the bucket name.
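For readers unfamiliar with ExternalId, the sketch below shows what an AssumeRole call carrying an ExternalId looks like from the caller's side. It is an illustration only, not Decodable's implementation: the function name and the session name are hypothetical, and sts_client is assumed to be an AWS STS client such as the one returned by boto3.client("sts").

```python
def assume_customer_role(sts_client, role_arn: str, account_name: str) -> dict:
    """Assume the customer's IAM Role, passing the account name as ExternalId.

    AWS rejects the call unless the Role's Trust Policy allows the caller's
    AWS account and its Condition matches the ExternalId sent here.
    """
    response = sts_client.assume_role(
        RoleArn=role_arn,
        RoleSessionName="s3-sink-example",  # hypothetical session name
        ExternalId=account_name,
    )
    # Temporary credentials: AccessKeyId, SecretAccessKey, SessionToken.
    return response["Credentials"]


# Usage, assuming boto3 is installed and AWS credentials are configured:
#   import boto3
#   creds = assume_customer_role(
#       boto3.client("sts"),
#       "arn:aws:iam::111222333444:role/decodable-s3-access",
#       "my-decodable-account",
#   )
```

Since the ExternalId is fixed to one account name per call, a Role ARN supplied by a different account fails the Trust Policy condition, which is the point of the mechanism.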
Specifically, your IAM Role (the one named by the role-arn property) must:
- have an AssumeRole Trust Policy that:
  - names Decodable's AWS account ID (671293015970) as Principal.
  - has a Condition requiring sts:ExternalId to equal your Decodable Account name.
- have a Permissions Policy allowing the needed operations on the bucket (not Role) ARN and (wildcardable) S3 key (path). The Policy Actions are:
  - s3:GetObject
  - s3:PutObject
  - s3:DeleteObject
  - s3:ListBucket (on the bucket only)
For example
Here's an example IAM Trust Policy. Replace my-decodable-account with your Decodable Account name. Note that 671293015970 is Decodable's AWS account ID and must match exactly.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::671293015970:root"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "my-decodable-account"
        }
      }
    }
  ]
}
Note: To allow several Decodable Accounts (say, in different AWS Regions) to write to the same bucket, use an array of Account names for the ExternalId value:
{ "sts:ExternalId": ["my-acct-1", "my-acct-2"] }
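If you generate the Trust Policy from a script, a minimal Python sketch might look like the following. The function name trust_policy is hypothetical; the structure mirrors the example policy above, with a single string for one account and an array for several.

```python
import json

# Decodable's AWS account ID, as given in the documentation above.
DECODABLE_AWS_ACCOUNT_ID = "671293015970"


def trust_policy(account_names):
    """Build a Trust Policy dict for one or more Decodable Account names."""
    names = list(account_names)
    if not names:
        raise ValueError("at least one Decodable Account name is required")
    # One name is written as a plain string, several as an array.
    external_id = names[0] if len(names) == 1 else names
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "AWS": f"arn:aws:iam::{DECODABLE_AWS_ACCOUNT_ID}:root"
                },
                "Action": "sts:AssumeRole",
                "Condition": {
                    "StringEquals": {"sts:ExternalId": external_id}
                },
            }
        ],
    }


print(json.dumps(trust_policy(["my-acct-1", "my-acct-2"]), indent=2))
```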
Here's an example IAM Permissions Policy. Replace your-bucket (twice) and /some/dir appropriately. Note that the path (here: /some/dir) can be blank to put S3 objects at the bucket root path, but the trailing /* is required.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::your-bucket/some/dir/*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::your-bucket"
    }
  ]
}
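The rule about the blank path and the required trailing /* is easy to get wrong by hand, so here is a small Python sketch that builds the same Permissions Policy for a given bucket and path. The function name permissions_policy is an assumption for this example.

```python
import json


def permissions_policy(bucket: str, path: str = "") -> dict:
    """Build a Permissions Policy dict for a bucket and optional key prefix."""
    prefix = path.strip("/")
    # A blank path targets the bucket root; the trailing /* is always kept.
    if prefix:
        object_resource = f"arn:aws:s3:::{bucket}/{prefix}/*"
    else:
        object_resource = f"arn:aws:s3:::{bucket}/*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
                "Resource": object_resource,
            },
            {
                # ListBucket applies to the bucket itself, not to object keys.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
        ],
    }


print(json.dumps(permissions_policy("your-bucket", "/some/dir"), indent=2))
```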
Further reading — from AWS
For a full discussion from AWS of the security problem this solves, and the AWS-recommended solution using ExternalId, we recommend reading:
AWS Identity and Access Management • The confused deputy problem.
Value-Based S3 Object Key Partitioning
You can partition the S3 object key paths by value using the partition-template connection property.
Each S3 object in a given partition (as expressed in the key) will have the same values for all referenced columns. References to TIMESTAMP-type columns are rendered using a format argument that follows the pattern syntax of Java's DateTimeFormatter.
Column references are delimited by curly braces: { ... }, with an optional argument delimited by a colon (:). The argument is only used with references to TIMESTAMP-type columns.
For example:
- {some_value}/{another_value}
- before/{some_value}/after
- {a_timestamp:yyyy/MM/dd}
- some_value={some_value} (for hive)
- {a_timestamp:'year'=yyyy/'month'=MM/'day'=dd} (for hive)
- {a_timestamp} (same as above, the default)
Slashes at start and end are irrelevant in the final computed S3 object key.
The default TIMESTAMP reference format argument is:
'year'=yyyy/'month'=MM/'day'=dd
The partition-template property is optional, and defaults to an empty string.
Supported types
Only the following SQL data types are supported.
Type family | Types |
---|---|
Character String | STRING , CHAR , VARCHAR |
Integer Numeric | TINYINT , SMALLINT , INTEGER , BIGINT |
Timestamp | TIMESTAMP , TIMESTAMP_LTZ |
Examples with data
Given a connection/stream schema and a record with values:
Column | Type | Value |
---|---|---|
event_at | timestamp | 2022-01-02 03:04:05 UTC |
method | string | GET |
code | int | 418 |
The partition key would be generated as in the following examples.
partition-template | Generated partition key |
---|---|
{method} | GET |
before/{method}/after | before/GET/after |
method={method} | method=GET |
{method}/{code} | GET/418 |
method={method}/code={code} | method=GET/code=418 |
{event_at} | year=2022/month=01/day=02 |
{event_at:yyyy/MM/dd/HH/mm} | 2022/01/02/03/04 |
{event_at:'year'=yyyy/'month'=MM} | year=2022/month=01 |
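The table above can be reproduced with a short Python sketch of the template expansion. This is an approximation, not the connector's code: the function names are hypothetical, and only a small subset of Java's DateTimeFormatter pattern letters (yyyy, MM, dd, HH, mm, ss) plus single-quoted literals is handled.

```python
import re
from datetime import datetime, timezone

# Default format argument for TIMESTAMP column references.
DEFAULT_TS_FORMAT = "'year'=yyyy/'month'=MM/'day'=dd"

# Assumption: only this subset of DateTimeFormatter letters is supported.
_FIELDS = {
    "yyyy": lambda t: f"{t.year:04d}",
    "MM": lambda t: f"{t.month:02d}",
    "dd": lambda t: f"{t.day:02d}",
    "HH": lambda t: f"{t.hour:02d}",
    "mm": lambda t: f"{t.minute:02d}",
    "ss": lambda t: f"{t.second:02d}",
}

# A quoted literal, a run of one repeated letter, or any other character.
_TOKEN = re.compile(r"'([^']*)'|([A-Za-z])\2*|(.)", re.DOTALL)
# A column reference: {name} or {name:format-argument}.
_REF = re.compile(r"\{([^{}:]+)(?::([^{}]*))?\}")


def format_timestamp(ts: datetime, pattern: str) -> str:
    """Render ts with a (subset of a) DateTimeFormatter-style pattern."""
    out = []
    for m in _TOKEN.finditer(pattern):
        if m.group(1) is not None:      # text inside single quotes
            out.append(m.group(1))
        elif m.group(2) is not None:    # pattern letters such as yyyy
            run = m.group(0)
            if run not in _FIELDS:
                raise ValueError(f"unsupported pattern letters: {run}")
            out.append(_FIELDS[run](ts))
        else:                           # separators such as / or =
            out.append(m.group(3))
    return "".join(out)


def render_partition_key(template: str, record: dict) -> str:
    """Replace each column reference with the record's value."""
    def substitute(m):
        value = record[m.group(1)]
        if isinstance(value, datetime):
            return format_timestamp(value, m.group(2) or DEFAULT_TS_FORMAT)
        return str(value)
    return _REF.sub(substitute, template)


record = {
    "event_at": datetime(2022, 1, 2, 3, 4, 5, tzinfo=timezone.utc),
    "method": "GET",
    "code": 418,
}
for template in ("method={method}/code={code}", "{event_at}",
                 "{event_at:yyyy/MM/dd/HH/mm}"):
    print(template, "->", render_partition_key(template, record))
```

Text outside the curly braces passes through unchanged, which is why templates such as before/{method}/after work without extra syntax.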