Amazon S3 (Legacy) sink connector

This connector has been deprecated and will be removed in the future. Use the Amazon S3 sink connector for new S3 connections. The new connector includes support for configuring a file rollover policy and has enhanced security.

Amazon Simple Storage Service (Amazon S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance. Customers of all sizes and industries can use Amazon S3 to store and protect any amount of data for a range of use cases, such as data lakes, websites, mobile applications, archives, enterprise applications, IoT devices, and big data analytics, in addition to backup and restore operations. Amazon S3 provides management features so that you can optimize, organize, and configure access to your data to meet your specific business, organizational, and compliance requirements, including:

- Storage and access management
- Storage logging and monitoring
- Analytics and insights
- Strong consistency

Getting started

There are two types of Decodable connections: source connections and sink connections. Source connections read from an external system and write to a Decodable stream, while sink connections read from a stream and write to an external system. S3 connectors can only be used in the sink role.

Prerequisites

Access to your AWS resources

Decodable interacts with resources in AWS on your behalf. To do this, you need an IAM role configured with a trust policy that allows access from Decodable’s AWS account, and a permission policy as detailed below. For more details on how this works, how to configure the trust policy, and example steps to follow, see here.

To use this connector, you must associate a permissions policy with the IAM role. This policy must have the following permissions:

- Read/write access to the S3 bucket path to which you’re writing data:
  - s3:PutObject
  - s3:GetObject
  - s3:DeleteObject
  If you want to write data directly at the root level of the bucket, leave the path blank but keep the trailing /*.
- List access on the bucket to which you’re writing data:
  - s3:ListBucket

Sample Permission Policy

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::my_bucket/some/dir/*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::my_bucket"
    }
  ]
}
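Before creating the connection, you may want to confirm that the role actually grants the permissions listed above. The following is a minimal sketch using the AWS SDK for Python (boto3), not anything Decodable itself runs. It assumes the placeholder role ARN, bucket, and prefix from the sample policy, and it assumes your own credentials are allowed to assume the role (the trust policy described above only needs to trust Decodable’s account, so you may have to add yourself temporarily for a test like this).

# Hypothetical smoke test: assume the IAM role and exercise the S3 permissions
# the connector needs (PutObject, GetObject, DeleteObject, ListBucket).
# The role ARN, bucket, and prefix below are placeholders from the sample policy.
import boto3

ROLE_ARN = "arn:aws:iam::111222333444:role/decodable-s3-access"  # placeholder
BUCKET = "my_bucket"                                              # placeholder
PREFIX = "some/dir"                                               # placeholder

sts = boto3.client("sts")
creds = sts.assume_role(
    RoleArn=ROLE_ARN, RoleSessionName="s3-permission-check"
)["Credentials"]

s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)

key = f"{PREFIX}/permission-check.txt"
s3.put_object(Bucket=BUCKET, Key=key, Body=b"ok")     # s3:PutObject
s3.get_object(Bucket=BUCKET, Key=key)                 # s3:GetObject
s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)      # s3:ListBucket
s3.delete_object(Bucket=BUCKET, Key=key)              # s3:DeleteObject
print("role has the required S3 permissions")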
Configure as a sink

To create and configure a connector for S3, sign in to Decodable Web, navigate to the Connections tab, click New Connection, and follow the steps below. For examples of using the command line tools or scripting, see the How To guides.

1. The connector type defaults to sink, since that’s the only option for S3 connectors.
2. Specify the AWS region of your S3 bucket, for example us-west-2. If not specified, it defaults to your Decodable account region.
3. Specify the name of your S3 bucket.
4. Optionally, provide your S3 object key (path) prefix. Amazon S3 has a flat structure rather than a hierarchy like you would see in a file system; however, for the sake of organizational simplicity, you can use the folder concept as a means of grouping objects.
5. Specify the AWS ARN of the IAM role, for example arn:aws:iam::111222333444:role/decodable-s3-access.
6. Optionally, provide a partition template. See Value-based S3 object key partitioning for details.
7. Select the data format used to deserialize and serialize the keys and values, which can be one of the following:
   - JSON: reads and writes JSON data based on a JSON schema. When using JSON, you must also provide:
     - the format to use when encoding timestamps as JSON strings, which can be either SQL or ISO-8601
     - whether or not to encode decimals as plain numbers
   - Parquet: Apache Parquet is a columnar storage format that’s optimized for fast retrieval of data. When using Parquet, you must also provide:
     - the compression algorithm to use for serialization, which can be one of the following: none, SNAPPY, GZIP, or LZO
   - Raw: reads and writes raw (byte-based) values as a single column.

For more detailed information about Amazon S3, see the S3 Getting Started guide and related documentation.

Reference

Delivery guarantee: exactly once.

The S3 connector streams data to an S3 bucket in your AWS account. To use it, configure an AWS IAM role as described in the Prerequisites above, with specific permissions to write to the bucket.

Properties

The following properties are supported by the S3 connector.

| Property           | Disposition | Description |
| ------------------ | ----------- | ----------- |
| region             | optional    | Region of the S3 bucket. Defaults to the Decodable account region. Example: us-west-2 |
| bucket             | required    | Name of the S3 bucket. Example: my-bucket |
| directory          | optional    | S3 object key (path) prefix, used as a directory. Leading and trailing slashes (/) are ignored. |
| format             | required    | Must be json, parquet, or raw. |
| role-arn           | required    | AWS ARN of the IAM role configured as described in the Prerequisites above. Example: arn:aws:iam::111222333444:role/decodable-s3-access |
| partition-template | optional    | For value-based S3 object key partitioning. See Value-based S3 object key partitioning for details. |

JSON format properties

When using format=json, the following optional properties are supported:

| Property                            | Disposition | Description |
| ----------------------------------- | ----------- | ----------- |
| json.timestamp-format.standard      | optional    | Specifies the timestamp format for TIMESTAMP and TIMESTAMP_LTZ types. Defaults to ISO-8601. SQL uses the yyyy-MM-dd HH:mm:ss.SSS format, e.g. "2020-12-30 12:13:14.123". ISO-8601 uses the yyyy-MM-ddTHH:mm:ss.SSS format, e.g. "2020-12-30T12:13:14.123". |
| json.encode.decimal-as-plain-number | optional    | Must be true or false; defaults to false. When true, numbers are always encoded without scientific notation. For example, a number encoded as 2.7E-8 by default would be encoded as 0.000000027. |

Parquet format properties

When using format=parquet, the following optional property is supported:

| Property            | Description |
| ------------------- | ----------- |
| parquet.compression | Options are SNAPPY, GZIP, and LZO. Defaults to no compression. |

Other Parquet options are also available. Refer to ParquetOutputFormat for more information.

S3 object key formation

The S3 object key parts are joined as:

<directory>/<partition-key>/<object-name>.<format>

The computed object name includes a (wall-clock) timestamp in milliseconds, followed by a random string. In the final computed S3 object key, any series of contiguous slashes, such as ///, is reduced to a single slash (/).
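As an illustration of the key layout only, here is a short sketch that assembles a key from a hypothetical directory, partition key, and format, collapsing repeated slashes as described above. The helper name build_object_key and the exact shape of the object name (timestamp plus random suffix) are assumptions for illustration, not the connector’s implementation.

# Illustrative sketch only: assembles a key of the form
# <directory>/<partition-key>/<object-name>.<format>, collapsing
# contiguous slashes to a single slash. The object-name scheme is an
# assumed shape (wall-clock milliseconds plus a random suffix).
import re
import time
import uuid

def build_object_key(directory: str, partition_key: str, fmt: str) -> str:
    object_name = f"{int(time.time() * 1000)}-{uuid.uuid4().hex[:8]}"
    raw = f"{directory}/{partition_key}/{object_name}.{fmt}"
    return re.sub(r"/+", "/", raw).strip("/")  # reduce /// to / and drop outer slashes

print(build_object_key("some/dir/", "/method=GET/code=418", "json"))
# e.g. some/dir/method=GET/code=418/1672531200000-3f9c2a1b.json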
Value-based S3 object key partitioning

You can partition the S3 object key paths by value using the partition-template connection property. Each S3 object in a given partition (as expressed in the key) will have the same values for all referenced columns. Column references for TIMESTAMP-type values are extracted via a format argument that uses the syntax of Java’s DateTimeFormatter.

Column references are delimited by curly braces: { ... }, with an optional argument delimited by a colon (:). The argument is only used with references to TIMESTAMP-type columns.

For example:

- {some_value}/{another_value}
- before/{some_value}/after
- {a_timestamp:yyyy/MM/dd}
- some_value={some_value} (for Hive)
- {a_timestamp:'year'=yyyy/'month'=MM/'day'=dd} (for Hive)
- {a_timestamp} (same as above, the default)

Slashes at the start and end are irrelevant in the final computed S3 object key. The default TIMESTAMP reference format argument is:

'year'=yyyy/'month'=MM/'day'=dd

The partition-template property is optional and defaults to an empty string.

Supported types

Only the following SQL data types are supported.

| Type family      | Types |
| ---------------- | ----- |
| Character String | STRING, CHAR, VARCHAR |
| Integer Numeric  | TINYINT, SMALLINT, INTEGER, BIGINT |
| Timestamp        | TIMESTAMP, TIMESTAMP_LTZ |

Examples with data

Given a connection/stream schema and a record with the following values:

| Column   | Type      | Value |
| -------- | --------- | ----- |
| event_at | timestamp | 2022-01-02 03:04:05 UTC |
| method   | string    | GET |
| code     | int       | 418 |

the partition key would be generated as in the following examples.

| partition-template                 | Generated partition key |
| ---------------------------------- | ----------------------- |
| {method}                           | GET |
| before/{method}/after              | before/GET/after |
| method={method}                    | method=GET |
| {method}/{code}                    | GET/418 |
| method={method}/code={code}        | method=GET/code=418 |
| {event_at}                         | year=2022/month=01/day=02 |
| {event_at:yyyy/MM/dd/HH/mm}        | 2022/01/02/03/04 |
| {event_at:'year'=yyyy/'month'=MM}  | year=2022/month=01 |
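To make the template-to-key mapping concrete, here is a rough Python sketch that expands a template against the record above. It is not the connector’s implementation: it only emulates the DateTimeFormatter pattern letters that appear in these examples (yyyy, MM, dd, HH, mm) plus single-quoted literals, and the helper names are hypothetical.

# Rough illustration of how a partition-template expands against a record.
# Handles only the subset of Java DateTimeFormatter syntax used in the
# examples above (yyyy, MM, dd, HH, mm, and quoted literals such as 'year').
import re
from datetime import datetime, timezone

DEFAULT_TS_FORMAT = "'year'=yyyy/'month'=MM/'day'=dd"

def format_timestamp(value: datetime, pattern: str) -> str:
    # Translate the supported DateTimeFormatter tokens one at a time.
    out, i = [], 0
    while i < len(pattern):
        if pattern[i] == "'":                      # quoted literal, e.g. 'year'
            j = pattern.index("'", i + 1)
            out.append(pattern[i + 1:j])
            i = j + 1
        elif pattern.startswith("yyyy", i):
            out.append(f"{value.year:04d}"); i += 4
        elif pattern.startswith("MM", i):
            out.append(f"{value.month:02d}"); i += 2
        elif pattern.startswith("dd", i):
            out.append(f"{value.day:02d}"); i += 2
        elif pattern.startswith("HH", i):
            out.append(f"{value.hour:02d}"); i += 2
        elif pattern.startswith("mm", i):
            out.append(f"{value.minute:02d}"); i += 2
        else:
            out.append(pattern[i]); i += 1
    return "".join(out)

def expand(template: str, record: dict) -> str:
    # Replace each {column} or {column:pattern} reference with the record value.
    def repl(match: re.Match) -> str:
        column, _, pattern = match.group(1).partition(":")
        value = record[column]
        if isinstance(value, datetime):
            return format_timestamp(value, pattern or DEFAULT_TS_FORMAT)
        return str(value)
    return re.sub(r"\{([^}]+)\}", repl, template)

record = {
    "event_at": datetime(2022, 1, 2, 3, 4, 5, tzinfo=timezone.utc),
    "method": "GET",
    "code": 418,
}

print(expand("method={method}/code={code}", record))   # method=GET/code=418
print(expand("{event_at:yyyy/MM/dd/HH/mm}", record))   # 2022/01/02/03/04
print(expand("{event_at}", record))                    # year=2022/month=01/day=02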