A List of S3A Configuration Properties
The following fs.s3a configuration properties are available. To override these default S3A settings, add your configuration to your core-site.xml.
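For example, to raise the connection pool size you could add an entry like the following to core-site.xml (the value shown is purely illustrative):
<property>
  <name>fs.s3a.connection.maximum</name>
  <value>30</value>
</property>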
<property>
<name>fs.s3a.access.key</name>
<description>AWS access key ID used by S3A file system. Omit for IAM role-based or provider-based authentication.</description>
</property>
<property>
<name>fs.s3a.secret.key</name>
<description>AWS secret key used by S3A file system. Omit for IAM role-based or provider-based authentication.</description>
</property>
<property>
<name>fs.s3a.aws.credentials.provider</name>
<description>
Comma-separated class names of credential provider classes which implement
com.amazonaws.auth.AWSCredentialsProvider.
These are loaded and queried in sequence for a valid set of credentials.
Each listed class must implement one of the following means of
construction, which are attempted in order:
1. a public constructor accepting java.net.URI and
org.apache.hadoop.conf.Configuration,
2. a public static method named getInstance that accepts no
arguments and returns an instance of
com.amazonaws.auth.AWSCredentialsProvider, or
3. a public default constructor.
Specifying org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider allows
anonymous access to a publicly accessible S3 bucket without any credentials.
Please note that allowing anonymous access to an S3 bucket compromises
security and therefore is unsuitable for most use cases. It can be useful
for accessing public data sets without requiring AWS credentials.
If unspecified, then the default list of credential provider classes,
queried in sequence, is:
1. org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider: supports static
configuration of AWS access key ID and secret access key. See also
fs.s3a.access.key and fs.s3a.secret.key.
2. com.amazonaws.auth.EnvironmentVariableCredentialsProvider: supports
configuration of AWS access key ID and secret access key in
environment variables named AWS_ACCESS_KEY_ID and
AWS_SECRET_ACCESS_KEY, as documented in the AWS SDK.
3. com.amazonaws.auth.InstanceProfileCredentialsProvider: supports use
of instance profile credentials if running in an EC2 VM.
</description>
</property>
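As a sketch, a core-site.xml entry selecting an explicit provider chain, here the session-token provider followed by the environment-variable provider, could look like this:
<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider,com.amazonaws.auth.EnvironmentVariableCredentialsProvider</value>
</property>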
<property>
<name>fs.s3a.session.token</name>
<description>Session token, when using org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider
as one of the providers.
</description>
</property>
<property>
<name>fs.s3a.security.credential.provider.path</name>
<value/>
<description>
Optional comma-separated list of credential providers, a list
which is prepended to that set in hadoop.security.credential.provider.path.
</description>
</property>
<property>
<name>fs.s3a.assumed.role.arn</name>
<value/>
<description>
AWS ARN for the role to be assumed.
Required if the fs.s3a.aws.credentials.provider contains
org.apache.hadoop.fs.s3a.AssumedRoleCredentialProvider
</description>
</property>
<property>
<name>fs.s3a.assumed.role.session.name</name>
<value/>
<description>
Session name for the assumed role, must be valid characters according to
the AWS APIs.
Only used if AssumedRoleCredentialProvider is the AWS credential provider.
If not set, one is generated from the current Hadoop/Kerberos username.
</description>
</property>
<property>
<name>fs.s3a.assumed.role.policy</name>
<value/>
<description>
JSON policy to apply to the role.
Only used if AssumedRoleCredentialProvider is the AWS credential provider.
</description>
</property>
<property>
<name>fs.s3a.assumed.role.session.duration</name>
<value>30m</value>
<description>
Duration of assumed roles before a refresh is attempted.
Only used if AssumedRoleCredentialProvider is the AWS credential provider.
Range: 15m to 1h
</description>
</property>
<property>
<name>fs.s3a.assumed.role.sts.endpoint</name>
<value/>
<description>
AWS Security Token Service (STS) endpoint. If unset, uses the default endpoint.
Only used if AssumedRoleCredentialProvider is the AWS credential provider.
</description>
</property>
<property>
<name>fs.s3a.assumed.role.credentials.provider</name>
<value>org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider</value>
<description>
List of credential providers to authenticate with the STS endpoint and
retrieve short-lived role credentials.
Only used if AssumedRoleCredentialProvider is the AWS credential provider.
If unset, uses "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider".
</description>
</property>
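Taken together, a minimal assumed-role setup might look roughly like the following; the role ARN is a placeholder and the provider class name is as given above:
<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>org.apache.hadoop.fs.s3a.AssumedRoleCredentialProvider</value>
</property>
<property>
  <name>fs.s3a.assumed.role.arn</name>
  <value>arn:aws:iam::123456789012:role/example-role</value>
</property>
<property>
  <name>fs.s3a.assumed.role.session.duration</name>
  <value>1h</value>
</property>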
<property>
<name>fs.s3a.connection.maximum</name>
<value>15</value>
<description>Controls the maximum number of simultaneous connections to S3.</description>
</property>
<property>
<name>fs.s3a.connection.ssl.enabled</name>
<value>true</value>
<description>Enables or disables SSL connections to S3.</description>
</property>
<property>
<name>fs.s3a.endpoint</name>
<description>AWS S3 endpoint to connect to. An up-to-date list is
provided in the AWS Documentation: regions and endpoints. Without this
property, the standard endpoint (s3.amazonaws.com) is used.
</description>
</property>
<property>
<name>fs.s3a.path.style.access</name>
<value>false</value>
<description>Enable S3 path-style access, i.e., disable the default virtual hosting behaviour.
Useful for S3A-compliant storage providers as it removes the need to set up DNS for virtual hosting.
</description>
</property>
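For a third-party S3-compatible store, these two properties are typically set together; the endpoint below is a placeholder, not a real service:
<property>
  <name>fs.s3a.endpoint</name>
  <value>https://object-store.example.com</value>
</property>
<property>
  <name>fs.s3a.path.style.access</name>
  <value>true</value>
</property>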
<property>
<name>fs.s3a.proxy.host</name>
<description>Hostname of the (optional) proxy server for S3 connections.</description>
</property>
<property>
<name>fs.s3a.proxy.port</name>
<description>Proxy server port. If this property is not set
but fs.s3a.proxy.host is, port 80 or 443 is assumed (consistent with
the value of fs.s3a.connection.ssl.enabled).
</description>
</property>
<property>
<name>fs.s3a.proxy.username</name>
<description>Username for authenticating with proxy server.</description>
</property>
<property>
<name>fs.s3a.proxy.password</name>
<description>Password for authenticating with proxy server.</description>
</property>
<property>
<name>fs.s3a.proxy.domain</name>
<description>Domain for authenticating with proxy server.</description>
</property>
<property>
<name>fs.s3a.proxy.workstation</name>
<description>Workstation for authenticating with proxy server.</description>
</property>
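A minimal proxy setup, with a hypothetical hostname and port, might look like this:
<property>
  <name>fs.s3a.proxy.host</name>
  <value>proxy.example.com</value>
</property>
<property>
  <name>fs.s3a.proxy.port</name>
  <value>8080</value>
</property>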
<property>
<name>fs.s3a.attempts.maximum</name>
<value>20</value>
<description>How many times we should retry commands on transient errors.</description>
</property>
<property>
<name>fs.s3a.connection.establish.timeout</name>
<value>5000</value>
<description>Socket connection setup timeout in milliseconds.</description>
</property>
<property>
<name>fs.s3a.connection.timeout</name>
<value>200000</value>
<description>Socket connection timeout in milliseconds.</description>
</property>
<property>
<name>fs.s3a.socket.send.buffer</name>
<value>8192</value>
<description>Socket send buffer hint to the Amazon connector. Represented in bytes.</description>
</property>
<property>
<name>fs.s3a.socket.recv.buffer</name>
<value>8192</value>
<description>Socket receive buffer hint to the Amazon connector. Represented in bytes.</description>
</property>
<property>
<name>fs.s3a.paging.maximum</name>
<value>5000</value>
<description>How many keys to request from S3 at a time when doing
directory listings.
</description>
</property>
<property>
<name>fs.s3a.threads.max</name>
<value>10</value>
<description>The total number of threads available in the filesystem for data
uploads *or any other queued filesystem operation*.
</description>
</property>
<property>
<name>fs.s3a.threads.keepalivetime</name>
<value>60</value>
<description>Number of seconds a thread can be idle before being
terminated.
</description>
</property>
<property>
<name>fs.s3a.max.total.tasks</name>
<value>5</value>
<description>The number of operations which can be queued for execution.</description>
</property>
<property>
<name>fs.s3a.multipart.size</name>
<value>100M</value>
<description>How big (in bytes) to split upload or copy operations up into.
A suffix from the set {K,M,G,T,P} may be used to scale the numeric value.
</description>
</property>
<property>
<name>fs.s3a.multipart.threshold</name>
<value>2147483647</value>
<description>How big (in bytes) to split upload or copy operations up into.
This also controls the partition size in renamed files, as rename() involves
copying the source file(s).
A suffix from the set {K,M,G,T,P} may be used to scale the numeric value.
</description>
</property>
<property>
<name>fs.s3a.multiobjectdelete.enable</name>
<value>true</value>
<description>When enabled, multiple single-object delete requests are replaced by
a single 'delete multiple objects'-request, reducing the number of requests.
Beware: legacy S3-compatible object stores might not support this request.
</description>
</property>
<property>
<name>fs.s3a.acl.default</name>
<description>Set a canned ACL for newly created and copied objects. Value may be Private,
PublicRead, PublicReadWrite, AuthenticatedRead, LogDeliveryWrite, BucketOwnerRead,
or BucketOwnerFullControl.
</description>
</property>
<property>
<name>fs.s3a.multipart.purge</name>
<value>false</value>
<description>True if you want to purge existing multipart uploads that may not have been
completed/aborted correctly. The corresponding purge age is defined in
fs.s3a.multipart.purge.age.
If set, when the filesystem is instantiated, all outstanding uploads
older than the purge age will be terminated, across the entire bucket.
This will impact multipart uploads by other applications and users, so it should
be used sparingly, with an age value chosen to stop failed uploads without
breaking ongoing operations.
</description>
</property>
<property>
<name>fs.s3a.multipart.purge.age</name>
<value>86400</value>
<description>Minimum age in seconds of multipart uploads to purge
on startup if "fs.s3a.multipart.purge" is true
</description>
</property>
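As an illustration only, purging uploads older than one hour on filesystem startup could be configured like this:
<property>
  <name>fs.s3a.multipart.purge</name>
  <value>true</value>
</property>
<property>
  <name>fs.s3a.multipart.purge.age</name>
  <value>3600</value>
</property>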
<property>
<name>fs.s3a.server-side-encryption-algorithm</name>
<description>Specify a server-side encryption algorithm for s3a: file system.
Unset by default. It supports the following values: 'AES256' (for SSE-S3),
'SSE-KMS' and 'SSE-C'.
</description>
</property>
<property>
<name>fs.s3a.server-side-encryption.key</name>
<description>Specific encryption key to use if fs.s3a.server-side-encryption-algorithm
has been set to 'SSE-KMS' or 'SSE-C'. In the case of SSE-C, the value of this property
should be the Base64-encoded key. If you are using SSE-KMS and leave this property empty,
the default S3 KMS key for your account will be used; otherwise, set this property to
the specific KMS key ID.
</description>
</property>
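A sketch of enabling SSE-KMS with an explicit key; the key ARN shown is a placeholder for your own KMS key ID:
<property>
  <name>fs.s3a.server-side-encryption-algorithm</name>
  <value>SSE-KMS</value>
</property>
<property>
  <name>fs.s3a.server-side-encryption.key</name>
  <value>arn:aws:kms:us-east-1:123456789012:key/example-key-id</value>
</property>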
<property>
<name>fs.s3a.signing-algorithm</name>
<description>Override the default signing algorithm so legacy
implementations can still be used
</description>
</property>
<property>
<name>fs.s3a.block.size</name>
<value>32M</value>
<description>Block size to use when reading files using s3a: file system.
A suffix from the set {K,M,G,T,P} may be used to scale the numeric value.
</description>
</property>
<property>
<name>fs.s3a.buffer.dir</name>
<value>${hadoop.tmp.dir}/s3a</value>
<description>Comma-separated list of directories that will be used to buffer file
uploads.
</description>
</property>
<property>
<name>fs.s3a.fast.upload.buffer</name>
<value>disk</value>
<description>
The buffering mechanism to use for data being written.
Values: disk, array, bytebuffer.
"disk" will use the directories listed in fs.s3a.buffer.dir as
the location(s) to save data prior to being uploaded.
"array" uses arrays in the JVM heap
"bytebuffer" uses off-heap memory within the JVM.
Both "array" and "bytebuffer" will consume memory in a single stream up to the number
of blocks set by:
fs.s3a.multipart.size * fs.s3a.fast.upload.active.blocks.
If using either of these mechanisms, keep this value low.
The total number of threads performing work across all threads is set by
fs.s3a.threads.max, with fs.s3a.max.total.tasks values setting the number of queued
work items.
</description>
</property>
<property>
<name>fs.s3a.fast.upload.active.blocks</name>
<value>4</value>
<description>
Maximum number of blocks a single output stream can have
active (uploading, or queued to the central FileSystem
instance's pool of queued operations).
This stops a single stream overloading the shared thread pool.
</description>
</property>
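For example, switching to off-heap buffering while limiting memory consumption could look like the following; the block count of 2 is an illustrative choice, not a recommendation:
<property>
  <name>fs.s3a.fast.upload.buffer</name>
  <value>bytebuffer</value>
</property>
<property>
  <name>fs.s3a.fast.upload.active.blocks</name>
  <value>2</value>
</property>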
<property>
<name>fs.s3a.readahead.range</name>
<value>64K</value>
<description>Bytes to read ahead during a seek() before closing and
re-opening the S3 HTTP connection. This option will be overridden if
any call to setReadahead() is made to an open stream.
A suffix from the set {K,M,G,T,P} may be used to scale the numeric value.
</description>
</property>
<property>
<name>fs.s3a.user.agent.prefix</name>
<value></value>
<description>
Sets a custom value that will be prepended to the User-Agent header sent in
HTTP requests to the S3 back-end by S3AFileSystem. The User-Agent header
always includes the Hadoop version number followed by a string generated by
the AWS SDK. An example is "User-Agent: Hadoop 2.8.0, aws-sdk-java/1.10.6".
If this optional property is set, then its value is prepended to create a
customized User-Agent. For example, if this configuration property was set
to "MyApp", then an example of the resulting User-Agent would be
"User-Agent: MyApp, Hadoop 2.8.0, aws-sdk-java/1.10.6".
</description>
</property>
<property>
<name>fs.s3a.metadatastore.authoritative</name>
<value>false</value>
<description>
When true, allow MetadataStore implementations to act as source of
truth for getting file status and directory listings. Even if this
is set to true, MetadataStore implementations may choose not to
return authoritative results. If the configured MetadataStore does
not support being authoritative, this setting will have no effect.
</description>
</property>
<property>
<name>fs.s3a.metadatastore.impl</name>
<value>org.apache.hadoop.fs.s3a.s3guard.NullMetadataStore</value>
<description>
Fully-qualified name of the class that implements the MetadataStore
to be used by s3a. The default class, NullMetadataStore, has no
effect: s3a will continue to treat the backing S3 service as the one
and only source of truth for file and directory metadata.
</description>
</property>
<property>
<name>fs.s3a.s3guard.cli.prune.age</name>
<value>86400000</value>
<description>
Default age (in milliseconds) after which to prune metadata from the
metadatastore when the prune command is run. Can be overridden on the
command-line.
</description>
</property>
<property>
<name>fs.s3a.impl</name>
<value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
<description>The implementation class of the S3A Filesystem</description>
</property>
<property>
<name>fs.s3a.s3guard.ddb.region</name>
<value></value>
<description>
AWS DynamoDB region to connect to. An up-to-date list is
provided in the AWS Documentation: regions and endpoints. Without this
property, S3Guard will operate the table in the associated S3 bucket region.
</description>
</property>
<property>
<name>fs.s3a.s3guard.ddb.table</name>
<value></value>
<description>
The DynamoDB table name to operate on. Without this property, the respective
S3 bucket name will be used.
</description>
</property>
<property>
<name>fs.s3a.s3guard.ddb.table.create</name>
<value>false</value>
<description>
If true, the S3A client will create the table if it does not already exist.
</description>
</property>
<property>
<name>fs.s3a.s3guard.ddb.table.capacity.read</name>
<value>500</value>
<description>
Provisioned throughput requirements for read operations in terms of capacity
units for the DynamoDB table. This config value will only be used when
creating a new DynamoDB table, though later you can manually provision by
increasing or decreasing read capacity as needed for existing tables.
See DynamoDB documents for more information.
</description>
</property>
<property>
<name>fs.s3a.s3guard.ddb.table.capacity.write</name>
<value>100</value>
<description>
Provisioned throughput requirements for write operations in terms of
capacity units for the DynamoDB table. Refer to related config
fs.s3a.s3guard.ddb.table.capacity.read before usage.
</description>
</property>
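As a sketch, enabling S3Guard with a DynamoDB-backed store (assuming the implementation class org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore, which is not listed above, and a placeholder table name) might look like this:
<property>
  <name>fs.s3a.metadatastore.impl</name>
  <value>org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore</value>
</property>
<property>
  <name>fs.s3a.s3guard.ddb.table</name>
  <value>example-s3guard-table</value>
</property>
<property>
  <name>fs.s3a.s3guard.ddb.table.create</name>
  <value>true</value>
</property>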
<property>
<name>fs.s3a.s3guard.ddb.max.retries</name>
<value>9</value>
<description>
Max retries on batched DynamoDB operations before giving up and
throwing an IOException. Each retry is delayed with an exponential
backoff timer which starts at 100 milliseconds and approximately
doubles each time. The minimum wait before throwing an exception is
sum(100, 200, 400, 800, .. 100*2^N-1 ) == 100 * ((2^N)-1)
So N = 9 yields at least 51.1 seconds (51,100 milliseconds) of blocking
before throwing an IOException.
</description>
</property>
<property>
<name>fs.s3a.s3guard.ddb.background.sleep</name>
<value>25</value>
<description>
Length (in milliseconds) of pause between each batch of deletes when
pruning metadata. Prevents prune operations (which can typically be low
priority background operations) from overly interfering with other I/O
operations.
</description>
</property>
<property>
<name>fs.s3a.retry.limit</name>
<value>${fs.s3a.attempts.maximum}</value>
<description>
Number of times to retry any repeatable S3 client request on failure,
excluding throttling requests.
</description>
</property>
<property>
<name>fs.s3a.retry.interval</name>
<value>500ms</value>
<description>
Interval between attempts to retry operations for any reason other
than S3 throttle errors.
</description>
</property>
<property>
<name>fs.s3a.retry.throttle.limit</name>
<value>${fs.s3a.attempts.maximum}</value>
<description>
Number of times to retry any throttled request.
</description>
</property>
<property>
<name>fs.s3a.retry.throttle.interval</name>
<value>1000ms</value>
<description>
Interval between retry attempts on throttled requests.
</description>
</property>
<property>
<name>fs.s3a.committer.name</name>
<value>file</value>
<description>
Committer to create for output to S3A, one of:
"file", "directory", "partitioned", "magic".
</description>
</property>
<property>
<name>fs.s3a.committer.magic.enabled</name>
<value>false</value>
<description>
Enable support in the filesystem for the S3 "Magic" committer.
When working with AWS S3, S3Guard must be enabled for the destination
bucket, as consistent metadata listings are required.
</description>
</property>
<property>
<name>fs.s3a.committer.threads</name>
<value>8</value>
<description>
Number of threads in committers for parallel operations on files
(upload, commit, abort, delete...)
</description>
</property>
<property>
<name>fs.s3a.committer.staging.tmp.path</name>
<value>tmp/staging</value>
<description>
Path in the cluster filesystem for temporary data.
This is for HDFS, not the local filesystem.
It is only for the summary data of each file, not the actual
data being committed.
Using an unqualified path guarantees that the full path will be
generated relative to the home directory of the user creating the job,
hence private (assuming home directory permissions are secure).
</description>
</property>
<property>
<name>fs.s3a.committer.staging.unique-filenames</name>
<value>true</value>
<description>
Option for final files to have a unique name through job attempt info,
or the value of fs.s3a.committer.staging.uuid
When writing data with the "append" conflict option, this guarantees
that new data will not overwrite any existing data.
</description>
</property>
<property>
<name>fs.s3a.committer.staging.conflict-mode</name>
<value>fail</value>
<description>
Staging committer conflict resolution policy.
Supported: "fail", "append", "replace".
</description>
</property>
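Putting the committer options together, selecting the staging directory committer with a "replace" conflict policy might look like this (illustrative only):
<property>
  <name>fs.s3a.committer.name</name>
  <value>directory</value>
</property>
<property>
  <name>fs.s3a.committer.staging.conflict-mode</name>
  <value>replace</value>
</property>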
<property>
<name>fs.s3a.committer.staging.abort.pending.uploads</name>
<value>true</value>
<description>
Should the staging committers abort all pending uploads to the destination
directory?
Change this if more than one partitioned committer is
writing to the same destination tree simultaneously; otherwise
the first job to complete will cancel all outstanding uploads from the
others. However, disabling it may lead to leaked outstanding uploads from
failed tasks. If disabled, configure the bucket lifecycle to remove uploads
after a time period, and/or set up a workflow to explicitly delete
entries. Otherwise there is a risk that uncommitted uploads may run up
bills.
</description>
</property>
<property>
<name>fs.s3a.list.version</name>
<value>2</value>
<description>
Select which version of the S3 SDK's List Objects API to use. Currently
supported values are 2 (default) and 1 (older API).
</description>
</property>
<property>
<name>fs.s3a.etag.checksum.enabled</name>
<value>false</value>
<description>
Should calls to getFileChecksum() return the etag value of the remote
object.
WARNING: if enabled, distcp operations between HDFS and S3 will fail unless
-skipcrccheck is set.
</description>
</property>
