Can you translate (or alias) s3:// to s3a:// in Spark/ Hadoop?
Asked by Adi C, 2019-12-12T05:10:32


We have some code that we run on Amazon's servers that loads Parquet files using the s3:// scheme, as advised by Amazon. However, some developers want to run the code locally using a Spark installation on Windows, but Spark stubbornly insists on using the s3a:// scheme.

We can read files just fine using s3a://, but with s3:// we get a java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException.

SparkSession available as 'spark'.
>>> spark.read.parquet('s3a://bucket/key')
DataFrame[********************************************]
>>> spark.read.parquet('s3://bucket/key')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\spark\spark-2.4.4-bin-hadoop2.7\python\pyspark\sql\readwriter.py", line 316, in parquet
    return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
  File "C:\spark\spark-2.4.4-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py", line 1257, in __call__
  File "C:\spark\spark-2.4.4-bin-hadoop2.7\python\pyspark\sql\utils.py", line 63, in deco
    return f(*a, **kw)
  File "C:\spark\spark-2.4.4-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o37.parquet.
: java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException
        at org.apache.hadoop.fs.s3.S3FileSystem.createDefaultStore(S3FileSystem.java:99)
        at org.apache.hadoop.fs.s3.S3FileSystem.initialize(S3FileSystem.java:89)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
        at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:45)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:332)
        at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
        at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:644)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Unknown Source)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.ClassNotFoundException: org.jets3t.service.S3ServiceException
        at java.net.URLClassLoader.findClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
        at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
        ... 24 more

Is there a way to get Hadoop, Spark, or PySpark to "translate" the URI scheme from s3 to s3a via some sort of magic configuration? Changing the code is not an option we are entertaining, as it would involve quite a lot of testing.

The local environment is Windows 10, PySpark 2.4.4 with Hadoop 2.7 (prebuilt), Python 3.7.5, and the right AWS libraries installed.

EDIT: One hack I used, since we're not supposed to be using s3:// paths anyway, is to simply convert them to s3a:// in PySpark.

I added the following function to readwriter.py and invoked it wherever there was a call out to the JVM with paths. It works fine, but it would be nice if this were a config option.

def massage_paths(paths):
    # Rewrite any s3:// URI to the equivalent s3a:// URI; leave other paths untouched.
    def fix(p):
        return 's3a' + p[2:] if p.startswith('s3:') else p
    if isinstance(paths, str):  # basestring does not exist in Python 3
        return fix(paths)
    t = list if isinstance(paths, list) else tuple
    return t(fix(p) for p in paths)
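
For what it's worth, a less invasive variant of the same hack is sketched below (my illustration, not part of the original fix): wrap DataFrameReader.parquet from user code so s3:// paths are rewritten before they reach the JVM, leaving readwriter.py untouched. It assumes the Spark 2.4 signature parquet(self, *paths); the helper names are made up.

# Sketch: monkeypatch DataFrameReader.parquet so s3:// is rewritten to s3a://.
from pyspark.sql import DataFrameReader

def _to_s3a(path):
    # 's3://bucket/key' -> 's3a://bucket/key'; anything else passes through unchanged.
    return 's3a' + path[2:] if path.startswith('s3:') else path

_original_parquet = DataFrameReader.parquet

def _parquet_via_s3a(self, *paths):
    return _original_parquet(self, *[_to_s3a(p) for p in paths])

DataFrameReader.parquet = _parquet_via_s3a

After this runs, spark.read.parquet('s3://bucket/key') behaves like the s3a:// call, and no Spark source files need to be edited.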

Copyright notice: content author "Adi C", reproduced under the CC BY-SA 4.0 license with a link to the original source and this disclaimer.
Link to original article: https://stackoverflow.com/questions/59294131/can-you-translate-or-alias-s3-to-s3a-in-spark-hadoop

Answers
stevel 2019-12-17T17:22:11

cricket007 is correct.

spark.hadoop.fs.s3.impl org.apache.hadoop.fs.s3a.S3AFileSystem

There's some code in org.apache.hadoop.fs.FileSystem which looks up the schema "s3", maps it to an implementation class, loads it, and instantiates it with the full URL.

Warning: there's no specific code in the core S3A filesystem which looks for the FS schema being s3a, but you will encounter problems if you use the DynamoDB consistency layer "S3Guard" - that's probably a bit of overkill someone could fix.
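
For illustration, here is a minimal PySpark sketch of applying that property per session instead of via spark-defaults.conf; the bucket/key are placeholders, and the hadoop-aws and AWS SDK jars that already make s3a:// work still need to be on the classpath.

# Sketch: map the bare s3 scheme onto the S3A connector for this session.
# spark.hadoop.* properties are copied by Spark into the Hadoop Configuration.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config('spark.hadoop.fs.s3.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')
         .getOrCreate())

# s3:// now resolves to the same S3AFileSystem that backs s3a://.
df = spark.read.parquet('s3://bucket/key')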


OneCricketeer 2019-12-13T03:44:18

Ideally, you could refactor the code to detect the runtime environment, or externalize the paths to a config file that could be used in the respective areas.

Otherwise, you would need to edit hdfs-site.xml and configure the fs.s3.impl key (the fs.s3a.impl entry with "s3a" renamed to "s3"), keeping the value the same. That change would need to be done on all Spark workers.
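
For illustration, the property this answer describes would look roughly like the following in hdfs-site.xml (or core-site.xml) on every node; treat it as a sketch rather than a verified cluster configuration.

<!-- Sketch: point the bare s3 scheme at the S3A implementation. -->
<property>
  <name>fs.s3.impl</name>
  <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
</property>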

