Can you translate (or alias) s3:// to s3a:// in Spark/ Hadoop?
Asked by Adi C, 2019-12-12T05:10:32


We have some code that we run on Amazon's servers that loads Parquet files using the s3:// scheme, as advised by Amazon. However, some developers want to run the code locally using a Spark installation on Windows, but Spark stubbornly insists on using the s3a:// scheme.

We can read files just fine using s3a://, but with s3:// we get a java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException.

SparkSession available as 'spark'.
>>> spark.read.parquet('s3a://bucket/key')
DataFrame[********************************************]
>>> spark.read.parquet('s3://bucket/key')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\spark\spark-2.4.4-bin-hadoop2.7\python\pyspark\sql\readwriter.py", line 316, in parquet
    return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
  File "C:\spark\spark-2.4.4-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py", line 1257, in __call__
  File "C:\spark\spark-2.4.4-bin-hadoop2.7\python\pyspark\sql\utils.py", line 63, in deco
    return f(*a, **kw)
  File "C:\spark\spark-2.4.4-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o37.parquet.
: java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException
        at org.apache.hadoop.fs.s3.S3FileSystem.createDefaultStore(S3FileSystem.java:99)
        at org.apache.hadoop.fs.s3.S3FileSystem.initialize(S3FileSystem.java:89)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
        at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:45)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:332)
        at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
        at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:644)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Unknown Source)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.ClassNotFoundException: org.jets3t.service.S3ServiceException
        at java.net.URLClassLoader.findClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
        at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
        ... 24 more

Is there a way to get Hadoop, Spark, or PySpark to "translate" the URI scheme from s3 to s3a via some sort of magic configuration? Changing the code is not an option we are entertaining, as it would involve quite a lot of testing.

The local environment is Windows 10, PySpark 2.4.4 with Hadoop 2.7 (prebuilt), Python 3.7.5, and the right AWS libraries installed.

EDIT: One hack I used, since we're not supposed to be using s3:// paths anyway, is to simply convert them to s3a:// in PySpark.

I added the following function to readwriter.py and invoked it wherever there was a call out to the JVM with paths. It works fine, but it would be nice if this were a config option.

def massage_paths(paths):
    # Rewrite any s3:// URI to the equivalent s3a:// URI; leave other paths untouched.
    def fix(p):
        return 's3a' + p[2:] if p.startswith('s3:') else p
    if isinstance(paths, str):  # basestring does not exist in Python 3
        return fix(paths)
    t = list if isinstance(paths, list) else tuple
    return t(fix(p) for p in paths)
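
For what it's worth, a less invasive variant of the same hack is sketched below (my illustration, not part of the original fix): wrap DataFrameReader.parquet from user code so s3:// paths are rewritten before they reach the JVM, leaving readwriter.py untouched. It assumes the Spark 2.4 signature parquet(self, *paths); the helper names are made up.

# Sketch: monkeypatch DataFrameReader.parquet so s3:// is rewritten to s3a://.
from pyspark.sql import DataFrameReader

def _to_s3a(path):
    # 's3://bucket/key' -> 's3a://bucket/key'; anything else passes through unchanged.
    return 's3a' + path[2:] if path.startswith('s3:') else path

_original_parquet = DataFrameReader.parquet

def _parquet_via_s3a(self, *paths):
    return _original_parquet(self, *[_to_s3a(p) for p in paths])

DataFrameReader.parquet = _parquet_via_s3a

After this runs, spark.read.parquet('s3://bucket/key') behaves like the s3a:// call, and no Spark source files need to be edited.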

Copyright notice: content author "Adi C", reproduced under the CC BY-SA 4.0 license with a link to the original source and this disclaimer.
Link to original article: https://stackoverflow.com/questions/59294131/can-you-translate-or-alias-s3-to-s3a-in-spark-hadoop

Answers
stevel 2019-12-17T17:22:11

cricket007 is correct.

spark.hadoop.fs.s3.impl org.apache.hadoop.fs.s3a.S3AFileSystem

There's some code in org.apache.hadoop.fs.FileSystem which looks up the schema "s3", maps it to an implementation class, loads it, and instantiates it with the full URL.

Warning: there's no specific code in the core S3A filesystem which looks for the FS schema being s3a, but you will encounter problems if you use the DynamoDB consistency layer "S3Guard" - that's probably a bit of overkill someone could fix.
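
For illustration, here is a minimal PySpark sketch of applying that property per session instead of via spark-defaults.conf; the bucket/key are placeholders, and the hadoop-aws and AWS SDK jars that already make s3a:// work still need to be on the classpath.

# Sketch: map the bare s3 scheme onto the S3A connector for this session.
# spark.hadoop.* properties are copied by Spark into the Hadoop Configuration.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config('spark.hadoop.fs.s3.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')
         .getOrCreate())

# s3:// now resolves to the same S3AFileSystem that backs s3a://.
df = spark.read.parquet('s3://bucket/key')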


OneCricketeer 2019-12-13T03:44:18

Ideally, you could refactor the code to detect the runtime environment, or externalize the paths to a config file that could be used in the respective areas.

Otherwise, you would need to edit hdfs-site.xml and configure the fs.s3.impl key (the fs.s3a.impl entry with "s3a" renamed to "s3"), keeping the value the same. That change would need to be done on all Spark workers.
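
For illustration, the property this answer describes would look roughly like the following in hdfs-site.xml (or core-site.xml) on every node; treat it as a sketch rather than a verified cluster configuration.

<!-- Sketch: point the bare s3 scheme at the S3A implementation. -->
<property>
  <name>fs.s3.impl</name>
  <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
</property>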

