We have some code that we run on Amazon's servers that loads parquet using the s3:// scheme as advised by Amazon. However, some developers want to run code locally using a spark installation on Windows, but stubbornly spark insists on using the s3a:// scheme.
We can read files just fine using s3a, but we get an java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException.
SparkSession available as 'spark'.
>>> spark.read.parquet('s3a://bucket/key')
DataFrame[********************************************]
>>> spark.read.parquet('s3://bucket/key')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\spark\spark-2.4.4-bin-hadoop2.7\python\pyspark\sql\readwriter.py", line 316, in parquet
return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
File "C:\spark\spark-2.4.4-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py", line 1257, in __call__
File "C:\spark\spark-2.4.4-bin-hadoop2.7\python\pyspark\sql\utils.py", line 63, in deco
return f(*a, **kw)
File "C:\spark\spark-2.4.4-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o37.parquet.
: java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException
at org.apache.hadoop.fs.s3.S3FileSystem.createDefaultStore(S3FileSystem.java:99)
at org.apache.hadoop.fs.s3.S3FileSystem.initialize(S3FileSystem.java:89)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:45)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:332)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:644)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.ClassNotFoundException: org.jets3t.service.S3ServiceException
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
... 24 more
Is there a way to get hadoop or spark or pyspark to "translate" the URI scheme from s3 to s3a via some sort of magic configuration? Changing the code is not an option we entertain as it would involve quite a lot of testing.
The local environment is windows 10, pyspark2.4.4 with hadoop2.7 (prebuilt), python3.7.5, and the right aws libs installed.
EDIT: One hack I used - since we're not supposed to use s3:// paths is to just convert them to s3a:// in pyspark.
I've added the following function in readwriter.py and just invoked it wherever there was a call out to the jvm with paths. Works fine, but would be nice if this was a config option.
def massage_paths(paths):
if isinstance(paths, basestring):
return 's3a' + x[2:] if x.startswith('s3:') else x
if isinstance(paths, list):
t = list
else:
t = tuple
return t(['s3a' + x[2:] if x.startswith('s3:') else x for x in paths])
Copyright Notice:Content Author:「Adi C」,Reproduced under the CC 4.0 BY-SA copyright license with a link to the original source and this disclaimer.
Link to original article:https://stackoverflow.com/questions/59294131/can-you-translate-or-alias-s3-to-s3a-in-spark-hadoop