I am running an AWS Glue job with PySpark, where I read a JSON file from S3, do some transformations on it, and then save it back to the same location. The behaviour is roughly the following:
df = spark.read.json(path)
# checkpoint the df and read it back
df = df.transformation1()  # placeholders for the actual transformations
df = df.transformation2()
print(df.count())
df.write.json(s3_path)
The count gets printed, but the write to S3 fails with the exception: org.apache.spark.SparkException: Job 207 cancelled because SparkContext was shut down
After searching online, it seems like this is an OOM issue. The data in question is only a few MB, so it is pretty small. Considering that the count operation succeeds and the write fails, would it make sense to persist the dataframe before the count, so that the write is more likely to succeed because the data to write has already been computed and persisted?
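For reference, a minimal sketch of what that persist-before-count approach could look like; the path variables and the transformation calls are placeholders standing in for the real job, not its actual code:

from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.getOrCreate()

df = spark.read.json(path)         # path: placeholder for the S3 input prefix
df = df.transformation1()          # placeholders for the job's real transformations
df = df.transformation2()

# Keep the computed result on executor memory/disk so that count() and the
# subsequent write reuse the same materialized partitions instead of
# recomputing the whole lineage (and re-reading the input) twice.
df = df.persist(StorageLevel.MEMORY_AND_DISK)
print(df.count())                  # the count action is what actually fills the cache

df.write.mode("overwrite").json(s3_path)   # s3_path: placeholder for the output prefix
df.unpersist()

Note that persist() itself is lazy; it is the count() action that materializes and caches the data, so the write afterwards can serve from the cached partitions rather than re-reading the original S3 location that is about to be overwritten.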
Other error logs that I found:
Lost executor 2 on 172.36.186.165: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN