I am trying to read Parquet files from S3, but the read hangs my server: it processes for a very long time, and I have to reset the machine before I can continue working. Writing the Parquet file to S3 works without issue, and writing to and reading from local disk works perfectly. Reading small files from S3 also works. According to many threads, Spark's "s3a" file system client (the second config below) should be able to handle large files, but in practice I get a 'NoSuchMethodError' when I try to use s3a (with the s3a configuration listed below):
Py4JJavaError: An error occurred while calling o155.json.
: java.lang.NoSuchMethodError: com.amazonaws.services.s3.transfer.TransferManager.<init>(Lcom/amazonaws/services/s3/AmazonS3;Ljava/util/concurrent/ThreadPoolExecutor;)
The following SparkSession configuration works, but only for small files:
s3 config:
spark = SparkSession.builder.appName('JSON2parquet')\
.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")\
.config('fs.s3.awsAccessKeyId', 'myAccessId')\
.config('fs.s3.awsSecretAccessKey', 'myAccessKey')\
.config('fs.s3.impl', 'org.apache.hadoop.fs.s3native.NativeS3FileSystem')\
.config("spark.sql.parquet.filterPushdown", "true")\
.config("spark.sql.parquet.mergeSchema", "false")\
.config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")\
.config("spark.speculation", "false")\
.getOrCreate()
s3a config:
spark = SparkSession.builder.appName('JSON2parquet')\
.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")\
.config('spark.hadoop.fs.s3a.access.key', 'myAccessId')\
.config('spark.hadoop.fs.s3a.secret.key', 'myAccessKey')\
.config('spark.hadoop.fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')\
.config("spark.sql.parquet.filterPushdown", "true")\
.config("spark.sql.parquet.mergeSchema", "false")\
.config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")\
.config("spark.speculation", "false")\
.getOrCreate()
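Since the two configs differ only in the file-system keys, the s3a variant can also be written as a plain dict that is applied in a loop, which makes it easier to diff against the s3 one. This is only a sketch of the same settings as above; `s3a_options` and `build_session` are hypothetical helper names of mine, not part of Spark.

```python
def s3a_options(access_key, secret_key):
    """Return the options from the s3a config above as a dict."""
    return {
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
        "spark.sql.parquet.filterPushdown": "true",
        "spark.sql.parquet.mergeSchema": "false",
        "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": "2",
        "spark.speculation": "false",
    }


def build_session(app_name, options):
    """Create a SparkSession with every option applied (requires pyspark)."""
    # Imported lazily so that s3a_options() can be inspected without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in options.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Calling `build_session('JSON2parquet', s3a_options('myAccessId', 'myAccessKey'))` should be equivalent to the chained `.config(...)` calls in the s3a config.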
JARs for s3 read-write (spark.driver.extraClassPath):
hadoop-aws-2.7.3.jar,
**hadoop-common-2.7.3.jar**, -- added in order to use S3a
aws-java-sdk-s3-1.11.156.jar
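One thing I can think of checking is whether these JAR versions belong together, since a 'NoSuchMethodError' usually means a class was compiled against a different version of a library than the one on the classpath. Below is a minimal sketch of such a check in plain Python. It assumes that hadoop-aws 2.7.x was built against aws-java-sdk 1.7.4 (my understanding of its POM, worth verifying), and `parse_jar` and `check_jars` are hypothetical helper names, not part of any library.

```python
import re

# ASSUMPTION: hadoop-aws 2.7.x declares aws-java-sdk 1.7.4 as its dependency.
# Verify this against the hadoop-aws POM for the exact version in use.
EXPECTED_SDK_FOR_HADOOP_AWS = {"2.7": "1.7.4"}

# Splits "<artifact>-<version>.jar" into its two parts.
JAR_PATTERN = re.compile(
    r"^(?P<name>[a-z][a-z0-9\-]*?)-(?P<version>\d+(?:\.\d+)*)\.jar$"
)


def parse_jar(filename):
    """Return (artifact_name, version) for a JAR file name."""
    match = JAR_PATTERN.match(filename)
    if match is None:
        raise ValueError("unrecognised JAR name: " + filename)
    return match.group("name"), match.group("version")


def check_jars(filenames):
    """Return a list of human-readable version-mismatch warnings."""
    versions = dict(parse_jar(f) for f in filenames)
    problems = []
    hadoop_aws = versions.get("hadoop-aws")
    if hadoop_aws is None:
        return problems
    # "2.7.3" -> "2.7", used to look up the SDK version assumed above.
    hadoop_line = ".".join(hadoop_aws.split(".")[:2])
    expected_sdk = EXPECTED_SDK_FOR_HADOOP_AWS.get(hadoop_line)
    for name, version in versions.items():
        if name.startswith("hadoop-") and version != hadoop_aws:
            problems.append(
                "%s is %s but hadoop-aws is %s" % (name, version, hadoop_aws)
            )
        if name.startswith("aws-java-sdk") and expected_sdk and version != expected_sdk:
            problems.append(
                "%s is %s but hadoop-aws %s expects aws-java-sdk %s"
                % (name, version, hadoop_aws, expected_sdk)
            )
    return problems


jars = [
    "hadoop-aws-2.7.3.jar",
    "hadoop-common-2.7.3.jar",
    "aws-java-sdk-s3-1.11.156.jar",
]
for problem in check_jars(jars):
    print(problem)
```

Under that assumption, the check flags only the aws-java-sdk-s3 JAR from my list, while the two Hadoop JARs agree with each other.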
Is there any other .config I can use to solve this issue?
Thanks, Mosh.