
I am trying to read Parquet files from S3, but it kills my server (processing runs for a very long time and I have to reset the machine in order to continue working). There is no issue writing the Parquet files to S3, and writing and reading locally works perfectly. Reading small files from S3 is also fine. As seen in many threads, Spark's "s3a" file system client (2nd config here) should be able to handle it, but in fact I get a NoSuchMethodError when trying to use s3a (with the proper s3a configuration listed below):

Py4JJavaError: An error occurred while calling o155.json.
: java.lang.NoSuchMethodError: com.amazonaws.services.s3.transfer.TransferManager.<init>(Lcom/amazonaws/services/s3/AmazonS3;Ljava/util/concurrent/ThreadPoolExecutor;)
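
For context, the read that triggers this is roughly the following (the bucket and prefix are placeholders, not my real paths):

df = spark.read.parquet('s3a://my-bucket/some-prefix/')   # placeholder path; this call raises the NoSuchMethodError above
df.count()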

The following SparkSession configuration works, but only for small files:

s3 config:

spark = SparkSession.builder.appName('JSON2parquet')\
            .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")\
            .config('fs.s3.awsAccessKeyId', 'myAccessId')\
            .config('fs.s3.awsSecretAccessKey', 'myAccessKey')\
            .config('fs.s3.impl', 'org.apache.hadoop.fs.s3native.NativeS3FileSystem')\
            .config("spark.sql.parquet.filterPushdown", "true")\
            .config("spark.sql.parquet.mergeSchema", "false")\
            .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")\
            .config("spark.speculation", "false")\
            .getOrCreate()
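
With this config, a read like the following is fine for small files but never finishes on the larger ones (again, placeholder paths rather than my real bucket):

df_small = spark.read.parquet('s3://my-bucket/small-file.parquet')   # placeholder; small files read fine
df_big = spark.read.parquet('s3://my-bucket/big-prefix/')            # placeholder; this is the read that hangs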

s3a config:

spark = SparkSession.builder.appName('JSON2parquet')\
            .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")\
            .config('spark.hadoop.fs.s3a.access.key', 'myAccessId')\
            .config('spark.hadoop.fs.s3a.secret.key', 'myAccessKey')\
            .config('spark.hadoop.fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')\
            .config("spark.sql.parquet.filterPushdown", "true")\
            .config("spark.sql.parquet.mergeSchema", "false")\
            .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")\
            .config("spark.speculation", "false")\
            .getOrCreate()
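
I have also tried setting the same s3a keys at runtime with spark.conf.set (see my comment below), e.g.:

spark.conf.set('fs.s3a.access.key', '%%myAcsessId%%')    # %%...%% are placeholders for my real credentials
spark.conf.set('fs.s3a.secret.key', '%%mySecretKey%%')
spark.conf.set('fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')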

JARs for s3 read-write (spark.driver.extraClassPath):

hadoop-aws-2.7.3.jar,
**hadoop-common-2.7.3.jar**, -- added in order to use S3a
aws-java-sdk-s3-1.11.156.jar
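
They are wired in roughly like this, e.g. via spark-defaults.conf (the paths here are placeholders):

spark.driver.extraClassPath /path/to/jars/hadoop-aws-2.7.3.jar:/path/to/jars/hadoop-common-2.7.3.jar:/path/to/jars/aws-java-sdk-s3-1.11.156.jar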

Is there any other .config I can use to solve this issue?

Thanks, Mosh.

  • You **need** `hadoop-common` if you want `org/apache/hadoop/fs` classes, but Spark otherwise has lots of `2.7.x` versions of Hadoop JARs, not `3.x` – OneCricketeer Sep 26 '19 at 21:38
  • I tried ```hadoop-common-2.7.3``` with ```hadoop-aws-2.7.3``` and still nothing (both should do the job according to the documentation). Is **s3a** required for AWS-Spark-Parquet connectivity, or is there any way to have it working via **s3**? – Moshik Mishaeli Oct 03 '19 at 09:26
  • You should at least get a different error by including those. I'm not sure what you mean by "via s3". You need the Hadoop libraries to translate filesystem methods into S3 API calls – OneCricketeer Oct 03 '19 at 12:42
  • s3a file system client -- as I mentioned in the original post, the s3 configuration is: ```spark.conf.set('fs.s3.awsAccessKeyId', '%%myAcsessId%%') spark.conf.set('fs.s3.awsSecretAccessKey', '%%mySecretKey%%') spark.conf.set('fs.s3.impl', 'org.apache.hadoop.fs.s3native.NativeS3FileSystem')``` and the s3a configuration is: ```spark.conf.set('fs.s3a.access.key', '%%myAcsessId%%') spark.conf.set('fs.s3a.secret.key', '%%mySecretKey%%') spark.conf.set('fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')``` – Moshik Mishaeli Oct 03 '19 at 13:09
  • Last I checked, S3A is to be used outside of EMR https://stackoverflow.com/questions/33356041/technically-what-is-the-difference-between-s3n-s3a-and-s3 – OneCricketeer Oct 03 '19 at 13:19
  • I am actually not using EMR. Does it make sense I am able to write but not read? – Moshik Mishaeli Oct 03 '19 at 13:52

0 Answers