I have a few Parquet files in an S3 bucket (s3://mybucket/my/path/). I want to read them into a Spark DataFrame using boto3.
I can't read them directly with spark.read.parquet('s3://mybucket/my/path/') because of existing security restrictions, so I need to read them using boto3.
While trying to read a single Parquet file (s3://mybucket/my/path/myfile1.parquet) using the code below, I get an error.
import io

res = autorefresh_session.resource('s3')
bucket = res.Bucket(name='mybucket')
# filter() returns a collection, so take the first matching object
objs = list(bucket.objects.filter(Prefix='my/path'))
body = io.BytesIO(objs[0].get()['Body'].read())
spark.read.parquet(body).show()
Py4JJavaError: An error occurred while calling xyz.parquet.
: java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.String
	at org.apache.spark.sql.DataFrameReader.preprocessDeltaLoading(DataFrameReader.scala:282)
Can anyone please let me know how to read both a single file and a complete folder using boto3?
I can read CSV files successfully using the above approach, but not Parquet files. I can read a single file into a pandas DataFrame and then into Spark, but that is not an efficient way to read.
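For reference, this is roughly the pandas fallback I mean, as a minimal sketch. It assumes pyarrow (or fastparquet) is installed as the pandas Parquet engine, and that autorefresh_session and spark are already set up as above:

import io

import pandas as pd

res = autorefresh_session.resource('s3')
obj = res.Object('mybucket', 'my/path/myfile1.parquet')

# pandas can read Parquet from an in-memory buffer (needs pyarrow or fastparquet)
pdf = pd.read_parquet(io.BytesIO(obj.get()['Body'].read()))

# convert to a Spark DataFrame; this materializes the whole file on the driver
sdf = spark.createDataFrame(pdf)
sdf.show()

This works one file at a time, but everything is pulled through the driver, which is why I'm looking for a way to read directly into Spark.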