
I have a few parquet files in an S3 bucket (s3://mybucket/my/path/). I want to read them into a Spark DataFrame using boto3.

I can't read them directly with spark.read.parquet('s3://mybucket/my/path/') because of existing security restrictions, so I need to read them using boto3.

While trying to read a single parquet file (s3://mybucket/my/path/myfile1.parquet) with the code below, I get an error.

import io

res = autorefresh_session.resource('s3')
bucket = res.Bucket(name='mybucket')
# take the first object under the prefix (the single parquet file)
obj = list(bucket.objects.filter(Prefix='my/path/'))[0]
body = io.BytesIO(obj.get()['Body'].read())
spark.read.parquet(body).show()

Py4JJavaError: An error occurred while calling xyz.parquet. : java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.String at org.apache.spark.sql.DataFrameReader.preprocessDeltaLoading(DataFrameReader.scala:282)

Can anyone please let me know how we can read a single file, and a complete folder, using boto3?

I can read CSV files successfully with the approach above, but not parquet files. I can read a single file into a pandas DataFrame and then into Spark, but this is not an efficient way to read it.


1 Answer


You can use the following steps.

Step-01: Read your parquet S3 location and convert it into a pandas DataFrame.

import pyarrow.parquet as pq
import s3fs

# s3fs exposes S3 as a filesystem that pyarrow can read from
s3 = s3fs.S3FileSystem()

# Read every parquet file under the bucket/prefix into one pandas DataFrame
pandas_dataframe = pq.ParquetDataset('s3://your-bucket/', filesystem=s3).read_pandas().to_pandas()
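
If you only need a single file (for example the myfile1.parquet from the question) instead of the whole prefix, the same s3fs-backed read should work by passing the full key; this is a minimal sketch, assuming the path from the question:

import pyarrow.parquet as pq
import s3fs

s3 = s3fs.S3FileSystem()

# Read just one parquet object instead of the whole prefix
single_pdf = pq.read_table('s3://mybucket/my/path/myfile1.parquet', filesystem=s3).to_pandas()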

Step-02: Convert the pandas DataFrame into a Spark DataFrame:

# Pandas to Spark
df_sp = spark_session.createDataFrame(pandas_dataframe)

# Spark to Pandas (the reverse direction, for reference)
df_pd = df_sp.toPandas()
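
If the pandas-to-Spark conversion becomes the bottleneck for bigger files (as mentioned in the comment below), enabling Arrow-based conversion may help. This sketch assumes Spark 3.x with pyarrow installed (on Spark 2.x the property is spark.sql.execution.arrow.enabled):

# Let Spark use Arrow when converting between pandas and Spark DataFrames
spark_session.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
df_sp = spark_session.createDataFrame(pandas_dataframe)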
  • Reading into a pandas DataFrame using boto3 works fine, but when the file size is big this is not an efficient way to read. – prasanta Nov 17 '21 at 22:00