
I have saved the table below to AWS S3 using PySpark, partitioned by the column "channel_name", with the following code:

 df.write.option("header",True) \
                .partitionBy("channel_name") \
                .mode('append')\
                .parquet("s3://path") 

start_timestamp       channel_name   value
2020-11-02 08:51:50   velocity       1
2020-11-02 09:14:29   Temp           0
2020-11-02 09:18:32   velocity       0
2020-11-02 09:32:42   velocity       4
2020-11-03 13:06:03   Temp           2
2020-11-03 13:10:01   Temp           1
2020-11-03 13:54:38   Temp           5
2020-11-03 14:46:25   velocity       5
2020-11-03 14:57:31   Kilometer      6
2020-11-03 15:07:07   Kilometer      7
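
For context, partitionBy encodes "channel_name" in the directory names rather than inside the Parquet files themselves, so the resulting S3 layout looks roughly like this (file names are illustrative):

    s3://path/channel_name=velocity/part-00000.snappy.parquet
    s3://path/channel_name=Temp/part-00000.snappy.parquet
    s3://path/channel_name=Kilometer/part-00000.snappy.parquet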

But when I read the same data back with Python, it's not working: the partitioned column "channel_name" is excluded from the result. Below is the code I tried with awswrangler.

    import awswrangler as wr
    df = wr.s3.read_parquet(path="s3://shreyasbigdata/Prod_test_item_id=V214944/")

The result looks like this, but I want the "channel_name" column as well.

start_timestamp       value
2020-11-02 08:51:50   1
2020-11-02 09:14:29   0
2020-11-02 09:18:32   0
2020-11-02 09:32:42   4
2020-11-03 13:06:03   2
2020-11-03 13:10:01   1
2020-11-03 13:54:38   5
2020-11-03 14:46:25   5
2020-11-03 14:57:31   6
2020-11-03 15:07:07   7

I tried different libraries, but it's not working. It would be great if you could help me read all the columns, including the partitioned one.


1 Answer


I got the answer, thank you:

    import s3fs
    import pyarrow.parquet as pq

    # s3fs gives pyarrow a filesystem interface to S3
    fs = s3fs.S3FileSystem()

    bucket = 'bucket_name'
    path = 'path_of_folder'  # if it's a directory, omit the trailing /
    bucket_uri = f's3://{bucket}/{path}'

    # ParquetDataset walks the partition directories and reads
    # every file under the prefix as a single logical table
    dataset = pq.ParquetDataset(bucket_uri, filesystem=fs)
    table = dataset.read()
    df = table.to_pandas()
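
ParquetDataset discovers the hive-style channel_name=... directories and reconstructs "channel_name" as a regular column. Alternatively, awswrangler can do the same; a minimal sketch, assuming the same S3 prefix as in the question:

    import awswrangler as wr

    # dataset=True makes awswrangler treat the prefix as a partitioned
    # dataset, so partition columns such as "channel_name" are rebuilt
    # from the directory names instead of being dropped
    df = wr.s3.read_parquet(path="s3://shreyasbigdata/Prod_test_item_id=V214944/", dataset=True)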
  • Found this answer at the following link: https://stackoverflow.com/questions/45082832/how-to-read-partitioned-parquet-files-from-s3-using-pyarrow-in-python – SSS Feb 10 '22 at 10:17