
I have saved the table below to AWS S3 using PySpark, partitioned by the column "channel_name", with the following code:

 df.write.option("header",True) \
                .partitionBy("channel_name") \
                .mode('append')\
                .parquet("s3://path") 

start_timestamp       channel_name   value
2020-11-02 08:51:50   velocity       1
2020-11-02 09:14:29   Temp           0
2020-11-02 09:18:32   velocity       0
2020-11-02 09:32:42   velocity       4
2020-11-03 13:06:03   Temp           2
2020-11-03 13:10:01   Temp           1
2020-11-03 13:54:38   Temp           5
2020-11-03 14:46:25   velocity       5
2020-11-03 14:57:31   Kilometer      6
2020-11-03 15:07:07   Kilometer      7
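
For context, partitionBy encodes "channel_name" in the directory names rather than inside the Parquet files themselves, so the resulting S3 layout looks roughly like this (file names are illustrative):

    s3://path/channel_name=velocity/part-00000.snappy.parquet
    s3://path/channel_name=Temp/part-00000.snappy.parquet
    s3://path/channel_name=Kilometer/part-00000.snappy.parquet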

But when I read the same data back with Python, it's not working: the partitioned column "channel_name" is excluded from the result. Below is the code I tried with awswrangler.

    import awswrangler as wr
    df = wr.s3.read_parquet(path="s3://shreyasbigdata/Prod_test_item_id=V214944/")

The result looks like this, but I want the "channel_name" column as well.

start_timestamp       value
2020-11-02 08:51:50   1
2020-11-02 09:14:29   0
2020-11-02 09:18:32   0
2020-11-02 09:32:42   4
2020-11-03 13:06:03   2
2020-11-03 13:10:01   1
2020-11-03 13:54:38   5
2020-11-03 14:46:25   5
2020-11-03 14:57:31   6
2020-11-03 15:07:07   7

I tried different libraries, but it's not working. It would be great if you could help me read all the columns, including the partitioned one.


1 Answer


I got the answer, thank you:

    import s3fs
    import pyarrow.parquet as pq

    # s3fs gives pyarrow a filesystem interface to S3
    fs = s3fs.S3FileSystem()

    bucket = 'bucket_name'
    path = 'path_of_folder'  # if it's a directory, omit the trailing /
    bucket_uri = f's3://{bucket}/{path}'

    # ParquetDataset walks the partition directories and reads
    # every file under the prefix as a single logical table
    dataset = pq.ParquetDataset(bucket_uri, filesystem=fs)
    table = dataset.read()
    df = table.to_pandas()
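
ParquetDataset discovers the hive-style channel_name=... directories and reconstructs "channel_name" as a regular column. Alternatively, awswrangler can do the same; a minimal sketch, assuming the same S3 prefix as in the question:

    import awswrangler as wr

    # dataset=True makes awswrangler treat the prefix as a partitioned
    # dataset, so partition columns such as "channel_name" are rebuilt
    # from the directory names instead of being dropped
    df = wr.s3.read_parquet(path="s3://shreyasbigdata/Prod_test_item_id=V214944/", dataset=True)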
  • Found this answer at the following link: https://stackoverflow.com/questions/45082832/how-to-read-partitioned-parquet-files-from-s3-using-pyarrow-in-python – SSS Feb 10 '22 at 10:17