
I have a parquet dataset in an S3 bucket that is partitioned by date, e.g. s3://path/folder/, where the partitions in the folder are:

PRE date=2019-11-19/
PRE date=2019-11-20/
PRE date=2019-11-21/
PRE date=2019-11-22/
PRE date=2019-11-23/
PRE date=2019-11-26/

Each partition has millions of rows. I want to process the dataset by reading each partition in a for loop and appending the resulting dataframe to another parquet dataset, also partitioned by date. None of the solutions I've looked up here address my specific use case, and the few that come close use something called boto, which I am not using.
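
Roughly, this is the pattern I'm after (a minimal sketch using pyarrow's dataset API; the bucket name, prefixes, and region below are placeholders, not my real values):

```python
import pyarrow.dataset as ds
import pyarrow.parquet as pq
from pyarrow import fs

# Placeholder bucket/prefixes and region -- swap in the real ones.
s3 = fs.S3FileSystem(region="us-east-1")
src_root = "my-bucket/folder"
dst_root = "my-bucket/output"

# Discover the date= partition directories under the source prefix.
infos = s3.get_file_info(fs.FileSelector(src_root, recursive=False))
dates = sorted(
    info.base_name.split("=", 1)[1]
    for info in infos
    if info.type == fs.FileType.Directory and info.base_name.startswith("date=")
)

# Open the whole hive-partitioned dataset once; "date" becomes a column.
dataset = ds.dataset(src_root, filesystem=s3, format="parquet", partitioning="hive")

for date in dates:
    # Read only the rows belonging to this one partition.
    table = dataset.to_table(filter=ds.field("date") == date)

    # ... transform `table` here ...

    # Append the result to the destination dataset, again partitioned by date.
    pq.write_to_dataset(
        table,
        root_path=dst_root,
        partition_cols=["date"],
        filesystem=s3,
    )
```

The point of the loop is that only one date's rows are ever in memory at a time, rather than the whole dataset.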

Any insight would be greatly appreciated. Thank you.

thentangler
  • Have you seen this thread: https://stackoverflow.com/questions/45043554/how-to-read-a-list-of-parquet-files-from-s3-as-a-pandas-dataframe-using-pyarrow – yasi Apr 07 '20 at 04:32
  • I did, but they are using pandas there and I am not. – thentangler Apr 07 '20 at 14:07
