I have a huge dataset of partitioned Parquet files stored in AWS S3, and I want to read only a sample from each month of data using AWS EMR. I need to filter each month's data by the column "user_id", selecting, for example, data from 100,000 users (out of millions), and write the aggregations back to S3.
I have figured out how to read from and write to S3 using EMR clusters, but I have only tested this on a very small dataset. For the real dataset, I need to filter the data down to the sampled users before processing it. How can I do this with PySpark?
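This is roughly the approach I am considering. The bucket paths, the partition column name ("month"), and the aggregation are placeholders, and I am not sure whether sampling the users with distinct().limit() plus a broadcast semi-join is the right way to do this at scale:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("monthly-user-sample").getOrCreate()

# Read the partitioned Parquet dataset; Spark discovers the partition columns automatically.
df = spark.read.parquet("s3://my-bucket/events/")  # placeholder path

# Pick a sample of user IDs, e.g. 100,000 distinct users drawn from the data itself.
# (If I already had the list of user IDs, I could load it into a DataFrame instead.)
sampled_users = (
    df.select("user_id")
      .distinct()
      .limit(100000)
)

# Keep only rows belonging to the sampled users via a broadcast semi-join,
# so the full dataset is not shuffled.
filtered = df.join(F.broadcast(sampled_users), on="user_id", how="left_semi")

# Placeholder aggregation per month and user.
aggregated = (
    filtered.groupBy("month", "user_id")
            .agg(F.count("*").alias("event_count"))
)

# Write the aggregations back to S3, partitioned by month.
aggregated.write.mode("overwrite").partitionBy("month").parquet("s3://my-bucket/aggregations/")
```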