
I have a huge dataset of partitioned Parquet files stored in AWS S3, and I want to read only a sample from each month of data using AWS EMR. For each month, I have to filter the data by a "user_id" column, selecting, for example, data from 100,000 users (out of millions), and write the aggregations back to S3.

I figured out how to read from and write to S3 using EMR clusters, but I only tested it on a very small dataset. For the real dataset, I need to filter the data to be able to process it. How can I do this using PySpark?
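
Roughly what I have working on the small test set (the bucket names, partition column and aggregation below are just placeholders):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("monthly-aggregation").getOrCreate()

# Placeholder bucket, partition layout and aggregation
events = spark.read.parquet("s3://my-bucket/events/")          # partitioned Parquet dataset

monthly = (
    events.filter(F.col("month") == "2019-10")                 # one month at a time
          .groupBy("user_id")
          .agg(F.count("*").alias("n_events"))
)

monthly.write.mode("overwrite").parquet("s3://my-bucket/aggregations/month=2019-10/")
```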

RafaJM
  • Are the 100,000 users you want to filter on stored in a collection or a DataFrame, or do you just want some random users? – LizardKing Nov 19 '19 at 15:32
  • I actually have them stored not as users but as events performed by users. I have already written a script to transform this event-level file into a user-level table, but now I am facing the issue of reading from a partitioned Parquet file. – RafaJM Nov 19 '19 at 16:27

1 Answer


Spark has multiple sampling transformations. df.sample(...) is the one you want in your case. See this answer.
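
For example, something along these lines (the path, partition column and sampling fraction are illustrative, not from your question):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

events = spark.read.parquet("s3://my-bucket/events/")            # placeholder path
one_month = events.filter(F.col("month") == "2019-10")           # placeholder partition column

# Keep roughly 1% of the month's rows; the seed is only for reproducibility
sampled = one_month.sample(withReplacement=False, fraction=0.01, seed=42)
```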

If you need an exact number of results back, you have to (a) over-sample by a little and then (b) use df.limit() to get the exact number.
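
A sketch of that pattern, assuming a target of exactly 100,000 rows per month (the ~10% over-sampling margin and the path are arbitrary choices):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
one_month = spark.read.parquet("s3://my-bucket/events/month=2019-10/")  # placeholder path

target = 100_000
total = one_month.count()                        # one pass over the month to get the row count
fraction = min(1.0, 1.1 * target / total)        # (a) over-sample by ~10%

exact_sample = (
    one_month.sample(withReplacement=False, fraction=fraction, seed=42)
             .limit(target)                      # (b) trim back to the exact number
)
```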

If you can deal with just a fraction, as opposed to an exact target count, you can save yourself the df.count() needed to turn that target into a sampling fraction and simply call df.sample(fraction=...) directly, as in the first example above.

Sim
  • Thanks, this answer is enough for what I asked, though my problem has evolved into a more specific version of it. Thanks anyway :) – RafaJM Nov 19 '19 at 16:26