
I have a collection of TSV files on Azure blob storage that I need to split based on the ID of the record.

For example, the record format is:

|ID|Name|Address   |
|--|----|----------|
|34|Stephen|A House|

I want to split on ID and store all records for each ID in its own file, e.g. 34.csv.
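To make the desired result concrete, here is a minimal single-machine sketch in plain Python of the output I am after. It is not a distributed solution and would not cope with the real data volume, since it holds one input file in memory. The function name `split_tsv_by_id` and its parameters are made up for illustration, and it assumes each TSV has a header row with an `ID` column.

```python
import csv
import os
from collections import defaultdict


def split_tsv_by_id(tsv_path, out_dir, id_column="ID"):
    """Write each record of a TSV file to <out_dir>/<ID>.csv, grouped by ID.

    Hypothetical helper: assumes a header row containing `id_column`
    and that one input file fits in memory.
    """
    os.makedirs(out_dir, exist_ok=True)
    rows_by_id = defaultdict(list)

    # Read the tab-separated input and bucket the rows by their ID value.
    with open(tsv_path, newline="") as src:
        reader = csv.DictReader(src, delimiter="\t")
        fieldnames = reader.fieldnames
        for row in reader:
            rows_by_id[row[id_column]].append(row)

    # Append each bucket to its own comma-separated file, so calling this
    # once per input file accumulates all records for an ID in one place.
    for record_id, rows in rows_by_id.items():
        out_path = os.path.join(out_dir, f"{record_id}.csv")
        is_new = not os.path.exists(out_path)
        with open(out_path, "a", newline="") as dst:
            writer = csv.DictWriter(dst, fieldnames=fieldnames)
            if is_new:
                writer.writeheader()  # header only when the file is first created
            writer.writerows(rows)

    # Return the IDs seen in this file, sorted as strings.
    return sorted(rows_by_id)
```

Buffering rows per ID before writing means each output file is opened once per input file rather than once per row, which matters with up to 80k IDs.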

To clarify, the data has millions of rows with up to 80k distinct IDs. The solution outlined in "Write to multiple outputs by key Spark - one Spark job" is too slow: it takes well over an hour to process roughly 80 million rows.

Stephen

0 Answers