I have a collection of TSV files on Azure blob storage that I need to split based on the ID of the record.
For example, the record format is:
|ID|Name|Address |
|--|----|----------|
|34|Stephen|A House|
I want to split on ID and store all records for each ID in their own file, e.g. 34.csv.
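To make the desired output concrete, here is a minimal single-machine sketch in plain Python. It only illustrates the result I'm after (one file per ID) and is not a solution at this scale. The function name `split_tsv_by_id`, the temporary output directory, and the second sample ID are made up for illustration, and it assumes the TSV has a header row.

```python
import csv
import io
import os
import tempfile
from collections import defaultdict


def split_tsv_by_id(tsv_text, out_dir, id_column="ID"):
    """Write one <ID>.csv file per distinct ID found in tsv_text.

    Hypothetical helper: groups everything in memory, so it only
    works for small inputs.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    fieldnames = reader.fieldnames

    # Collect the rows belonging to each ID.
    groups = defaultdict(list)
    for row in reader:
        groups[row[id_column]].append(row)

    # Write each group to its own file, named after the ID.
    paths = {}
    for record_id, rows in groups.items():
        path = os.path.join(out_dir, f"{record_id}.csv")
        with open(path, "w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
        paths[record_id] = path
    return paths


# Made-up sample data; only the first data row comes from the example above.
sample = (
    "ID\tName\tAddress\n"
    "34\tStephen\tA House\n"
    "35\tAnother\tB House\n"
    "34\tStephen\tC House\n"
)
out_dir = tempfile.mkdtemp()
written = split_tsv_by_id(sample, out_dir)
```

With this sample, `written` maps each ID to its file, and `34.csv` holds the header plus both rows for ID 34.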
To clarify, the data has millions of rows with up to 80k distinct IDs. The solution outlined in "Write to multiple outputs by key Spark - one Spark job" is too slow for this: it takes well over an hour to process roughly 80 million rows.