How to split a large text file into smaller files based on an ID column using PySpark

Asked Aug 08 '18 at 08:24

Active Aug 08 '18 at 15:02

Viewed 95 times

I have a collection of TSV files on Azure blob storage that I need to split based on the ID of the record.

For example, the record format is:

|ID|Name   |Address|
|--|-------|-------|
|34|Stephen|A House|

I want to split on ID and write all records for each ID to their own file, e.g. 34.csv.
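To make the desired result concrete, here is a minimal single-machine sketch in plain Python. It only illustrates the expected output (one CSV file per ID) and is not the PySpark solution being asked for, nor is it meant to scale to the full data set. The function name `split_by_id`, its parameters, and the paths are hypothetical.

```python
import csv
import os
from collections import defaultdict


def split_by_id(tsv_path, out_dir, id_column="ID"):
    """Write every row of a TSV file to <out_dir>/<ID>.csv, grouped by ID.

    Illustrative only: all rows are held in memory, so this is suitable
    for small samples, not for the full data set.
    """
    os.makedirs(out_dir, exist_ok=True)

    # Group rows by the value of the ID column.
    rows_by_id = defaultdict(list)
    with open(tsv_path, newline="") as src:
        reader = csv.DictReader(src, delimiter="\t")
        fieldnames = reader.fieldnames
        for row in reader:
            rows_by_id[row[id_column]].append(row)

    # Write one CSV file per distinct ID, each with its own header row.
    for record_id, rows in rows_by_id.items():
        out_path = os.path.join(out_dir, f"{record_id}.csv")
        with open(out_path, "w", newline="") as dst:
            writer = csv.DictWriter(dst, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)

    # Return the distinct IDs that were written, sorted as strings.
    return sorted(rows_by_id)
```

Calling `split_by_id("records.tsv", "out")` on a file containing the sample record above would produce `out/34.csv` holding the header plus every row whose ID is 34.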

To clarify, the data has millions of rows with up to 80k distinct IDs. The solution outlined in "Write to multiple outputs by key Spark - one Spark job" is too slow: it takes well over an hour to process roughly 80 million rows.

edited Aug 08 '18 at 15:02

asked Aug 08 '18 at 08:24

Stephen


0 Answers