
I'm using repartitionByRange in PySpark to save more than 2,000 CSVs.

df.repartitionByRange(<no of unique values of col>, col).write\
        .option("sep", "|")\
        .option("header", "true")\
        .option("quote",  '"')\
        .option("escape", '"')\
        .option("nullValue", "null")\
        .option("quoteAll", "true")\
        .mode('overwrite')\
        .csv(path)

I then rename each partition file after the unique id of the column it contains. However, around 1-2% of the generated CSVs contain more than one unique id. How can I resolve this incorrect partitioning?
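The rename step described above can be sketched in plain Python (a hypothetical helper, not the asker's actual code; the directory layout and id column name are assumptions), which also surfaces the problem files:

```python
# Hypothetical sketch of the rename step: scan each output CSV for the
# distinct ids in one column and rename single-id files after that id.
# Assumes "|"-separated files with headers, matching the writer options above.
import csv
import os

def rename_by_id(out_dir, id_column):
    """Rename each part file to <id>.csv when it holds exactly one id."""
    multi_id_files = []
    for name in os.listdir(out_dir):
        if not name.endswith(".csv"):
            continue
        path = os.path.join(out_dir, name)
        with open(path, newline="") as f:
            reader = csv.DictReader(f, delimiter="|")
            ids = {row[id_column] for row in reader}
        if len(ids) == 1:
            os.rename(path, os.path.join(out_dir, f"{ids.pop()}.csv"))
        else:
            multi_id_files.append(name)  # the 1-2% problem files
    return multi_id_files
```

Any file returned by this helper contains rows for more than one id, which is the symptom being asked about.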

user4157124

1 Answer


If you want to control how your files are written, you must use the bucketBy method instead. repartition does not affect the contents of the saved files, only how the data is distributed for processing.

You can refer to this post for additional information: What is the difference between partitioning and bucketing in Spark?

So something like this:

df.write.format('parquet').bucketBy(10, 'column').saveAsTable('bucketed_table')

Note that bucketBy only works together with saveAsTable (Spark writes the buckets into a managed table); it is not supported with a plain .save(path) or .csv(path).
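The idea behind bucketing can be sketched in plain Python (illustrative only; Spark actually routes rows with a Murmur3 hash of the bucket column, not Python's built-in hash()): every row goes to the bucket given by the hash of its key, so all rows sharing a value always land in the same bucket file.

```python
# Illustrative sketch of hash bucketing; hash() stands in for Spark's
# Murmur3 hash of the bucket column (an assumption for demonstration).
from collections import defaultdict

def bucket_rows(rows, key, num_buckets):
    """Route each row to a bucket by hashing its key column."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[hash(row[key]) % num_buckets].append(row)
    return dict(buckets)

# All rows sharing an id land in the same bucket, so a bucket file never
# splits one id across files.
rows = [{"id": i % 4, "value": i} for i in range(20)]
buckets = bucket_rows(rows, "id", 10)
```

The converse still holds, though: with fewer buckets than distinct ids, one bucket file can contain several ids, so the bucket count must be chosen with that in mind.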
Yoan B. M.Sc