I'm using `repartitionByRange` in PySpark to save more than 2,000 CSVs, one file per unique value of a column:
    # num_partitions = number of unique values of col
    (df.repartitionByRange(num_partitions, col)
       .write
       .option("sep", "|")
       .option("header", "true")
       .option("quote", '"')
       .option("escape", '"')
       .option("nullValue", "null")
       .option("quoteAll", "true")
       .mode("overwrite")
       .csv(path))
I then rename each output file after the unique id of the column value it contains. However, around 1-2% of the generated CSVs contain more than one unique id. How can I fix this incorrect partitioning so that each file holds exactly one id?
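For context, the renaming step looks roughly like this. This is a minimal sketch, assuming the output lands on the local filesystem; `rename_parts`, `out_dir`, and `col_index` are hypothetical names, and the file is read with the same `|` delimiter used in the writer above:

```python
import csv
import glob
import os


def rename_parts(out_dir: str, col_index: int = 0) -> list:
    """Rename each part-*.csv after the id found in its first data row."""
    renamed = []
    for part in sorted(glob.glob(os.path.join(out_dir, "part-*.csv"))):
        with open(part, newline="") as f:
            reader = csv.reader(f, delimiter="|")
            next(reader)            # skip the header row
            first_row = next(reader)
        uid = first_row[col_index]  # csv.reader unquotes the quoteAll fields
        target = os.path.join(out_dir, uid + ".csv")
        os.rename(part, target)
        renamed.append(target)
    return renamed
```

Note that this only looks at the first data row, so a part file that actually contains two ids (the 1-2% failure case described above) gets silently renamed after whichever id happens to come first.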