I want to read a CSV file (less than 50 MB) with Spark and perform some join & filter operations. The rows in the CSV file are ordered by some criterion (Score
in this case). I want to save the results to a single CSV file in which the original row order is kept.
Input CSV file:
Id, Score
5, 100
3, 99
6, 98
7, 95
After some join & filter operations:
// Read the ordered CSV (header row included).
val data = spark.read.option("header", "true").csv("s3://some-bucket/some-dir/123.csv")

val results = data
  .dropDuplicates("some_col")                         // dropDuplicates takes column names as strings, not Columns
  .filter(x => ...)                                   // some row-level filter
  .join(anotherDataset, Seq("some_col"), "left_anti") // drop rows that also appear in anotherDataset

// Collapse to one partition so a single CSV part file is written.
results.repartition(1).write.option("header", "true").csv("...")
Expected output:
Id, Score
5, 100
6, 98
(IDs 3 and 7 are filtered out)
As Spark may load the data into multiple partitions, how can I keep the original order?
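One idea I had, though I am not sure it is guaranteed to work: tag each row with monotonically_increasing_id() right after the read, carry that column through the transformations, and sort by it before writing. A minimal sketch, assuming the generated ids follow the file's read order; _row_order is a made-up helper column name:

import org.apache.spark.sql.functions.monotonically_increasing_id

// Tag each row with an id that increases in read order
// (ids increase within a partition, with the partition index
// in the high bits) -- assuming this reflects the file order.
val data = spark.read.option("header", "true")
  .csv("s3://some-bucket/some-dir/123.csv")
  .withColumn("_row_order", monotonically_increasing_id())

// ... the same dropDuplicates / filter / join as above,
// keeping the _row_order column in the result ...

results
  .coalesce(1)                          // one partition => one output part file
  .sortWithinPartitions("_row_order")   // restore the read order inside that partition
  .drop("_row_order")                   // don't write the helper column
  .write.option("header", "true").csv("...")

Would this work, or is there a more idiomatic way to preserve the order?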