I have a CSV file with data in this format:
id, name, surname, morecolumns
5, John, Lok, more
2, John2, Lok2, more
1, John3, Lok3, more
etc.
I want to sort my CSV file using the id as the key and store the sorted result in another file.
Here is what I've done so far to create a JavaPairRDD of (id, rest_of_line) pairs:
SparkConf conf = new SparkConf().setAppName.....;
JavaSparkContext sc = new JavaSparkContext(conf);

JavaRDD<String> file = sc.textFile("inputfile.csv");

// extract the header and filter it out of the data lines
final String header = file.first();
JavaRDD<String> lines = file.filter(s -> !s.equals(header));

// create JavaPairs of (id, rest_of_line)
JavaPairRDD<Integer, String> pairRdd = lines.mapToPair(
    new PairFunction<String, Integer, String>() {
        public Tuple2<Integer, String> call(final String line) {
            // split on the first comma only: parts[0] is the id, parts[1] is the rest of the line
            String[] parts = line.split(",", 2);
            int id = Integer.parseInt(parts[0]);
            return new Tuple2<>(id, parts[1]);
        }
    });

// sort and save the output
pairRdd.sortByKey(true, 1);
pairRdd.coalesce(1).saveAsTextFile("sorted.csv");
This works for small files. However, with bigger files the output is not sorted properly. I think this happens because the sorting takes place on different nodes, so merging the results from all the nodes doesn't produce the expected output.
So, the question is: how can I sort my CSV file using the id as the key and store the sorted result in another file?
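
Update: one thing I'm now unsure about is that sortByKey is a transformation, so it returns a new, sorted RDD rather than sorting pairRdd in place; my snippet above discards that result and saves the original, unsorted pairRdd. Assuming that's the actual problem, a self-contained sketch of what I'd change looks roughly like this (the class name SortCsvById, the trim() call, and the map back to a CSV string are illustrative choices on my part, not from my original code):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SortCsvById {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SortCsvById");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> file = sc.textFile("inputfile.csv");

        // extract the header and filter it out of the data lines
        final String header = file.first();
        JavaRDD<String> lines = file.filter(s -> !s.equals(header));

        // key each line by its id (everything before the first comma)
        JavaPairRDD<Integer, String> pairRdd = lines.mapToPair(line -> {
            String[] parts = line.split(",", 2);
            return new Tuple2<>(Integer.parseInt(parts[0].trim()), parts[1]);
        });

        // sortByKey returns a NEW sorted RDD; capture it instead of discarding it.
        // With numPartitions = 1 everything already lands in a single partition,
        // so no coalesce(1) is needed before saving.
        JavaPairRDD<Integer, String> sorted = pairRdd.sortByKey(true, 1);

        // turn the pairs back into CSV lines and write them out
        sorted.map(t -> t._1() + "," + t._2()).saveAsTextFile("sorted.csv");

        sc.stop();
    }
}

Note that saveAsTextFile writes a directory (here sorted.csv) containing part files, so with a single partition the sorted rows end up in sorted.csv/part-00000; the header line is also dropped from the output, same as in my snippet above.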