Spark 2.0.1
I have an RDD:
class MyClass { }

JavaRDD<MyClass> rdd = ...; // loaded via newAPIHadoopFile
Comparator<MyClass> comp;   // created elsewhere; needs to be serializable for Spark

// pair each element with a dummy value, sort by key, then save
rdd.mapToPair(l -> new Tuple2<>(l, ""))
   .sortByKey(comp)
   .mapToPair(l -> new Tuple2<>(l._1, NullWritable.get()))
   .saveAsNewAPIHadoopFile(somePath, AvroKey.class, NullWritable.class, SomeOutputFormat.class, hadoopConfiguration);
This all behaves in a way that is unusual to me. The application consists of two jobs, which look like this in the Spark UI:

The strange thing here is that in Job 0 we do newAPIHadoopFile -> map -> map -> sortByKey, and that is fine. But in Job 1 we do the same work again (Job1::stage1 is not skipped), and only then sort and save. Stage 2 takes a lot of time. Why is that happening? Is there a way to steer the execution plan toward something more optimal?
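
For reference, here is a workaround I am experimenting with: persisting the pair RDD before the sort. My assumption (not confirmed) is that Job 0 is the sampling pass sortByKey runs to compute its range-partition boundaries, and Job 1 then recomputes the whole lineage from the input for the actual shuffle; caching should let Job 1 reuse the already-mapped pairs instead. The identifiers below (somePath, SomeOutputFormat, hadoopConfiguration) are the same placeholders as above:

import org.apache.spark.storage.StorageLevel;

// cache the pairs so the second job can reuse them instead of
// re-reading and re-mapping the input
JavaPairRDD<MyClass, String> pairs =
    rdd.mapToPair(l -> new Tuple2<>(l, ""))
       .persist(StorageLevel.MEMORY_AND_DISK());

pairs.sortByKey(comp)
     .mapToPair(l -> new Tuple2<>(l._1, NullWritable.get()))
     .saveAsNewAPIHadoopFile(somePath, AvroKey.class, NullWritable.class,
                             SomeOutputFormat.class, hadoopConfiguration);

pairs.unpersist(); // free the cached blocks after the save completes

With this, Job1::stage1 should show up as skipped in the UI because the shuffle input comes from the cache, but I would still like to understand whether this is the expected behavior of sortByKey or a planning issue I can avoid.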