
I am trying to find information on the ordering of rows in an RDD. Here is what I am trying to do:

Rdd1, Rdd2 
Rdd3 = Rdd1.union(Rdd2);

In Rdd3, is there any guarantee that the Rdd1 records will appear first and the Rdd2 records afterwards? In my tests I saw this behavior, but I wasn't able to find it in any docs.

Just FYI, I do not care about the ordering within each RDD (i.e. the order of Rdd1's or Rdd2's own data is not a concern); the only requirement is that, after the union, Rdd1's records come first.

Joy Rex
Abhishek

1 Answer


In Spark, the elements within a particular partition are unordered; however, the partitions themselves are ordered. See http://spark.apache.org/docs/latest/programming-guide.html#background

If you check your RDD3, you should find that it is just all the partitions of RDD1 followed by all the partitions of RDD2, so in this case the results happen to be ordered the way you want. Simply concatenating the partitions of the two RDDs is Spark's standard behaviour, as discussed in the question "In Apache Spark, why does RDD.union not preserve the partitioner?"
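To make the concatenation described above concrete, here is a minimal plain-Python sketch. It is only a model, not Spark code: an "RDD" is represented as a list of partitions (each a list of records), and `union_partitions` and `collect` are hypothetical helper names introduced for illustration.

```python
from typing import List, Sequence

Partition = List[int]


def union_partitions(first: Sequence[Partition],
                     second: Sequence[Partition]) -> List[Partition]:
    """Model of the behaviour described above: keep every input partition
    unchanged, listing the partitions of `first` before those of `second`."""
    return [list(p) for p in first] + [list(p) for p in second]


def collect(partitions: Sequence[Partition]) -> List[int]:
    """Flatten the partitions in partition order, like reading them one by one."""
    return [record for partition in partitions for record in partition]


# Two toy "RDDs", each with two partitions (assumed sample data).
rdd1 = [[1, 2], [3]]
rdd2 = [[10], [20, 30]]

rdd3 = union_partitions(rdd1, rdd2)
print(len(rdd3))      # number of partitions in the union
print(collect(rdd3))  # records of rdd1 first, then records of rdd2
```

In this model the union has as many partitions as both inputs combined, and no records move between partitions. On a real RDD you could inspect the same layout with `rdd3.glom().collect()`, which returns the contents of each partition as a separate list.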

So in this case, it appears that union will give you what you want. However, this behaviour is an implementation detail of union, not part of its interface definition, so you cannot rely on it: union could be reimplemented with different behaviour in the future.

mattinbits