
I'm currently trying to join two DataFrames together while retaining the row order of one of the DataFrames.

From "Which operations preserve RDD order?", it seems (correct me if this is inaccurate, because I'm new to Spark) that joins do not preserve order: since the data sits in different partitions, rows are joined and "arrive" at the final DataFrame in no particular order.

How could one perform a join of two DataFrames while preserving the order of one table?

E.g.,

+------------+---------+
| col1       | col2    |
+------------+---------+
| 0          | a       |
| 1          | b       |
+------------+---------+

joined with

+------------+---------+
| col2       | col3    |
+------------+---------+
| b          | x       |
| a          | y       |
+------------+---------+

on col2 should give

+------------+---------+----------+
| col1       | col2    | col3     |
+------------+---------+----------+
| 0          | a       | y        |
| 1          | b       | x        |
+------------+---------+----------+
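For reference, here is a minimal PySpark sketch of this setup (the session and DataFrame names are just placeholders I chose):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([(0, "a"), (1, "b")], ["col1", "col2"])
df2 = spark.createDataFrame([("b", "x"), ("a", "y")], ["col2", "col3"])

# A plain join returns the right rows, but their order is not guaranteed.
df1.join(df2, on="col2").show()
```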

I've heard some things about using `coalesce` or `repartition`, but I'm not sure. Any suggestions/methods/insights are appreciated.

Edit: would this be analogous to having one reducer in MapReduce? If so, what would that look like in Spark?

  • See this question: http://stackoverflow.com/questions/32882529/how-to-zip-twoor-more-dataframe-in-spark – Saif Charaniya Jun 28 '16 at 21:34
  • I don't think zip would work, because rows from table 2 should be joined to matching rows from table 1 while preserving order, not row 1 simply paired with row 1', etc.; the same objection applies to the indexing-and-joining approach. – jest jest Jun 28 '16 at 21:50
  • I just noticed from your example above that col2 is being used for the join condition. Is that what you want? – Saif Charaniya Jun 28 '16 at 22:03
  • Yes, col2 should be the join condition. I'm sorry if that was not clear, will edit the question. – jest jest Jun 28 '16 at 22:11
  • In that case, I expect Spark will maintain the order in the resulting dataframe. So if you perform `a.join(b, a.col2 == b.col2)`, then the resulting dataframe should be ordered by a. The order of the dataframe should only really matter if you perform a take or collect in Spark. If there's a natural order you want, then you could always just sort the dataframe. – Saif Charaniya Jun 28 '16 at 22:26
  • This is a really interesting question, and it seems like you should be able to save a shuffle operation if this were implemented as an argument for `join` – Jeff Jun 28 '16 at 23:06
  • I tried joining, and it looks like it might just be ordering by the key of `a`, not by `col1`. If this is true, then it makes sense, but I'm looking for an ordering based on something other than the key. – jest jest Jun 28 '16 at 23:53

1 Answer


It can't; a join does not preserve row order. What you can do is add a `monotonically_increasing_id` column to the DataFrame whose order you want to keep before the join, and then sort the joined result by that column.
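A minimal sketch of that approach in PySpark, reusing the hypothetical df1/df2 from the question (the `_row_id` column name is arbitrary):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import monotonically_increasing_id

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([(0, "a"), (1, "b")], ["col1", "col2"])
df2 = spark.createDataFrame([("b", "x"), ("a", "y")], ["col2", "col3"])

# Tag each row of df1 with an id that increases in df1's current row order.
df1_indexed = df1.withColumn("_row_id", monotonically_increasing_id())

# The join's shuffle may scramble the row order...
joined = df1_indexed.join(df2, on="col2")

# ...so restore df1's order explicitly, then drop the helper column.
result = joined.orderBy("_row_id").drop("_row_id")
result.show()
```

Note that the generated ids are not consecutive (they encode the partition id in their upper bits), but they are monotonically increasing in the DataFrame's current row order, which is all the final sort needs.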