I have the following two data frames, which have just one column each and exactly the same number of rows. How do I merge them so that I get a new data frame that has both columns, with the rows paired up in order? For example,

df1:

+-----+
| ColA|
+-----+
|    1|
|    2|
|    3|
|    4|
+-----+

df2:

+-----+
| ColB|
+-----+
|    5|
|    6|
|    7|
|    8|
+-----+

I want the result of the merge to be df3:

+-----+-----+
| ColA| ColB|
+-----+-----+
|    1|    5|
|    2|    6|
|    3|    7|
|    4|    8|
+-----+-----+

I don't quite see how I can do this with the join method, because each data frame has only one column and they share no key. Joining without any condition would produce a cartesian product of the two data frames.

Is there a direct Spark DataFrame API call to do this? In R data frames, I see that there is a merge function to merge two data frames. However, I don't know whether it is similar to join.

NewtoSO
  • I think you would have to create a new ID column on each DataFrame and then join on it, even if the order doesn't matter. This task has been addressed in this question (check all answers): http://stackoverflow.com/questions/29483498 – Daniel de Paula May 13 '16 at 20:38
  • Is there any way to do it without adding a new ID column to each DataFrame? – NewtoSO May 13 '16 at 21:40
  • Not really. The only alternative is `zip` on RDDs, and this has very strict requirements, which typically means you need an id-like column as well. – zero323 May 16 '16 at 22:29

0 Answers