
I have two dataframes like the following:

DF1:

Id | field_A | field_B | field_C | field_D
 1 |   cat   |  12     |   black | 11
 2 |   dog   | 128     |   white | 19
 3 |   dog   |  35     |  yellow | 20
 4 |   dog   |  21     |   brown |  4
 5 |  bird   |  10     |    blue |  7
 6 |   cow   |  99     |   brown | 34

DF2:

Id | field_B | field_C | field_D | field_E
 3 |  35     |  yellow | 20      |   123   
 5 |  10     |    blue |  7      |   454   
 6 |  99     |   brown | 34      |   398   

After a left join, I'm hoping to get the following dataframe:

Id | field_A | field_B | field_C | field_D | field_E
 1 |   cat   |  12     |   black | 11      |
 2 |   dog   | 128     |   white | 19      |
 3 |   dog   |  35     |  yellow | 20      |  123
 4 |   dog   |  21     |   brown |  4      |  
 5 |  bird   |  10     |    blue |  7      |  454
 6 |   cow   |  99     |   brown | 34      |  398

But I'm getting the following dataframe:

Id | field_A | field_B | field_C | field_D | field_E 
 3 |   dog   |  35     |  yellow | 20      |  123
 5 |  bird   |  10     |    blue |  7      |  454
 6 |   cow   |  99     |   brown | 34      |  398

I'm using the following syntax:

new_df = df1.join(df2, on=['field_B', 'field_C', 'field_D'], how='left_outer')

I'm working on Spark 2.2. Can anyone please tell me why this is happening? Thanks!

  • Is it possible that the df names are wrong? Your syntax looks correct. Try this: new_df = df2.join(df1, on=['field_B', 'field_C', 'field_D'], how='left_outer') – Ali Yesilli Oct 21 '18 at 19:08
  • Thanks Ali! I tried all kinds of `outer` joins and am still getting the same result. The variable names are correct; otherwise it would have thrown an error. – Beta Oct 21 '18 at 19:13
  • Can you confirm that the order of df1 and df2 is the same as stated in the question? At first glance the order looks reversed. To check, try passing how='outer'; if that works, just swap df1 and df2 in your statement. – Pintu Oct 22 '18 at 05:10
  • The order of columns is not really important in Spark, and you can always change them to whatever you'd like. In your case, just do `new_df = new_df.select("id", *["field_"+c for c in ["A", "B", "C", "D", "E"]])` after the `join`. – pault Oct 22 '18 at 15:11

2 Answers


The code snippet you shared works completely fine. It looks like the order of the join is not the same as you stated in your question. Try

df2.join(df1, on=['field_B', 'field_C', 'field_D'], how='left')

If that works, the order of your dataframes is different from what you have shown here.
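
As a quick sanity check, here is a minimal sketch of the same left join on a couple of the rows above (this assumes a local SparkSession; df2's duplicate Id column is dropped so the result keeps a single Id):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

df1 = spark.createDataFrame(
    [(1, "cat", 12, "black", 11),
     (3, "dog", 35, "yellow", 20)],
    ["Id", "field_A", "field_B", "field_C", "field_D"])

df2 = spark.createDataFrame(
    [(3, 35, "yellow", 20, 123)],
    ["Id", "field_B", "field_C", "field_D", "field_E"])

# A left join keeps every df1 row; Ids with no match get null in field_E.
df1.join(df2.drop("Id"),
         on=["field_B", "field_C", "field_D"],
         how="left_outer").show()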

Pintu

Based on this answer, you have to use the unionDFs function from the link below and then group by Id: https://stackoverflow.com/a/50665187/6645617
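
Note that unionDFs is not a built-in; it is the helper defined in the linked answer. As a rough sketch of that approach (my own approximation, not the exact code behind the link), it unions the two dataframes by column name, padding each side with null columns for the fields it lacks:

from pyspark.sql.functions import lit

def unionDFs(a, b):
    # Sketch of the linked answer's helper: union by column name,
    # adding null columns (cast to the other side's type) where a field is missing.
    cols = a.columns + [c for c in b.columns if c not in a.columns]
    for c in cols:
        if c not in a.columns:
            a = a.withColumn(c, lit(None).cast(b.schema[c].dataType))
        if c not in b.columns:
            b = b.withColumn(c, lit(None).cast(a.schema[c].dataType))
    return a.select(cols).union(b.select(cols))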

and here is a sample code to do this:

from pyspark.sql.functions import first

unionDF = unionDFs(DF1, DF2)

# ignorenulls=True so the non-null value wins for Ids present in both frames
firstCols = [first(col, ignorenulls=True).alias(col) for col in unionDF.columns[1:]]

finalDF = unionDF.groupby("Id").agg(*firstCols)
Ali AzG