
I have two dataframes like the following:

DF1:

Id | field_A | field_B | field_C | field_D
 1 |   cat   |  12     |   black | 11
 2 |   dog   | 128     |   white | 19
 3 |   dog   |  35     |  yellow | 20
 4 |   dog   |  21     |   brown |  4
 5 |  bird   |  10     |    blue |  7
 6 |   cow   |  99     |   brown | 34

DF2:

Id | field_B | field_C | field_D | field_E
 3 |  35     |  yellow | 20      |   123   
 5 |  10     |    blue |  7      |   454   
 6 |  99     |   brown | 34      |   398   

After a left join, I'm hoping to get the following dataframe:

Id | field_A | field_B | field_C | field_D | field_E
 1 |   cat   |  12     |   black | 11      |
 2 |   dog   | 128     |   white | 19      |
 3 |   dog   |  35     |  yellow | 20      |  123
 4 |   dog   |  21     |   brown |  4      |  
 5 |  bird   |  10     |    blue |  7      |  454
 6 |   cow   |  99     |   brown | 34      |  398

But I'm getting the following dataframe:

Id | field_A | field_B | field_C | field_D | field_E 
 3 |   dog   |  35     |  yellow | 20      |  123
 5 |  bird   |  10     |    blue |  7      |  454
 6 |   cow   |  99     |   brown | 34      |  398

I'm using the following syntax:

new_df = df1.join(df2, on=['field_B', 'field_C', 'field_D'], how='left_outer')

I'm working on Spark 2.2. Can anyone please tell me why this is happening? Thanks!

  • Is it possible that the df names are wrong? Your syntax looks correct. Try this: new_df = df2.join(df1, on=['field_B', 'field_C', 'field_D'], how='left_outer') – Ali Yesilli Oct 21 '18 at 19:08
  • Thanks Ali! I tried all kinds of `outer` joins and am still getting the same result. The variable names are correct; otherwise it would have thrown an error. – Beta Oct 21 '18 at 19:13
  • Can you confirm that the order of df1 and df2 is the same as stated in the question? At first glance the order looks reversed. To check, try passing how='outer'; if that works, just swap df1 and df2 in your statement. – Pintu Oct 22 '18 at 05:10
  • The order of columns is not really important in Spark, and you can always change them to whatever you'd like. In your case, just do `new_df = new_df.select("id", *["field_"+c for c in ["A", "B", "C", "D", "E"]])` after the `join`. – pault Oct 22 '18 at 15:11

2 Answers


The code snippet you shared works completely fine. It looks like the order of the join is not the same as you stated in your question. Try

df2.join(df1, on=['field_B', 'field_C', 'field_D'], how='left')

If that works, the order of your dataframes is different from what you have shown here.
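
As a quick sanity check, here is a minimal sketch of the same left join on a couple of the rows above (this assumes a local SparkSession; df2's duplicate Id column is dropped so the result keeps a single Id):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

df1 = spark.createDataFrame(
    [(1, "cat", 12, "black", 11),
     (3, "dog", 35, "yellow", 20)],
    ["Id", "field_A", "field_B", "field_C", "field_D"])

df2 = spark.createDataFrame(
    [(3, 35, "yellow", 20, 123)],
    ["Id", "field_B", "field_C", "field_D", "field_E"])

# A left join keeps every df1 row; Ids with no match get null in field_E.
df1.join(df2.drop("Id"),
         on=["field_B", "field_C", "field_D"],
         how="left_outer").show()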

Pintu

Based on this answer, you have to use the unionDFs function from the link below and then group by Id: https://stackoverflow.com/a/50665187/6645617
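
Note that unionDFs is not a built-in; it is the helper defined in the linked answer. As a rough sketch of that approach (my own approximation, not the exact code behind the link), it unions the two dataframes by column name, padding each side with null columns for the fields it lacks:

from pyspark.sql.functions import lit

def unionDFs(a, b):
    # Sketch of the linked answer's helper: union by column name,
    # adding null columns (cast to the other side's type) where a field is missing.
    cols = a.columns + [c for c in b.columns if c not in a.columns]
    for c in cols:
        if c not in a.columns:
            a = a.withColumn(c, lit(None).cast(b.schema[c].dataType))
        if c not in b.columns:
            b = b.withColumn(c, lit(None).cast(a.schema[c].dataType))
    return a.select(cols).union(b.select(cols))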

and here is a sample code to do this:

from pyspark.sql.functions import first

unionDF = unionDFs(DF1, DF2)

# ignorenulls=True so the non-null value wins for Ids present in both frames
firstCols = [first(col, ignorenulls=True).alias(col) for col in unionDF.columns[1:]]

finalDF = unionDF.groupby("Id").agg(*firstCols)
Ali AzG