I have two dataframes like the following:
DF1:
Id | field_A | field_B | field_C | field_D
1 | cat | 12 | black | 11
2 | dog | 128 | white | 19
3 | dog | 35 | yellow | 20
4 | dog | 21 | brown | 4
5 | bird | 10 | blue | 7
6 | cow | 99 | brown | 34
DF2:
Id | field_B | field_C | field_D | field_E
3 | 35 | yellow | 20 | 123
5 | 10 | blue | 7 | 454
6 | 99 | brown | 34 | 398
After a left join of DF1 with DF2, I'm hoping to get the following dataframe:
Id | field_A | field_B | field_C | field_D | field_E
1 | cat | 12 | black | 11 |
2 | dog | 128 | white | 19 |
3 | dog | 35 | yellow | 20 | 123
4 | dog | 21 | brown | 4 |
5 | bird | 10 | blue | 7 | 454
6 | cow | 99 | brown | 34 | 398
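To pin down what I mean by a left join here, below is a minimal plain-Python sketch of the semantics I expect (no Spark involved). The helper name `left_join` and the rows-as-dicts representation are only for illustration, and it assumes each (field_B, field_C, field_D) combination appears at most once in DF2:

```python
# Plain-Python sketch of the left-join semantics I expect (not Spark code).
KEYS = ("field_B", "field_C", "field_D")

df1_rows = [
    {"Id": 1, "field_A": "cat", "field_B": 12, "field_C": "black", "field_D": 11},
    {"Id": 2, "field_A": "dog", "field_B": 128, "field_C": "white", "field_D": 19},
    {"Id": 3, "field_A": "dog", "field_B": 35, "field_C": "yellow", "field_D": 20},
    {"Id": 4, "field_A": "dog", "field_B": 21, "field_C": "brown", "field_D": 4},
    {"Id": 5, "field_A": "bird", "field_B": 10, "field_C": "blue", "field_D": 7},
    {"Id": 6, "field_A": "cow", "field_B": 99, "field_C": "brown", "field_D": 34},
]

df2_rows = [
    {"Id": 3, "field_B": 35, "field_C": "yellow", "field_D": 20, "field_E": 123},
    {"Id": 5, "field_B": 10, "field_C": "blue", "field_D": 7, "field_E": 454},
    {"Id": 6, "field_B": 99, "field_C": "brown", "field_D": 34, "field_E": 398},
]


def left_join(left, right, keys):
    """Keep every left row; attach field_E from the matching right row, else None."""
    # Assumption: each key combination occurs at most once on the right side.
    lookup = {tuple(r[k] for k in keys): r for r in right}
    joined = []
    for row in left:
        match = lookup.get(tuple(row[k] for k in keys))
        joined.append({**row, "field_E": match["field_E"] if match else None})
    return joined


expected = left_join(df1_rows, df2_rows, KEYS)
for row in expected:
    print(row)
```

In other words, every row of DF1 should survive, and field_E should be empty wherever DF2 has no matching row.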
But I'm getting the following dataframe:
Id | field_A | field_B | field_C | field_D | field_E
3 | dog | 35 | yellow | 20 | 123
5 | bird | 10 | blue | 7 | 454
6 | cow | 99 | brown | 34 | 398
I'm using the following syntax:
new_df = df1.join(df2, on=['field_B', 'field_C', 'field_D'], how='left_outer')
I'm working with Spark 2.2. Can anyone please tell me why this is happening? Thanks!