Join a dataframe with a column from another, based on a common column

Question

I have two pyspark dataframes:

|  A  |  B  |  C  |
| 21  | 999 | 1000|
| 22  | 786 | 1978|
| 23  | 345 | 1563|

and

|  A  |  D  |  E  |
| 21  | aaa | a12 |
| 22  | bbb | b43 |
| 23  | ccc | h67 |

Desired result:

|  A  |  B  |  C  |  E  |
| 21  | 999 | 1000| a12 |
| 22  | 786 | 1978| b43 |
| 23  | 345 | 1563| h67 |

I tried using join, including df1.join(df2.E, df1.A == df2.A), to no avail.

Possible duplicate of [pandas: merge (join) two data frames on multiple columns](https://stackoverflow.com/questions/41815079/pandas-merge-join-two-data-frames-on-multiple-columns) — Sotos, Nov 14 '18 at 14:33

score 3 · Accepted Answer · answered Nov 14 '18 at 14:36

I think this code does what you want:

joinedDF = df1.join(df2.select('A', 'E'), ['A'])

Ali AzG

score 3 · Answer 2 · answered Nov 19 '18 at 14:07

When you join two DataFrames with the join function, it takes up to three arguments: the other DataFrame, the join condition, and the join type.

Sample code:

df1.join(df2, df1.id == df2.id, 'outer')

You can find more details here.

Regards,

Neeraj
