
I have two PySpark DataFrames:

|  A  |  B  |  C  |
| 21  | 999 | 1000|
| 22  | 786 | 1978|
| 23  | 345 | 1563|

and

|  A  |  D  |  E  |
| 21  | aaa | a12 |
| 22  | bbb | b43 |
| 23  | ccc | h67 |

Desired result:

|  A  |  B  |  C  |  E  |
| 21  | 999 | 1000| a12 |
| 22  | 786 | 1978| b43 |
| 23  | 345 | 1563| h67 |

I tried using `join`, including `df1.join(df2.E, df1.A == df2.A)`, but could not get this result.

Qubix
  • Possible duplicate of [pandas: merge (join) two data frames on multiple columns](https://stackoverflow.com/questions/41815079/pandas-merge-join-two-data-frames-on-multiple-columns) – Sotos Nov 14 '18 at 14:33
  • `df1.join(df2, "A").select("A", "B", "C", "E")` – shay__ Nov 14 '18 at 14:36

2 Answers


I think this code does what you want:

joinedDF = df1.join(df2.select('A', 'E'), ['A'])
Ali AzG

When you join two DataFrames with the `join` method, it takes up to three arguments:

  1. arg-1 (`other`): the other DataFrame you want to join with.
  2. arg-2 (`on`): the column name(s) or join expression used to match rows.
  3. arg-3 (`how`): the type of join to perform. It defaults to an inner join.

Here is a sample:

df1.join(df2, df1.id == df2.id, 'outer')

You can find more details here.

Neeraj Bhadani