1

I have two pyspark dataframes - Dataframe A

+----+---+
|name| id|
+----+---+
|   a|  3|
|   b|  5|
|   c|  7|
+----+---+

Dataframe B

+----+---+
|name| id|
+----+---+
|   a|  3|
|   b| 10|
|   c| 13|
+----+---+

I want to subtract dataframe B from Dataframe A based on column id. So the result dataframe should be -

+----+---+
|name| id|
+----+---+
|   b|  5|
|   c|  7|
+----+---+

This is my code,

common = A.join(B, ['id'], 'leftsemi')
diff = A.subtract(common)
diff.show()

But it does not give expected result. Is there a simple way to achieve this which can subtract on dataframe from another based on one column value. Unable to find it. Thanks in advance

Codegator
  • 459
  • 7
  • 28
  • Does this answer your question? [Spark: subtract two DataFrames](https://stackoverflow.com/questions/29537564/spark-subtract-two-dataframes) – Lamanus Aug 22 '20 at 06:29
  • But this is subtract based on all rows. I am looking it only based on one column – Codegator Aug 22 '20 at 07:37

2 Answers2

3

If you want to compare the id column only, then use the left_anti join. This will give you the result that is in df1 but not in df2 by comparing the id column.

df1.join(df2, ['id'], 'left_anti').show(10, False)

+---+----+
|id |name|
+---+----+
|5  |b   |
|7  |c   |
+---+----+

subtract is all.

df1.subtract(df2).show(10, False)

+----+---+
|name|id |
+----+---+
|b   |5  |
|c   |7  |
+----+---+
Lamanus
  • 12,898
  • 4
  • 21
  • 47
3
A.subtract(B).orderBy("id").show()

+----+---+
|name| id|
+----+---+
|   b|  5|
|   c|  7|
+----+---+
VITTAL B
  • 31
  • 4