1

How can I perform filter operation on Dataframe1 using Dataframe2. I want to remove rows from DataFrame1 for below matching condition

Dataframe1.col1 = Dataframe2.col1
Dataframe1.col2 = Dataframe2.col2

My question is different than substract two dataframes because while substract we use all columns but in my question I want to use limited number of columns

GPopat
  • 445
  • 4
  • 14
  • My question is different than substract two dataframes because while substract we use all columns but in my question I want to use limited number of columns – GPopat Aug 21 '19 at 07:44

2 Answers2

3

join with "left_anti"

scala> df1.show
+----+-----+-----+
|col1| col2| col3|
+----+-----+-----+
|   1|  one|   ek|
|   2|  two|  dho|
|   3|three|theen|
|   4| four|chaar|
+----+-----+-----+


scala> df2.show
+----+----+-----+
|col1|col2| col3|
+----+----+-----+
|   2| two|  dho|
|   4|four|chaar|
+----+----+-----+


scala> df1.join(df2, Seq("col1", "col2"), "left_anti").show
+----+-----+-----+
|col1| col2| col3|
+----+-----+-----+
|   1|  one|   ek|
|   3|three|theen|
+----+-----+-----+
C.S.Reddy Gadipally
  • 1,718
  • 11
  • 22
1

Possible duplicate of :Spark: subtract two DataFrames if both datasets have exact same coulmns

If you want custom join condition then you can use "anti" join. Here is the pysaprk version

Creating two data frames:

Dataframe1 :

l1 = [('col1_row1', 10), ('col1_row2', 20), ('col1_row3', 30)
df1 = spark.createDataFrame(l1).toDF('col1','col2')

df1.show()
+---------+----+
|     col1|col2|
+---------+----+
|col1_row1|  10|
|col1_row2|  20|
|col1_row3|  30|
+---------+----+

Dataframe2 :

l2 = [('col1_row1', 10), ('col1_row2', 20), ('col1_row4', 40)]
df2 = spark.createDataFrame(l2).toDF('col1','col2')
df2.show()
+---------+----+
|     col1|col2|
+---------+----+
|col1_row1|  10|
|col1_row2|  20|
|col1_row4|  40|
+---------+----+

Using subtract api :

df_final = df1.subtract(df2)
df_final.show()
+---------+----+
|     col1|col2|
+---------+----+
|col1_row3|  30|
+---------+----+

Using left_anti :

Join condition:

join_condition = [df1["col1"] == df2["col1"], df1["col2"] == df2["col2"]]

Join finally

df_final = df1.join(df2, join_condition, 'left_anti')
df_final.show()
+---------+----+
|     col1|col2|
+---------+----+
|col1_row3|  30|
+---------+----+
SMaZ
  • 2,515
  • 1
  • 12
  • 26