How to subtract DataFrames using subset of columns in Apache Spark

Question

How can I perform filter operation on Dataframe1 using Dataframe2. I want to remove rows from DataFrame1 for below matching condition

Dataframe1.col1 = Dataframe2.col1
Dataframe1.col2 = Dataframe2.col2

My question is different than substract two dataframes because while substract we use all columns but in my question I want to use limited number of columns

My question is different than substract two dataframes because while substract we use all columns but in my question I want to use limited number of columns — GPopat, Aug 21 '19 at 07:44

score 3 · Accepted Answer · answered Aug 20 '19 at 18:07

3

join with "left_anti"

scala> df1.show
+----+-----+-----+
|col1| col2| col3|
+----+-----+-----+
|   1|  one|   ek|
|   2|  two|  dho|
|   3|three|theen|
|   4| four|chaar|
+----+-----+-----+


scala> df2.show
+----+----+-----+
|col1|col2| col3|
+----+----+-----+
|   2| two|  dho|
|   4|four|chaar|
+----+----+-----+


scala> df1.join(df2, Seq("col1", "col2"), "left_anti").show
+----+-----+-----+
|col1| col2| col3|
+----+-----+-----+
|   1|  one|   ek|
|   3|three|theen|
+----+-----+-----+

answered Aug 20 '19 at 18:07

C.S.Reddy Gadipally

1,718
11
22

1

Thanks it works for me! Here is using Java Dataset filterDf = dataset1.join(dataset2,JavaConverters.asScalaIteratorConverter(Arrays.asList(new String[]{"col1", "col2"}).iterator()).asScala().toSeq(),"left_anti"); – GPopat Aug 21 '19 at 16:27
1

@Gaurang If possible, please edit the question subject from "How to filter data frames in spark" to more specific. It helps others with their specific need. – C.S.Reddy Gadipally Aug 21 '19 at 16:32
do you have any suggestion ? – GPopat Aug 21 '19 at 16:37
1

DataFrames subtract over a subset of columns? just like you have described in your description. – C.S.Reddy Gadipally Aug 21 '19 at 16:40
Thanks for suggestion..I change question title. – GPopat Aug 21 '19 at 16:43

SMaZ · Answer 2 · 2019-08-20T18:30:24.457

Possible duplicate of :Spark: subtract two DataFrames if both datasets have exact same coulmns

If you want custom join condition then you can use "anti" join. Here is the pysaprk version

Creating two data frames:

Dataframe1 :

l1 = [('col1_row1', 10), ('col1_row2', 20), ('col1_row3', 30)
df1 = spark.createDataFrame(l1).toDF('col1','col2')

df1.show()
+---------+----+
|     col1|col2|
+---------+----+
|col1_row1|  10|
|col1_row2|  20|
|col1_row3|  30|
+---------+----+

Dataframe2 :

l2 = [('col1_row1', 10), ('col1_row2', 20), ('col1_row4', 40)]
df2 = spark.createDataFrame(l2).toDF('col1','col2')
df2.show()
+---------+----+
|     col1|col2|
+---------+----+
|col1_row1|  10|
|col1_row2|  20|
|col1_row4|  40|
+---------+----+

Using subtract api :

df_final = df1.subtract(df2)
df_final.show()
+---------+----+
|     col1|col2|
+---------+----+
|col1_row3|  30|
+---------+----+

Using left_anti :

Join condition:

join_condition = [df1["col1"] == df2["col1"], df1["col2"] == df2["col2"]]

Join finally

df_final = df1.join(df2, join_condition, 'left_anti')
df_final.show()
+---------+----+
|     col1|col2|
+---------+----+
|col1_row3|  30|
+---------+----+

How to subtract DataFrames using subset of columns in Apache Spark

2 Answers2

Linked