Spark: subtract dataframes but preserve duplicate values

Question

Suppose I have two Spark SQL dataframes A and B. I want to subtract the items in B from the items in A while preserving duplicates from A.

I followed the instructions to use DataFrame.except() that I found in another StackOverflow question ("Spark: subtract two DataFrames"), but that function removes all duplicates from the original dataframe A.

As a conceptual example, if I have two dataframes:

words     = [the, quick, fox, a, brown, fox]
stopWords = [the, a]

then I want the output to be, in any order:

words - stopWords = [quick, brown, fox, fox]

I observed that the RDD function subtract() preserves the duplicates, but the Spark-SQL function except() removes duplicates in the resulting data frame. I don't understand why the except() output produces only unique values.

Here is a complete demonstration:

// ---------------------------------------------------------------
// EXAMPLE USING RDDs
// ---------------------------------------------------------------
var wordsRdd = sc.parallelize(List("the", "quick", "fox", "a", "brown", "fox"))
var stopWordsRdd = sc.parallelize(List("a", "the"))

var wordsWithoutStopWordsRdd = wordsRdd.subtract(stopWordsRdd)
wordsWithoutStopWordsRdd.take(10)
// res11: Array[String] = Array(quick, brown, fox, fox)

// ---------------------------------------------------------------
// EXAMPLE USING DATAFRAMES
// ---------------------------------------------------------------
var wordsDf = wordsRdd.toDF()
var stopWordsDf = stopWords.toDF()
var wordsWithoutStopWordsDf = wordsDf.except(stopWordsDf)

wordsWithoutStopWordsDf.show(10)
// +-----+
// |value|
// +-----+
// |  fox|
// |brown|
// |quick|
// +-----+

I want to preserve duplicates because I am generating frequency tables.

Any help would be appreciated.

I bet it is because it uses a `Set` instead of a `List`. Besides what you are looking for is a `join`. — Alberto Bonsanto, Apr 23 '17 at 00:23
@AlbertoBonsanto: A `join` performs an intersection, right? The result will be values that are in both `A` and `B`. I want the difference between them, not the intersection. — stackoverflowuser2010, Apr 23 '17 at 00:30
Doesn't `DataFrame.subtract()` do exactly what you asked? Or was it not available in Scala? — Bikash Gyawali, May 06 '21 at 10:19

score 1 · Accepted Answer · answered Apr 23 '17 at 00:31

1

val words = sc.parallelize(List("the", "quick", "fox", "a", "brown", "fox")).toDF("id")
val stopwords = sc.parallelize(List("a", "the")).toDF("id")


words.join(stopwords, words("id") === stopwords("id"), "left_outer")
     .where(stopwords("id").isNull)
     .select(words("id")).show()

The output is:

+-----+
|   id|
+-----+
|  fox|
|  fox|
|brown|
|quick|
+-----+

answered Apr 23 '17 at 00:31

Alberto Bonsanto

17,556
10
64
93

That left outer join approach is really slick. Thanks. – stackoverflowuser2010 Apr 23 '17 at 04:15
Do you have a different idea in case we have mutli columns? – Moustafa Mahmoud Sep 02 '18 at 08:26

Oliver Longhi · Answer 2 · 2023-03-29T06:22:48.167

1

You can try to use exceptAll:

https://hyukjin-spark.readthedocs.io/en/latest/reference/api/pyspark.sql.DataFrame.exceptAll.html

For example:

val words = sc.parallelize(List("the", "quick", "fox", "a", "brown", "fox")).toDF("id")
val stopwords = sc.parallelize(List("a", "the")).toDF("id")

words.exceptAll(stopwords).show()

Output should be:

+-----+
|   id|
+-----+
|quick|
|  fox|
|  fox|
|brown|
+-----+

edited Mar 29 '23 at 06:22

answered Mar 29 '23 at 06:17

Oliver Longhi

11
3

Spark: subtract dataframes but preserve duplicate values

2 Answers2

Linked