
I was curious whether there is an easy way to keep an identifying ID with the output of exceptAll in PySpark. For example, suppose I have two dataframes (DF1, DF2), both with an ID column and another column "A". I want to keep the rows in DF1 whose value of "A" does not appear in DF2, so essentially I am trying to keep the identifier alongside the usual output of exceptAll. The tables below show the ideal output.

DF1                DF2                Ideal output
id | A             id | A             id | A
1  | Dog           4  | Dog           2  | Cat
2  | Cat           5  | Elmo
3  | Dog

Cheers!

zmag

2 Answers


You are probably looking for a leftanti join in Spark:

df1 = spark.createDataFrame([
  [1, "Dog"],
  [2, "Cat"],
  [3, "Dog"]
], ["id", "A"])

df2 = spark.createDataFrame([
  [4, "Dog"],
  [5, "Elmo"]
], ["id", "A"])

# Keep only the rows of df1 whose A value has no match in df2
df1.join(df2, ["A"], "leftanti").show()

# +---+---+
# |  A| id|
# +---+---+
# |Cat|  2|
# +---+---+
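Note that joining on A moves that column to the front. If you want id first again (a purely cosmetic tweak, assuming that ordering matters to you), just reselect the columns:

# Restore the original column order after the join
df1.join(df2, ["A"], "leftanti").select("id", "A").show()

# +---+---+
# | id|  A|
# +---+---+
# |  2|Cat|
# +---+---+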
abiratsis

PySpark's dataframe method subtract should give you what you want. See Spark: subtract two DataFrames for more details.

Using exceptAll will not give the results you want: because exceptAll keeps duplicates, it would retain the second dog entry in df1 (a quick demonstration follows the dataframe definitions below).

Given your dataframes:

df1 = spark.createDataFrame([{'id': 1, 'A': 'dog'},
                             {'id': 2, 'A': 'cat'},
                             {'id': 3, 'A': 'dog'}])

df2 = spark.createDataFrame([{'id': 4, 'A': 'dog'},
                             {'id': 5, 'A': 'elmo'}])
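To see the duplicate issue mentioned above, here is a quick sketch of what exceptAll alone produces on the A column (row order from show() may vary):

df1.select('A').exceptAll(df2.select('A')).show()

# One "dog" survives, since exceptAll only removes as many
# occurrences as df2 contains:
# +---+
# |  A|
# +---+
# |cat|
# |dog|
# +---+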

Use subtract on the column of interest (i.e. A), then join the result back to the original dataframe to get the remaining columns (i.e. id).

except_df = df1.select('A').subtract(df2.select('A'))

except_df.join(df1, on='A').show()

+---+---+
|  A| id|
+---+---+
|cat|  2|
+---+---+
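Since subtract returns distinct values, the join back recovers every df1 row whose A value survived, together with its id. If you prefer, the two steps can be chained into a single expression (same logic, just inlined):

# Same result, chained into one expression
df1.join(df1.select('A').subtract(df2.select('A')), on='A').show()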
cylim