
What would be the equivalent of this call in Spark version 2.2.1:

df.column_name.eqNullSafe(df2.column_2)

(The call works in 2.3.0, but in 2.2.1 it raises: TypeError: 'Column' object is not callable)

Here's an example for reproduction. I have a sample dataframe:

# +----+----+
# |  id| var|
# +----+----+
# |   1|   a|
# |   2|null|
# |null|   b|
# +----+----+

I need to deconstruct it, do a null-safe equals on a column to compare, then put it back together. This is the code that does that. (It can be pasted and run as is; it works in 2.3.0 and reproduces the error in 2.2.1.)

df = spark.createDataFrame(
    [
        ('1', 'a'),
        ('2', None),
        (None, 'b')
    ],
    ('id', 'var')
)


def get_condition(right, left):
    return right.id.eqNullSafe(left.id_2)


right_df = df.select(df.columns[:1])
left_df = df.filter(df.var.isNotNull()).withColumnRenamed('id', 'id_2')

result = right_df.join(left_df, get_condition(right_df, left_df), how='left')

result.select('id', 'var').show()

I'd like to modify the call inside get_condition so that eqNullSafe, or an equivalent, works in 2.2.1. (Note: I can't use pandas.)

Tibberzz
  • Alternatively this worked for me: return (right.id == left.id_2) | (right.id.isNull() & left.id_2.isNull()) – Tibberzz Jun 15 '18 at 18:33

1 Answer


eqNullSafe was added in Spark 2.3 (SPARK-20290), so you won't be able to use it in 2.2.
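For reference, eqNullSafe implements SQL's null-safe equality (the <=> operator, equivalently IS NOT DISTINCT FROM). A minimal pure-Python sketch of its per-value semantics, using None to stand in for SQL NULL:

```python
def eq_null_safe(a, b):
    """Null-safe equality, mirroring SQL's <=> / IS NOT DISTINCT FROM.

    Two NULLs compare equal, a NULL and a non-NULL compare unequal,
    and two non-NULL values use ordinary equality.
    """
    if a is None and b is None:
        return True
    if a is None or b is None:
        return False
    return a == b

print(eq_null_safe(None, None))  # True (plain NULL = NULL would yield NULL)
print(eq_null_safe("1", None))   # False
print(eq_null_safe("1", "1"))    # True
```

This is exactly why the plain `==` join condition in the question drops the (null, null) pairing unless the null checks are added explicitly.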

There are different alternatives (SQL / DataFrame) available:

  • id1 IS NOT DISTINCT FROM id2 / expr("id1 IS NOT DISTINCT FROM id2") (Spark 2.2 or later)
  • ((id1 IS NULL) AND (id2 IS NULL)) OR (id1 = id2) / (col("id1").isNull() & col("id2").isNull()) | (col("id1") == col("id2"))

where the first one should be preferred when available.

See Including null values in an Apache Spark Join

Alper t. Turker