I need to perform a left join in Spark 2.4.1 that keeps the null values.
While researching, I found this solution: "Including null values in an Apache Spark Join", which seems to be what I need. However, every time I call eqNullSafe I get the error "'Column' object is not callable".
I tried the example given in that link:
numbers_df = sc.parallelize([
    ("123", ), ("456", ), (None, ), ("", )
]).toDF(["numbers"])
letters_df = sc.parallelize([
    ("123", "abc"), ("456", "def"), (None, "zzz"), ("", "hhh")
]).toDF(["numbers", "letters"])
numbers_df.join(letters_df, numbers_df.numbers.eqNullSafe(letters_df.numbers))
Any idea why this code raises this error? I am using a SageMaker notebook on an AWS Glue developer endpoint. Could it be due to a missing import?
Apart from the Glue-specific ones, these are the imports I use:
from pyspark.sql import *
from pyspark.sql import functions as F
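
One possible explanation for the error message, offered as an assumption rather than a confirmed diagnosis: PySpark's Column treats unknown attribute names as nested-field access and returns another Column, so calling a method that the installed version does not define ends up calling a Column. The sketch below reproduces that mechanism in plain Python. The `Column` class here is a toy stand-in written for illustration, not the real `pyspark.sql.Column`.

```python
class Column:
    """Toy stand-in for pyspark.sql.Column (hypothetical, for illustration only)."""

    def __init__(self, name):
        self.name = name

    def __getattr__(self, item):
        # Runs only when normal attribute lookup fails. Unknown names are
        # treated as nested-field access and produce another Column.
        if item.startswith("__"):
            raise AttributeError(item)
        return Column(f"{self.name}.{item}")


numbers = Column("numbers")

# The toy class defines no eqNullSafe method, so the lookup falls through
# to __getattr__ and returns a Column, which the parentheses then try to call.
try:
    numbers.eqNullSafe(Column("other"))
    message = None
except TypeError as exc:
    message = str(exc)

print(message)  # 'Column' object is not callable
```

If this is what is happening, the endpoint would be running a PySpark release older than 2.3.0, the version in which eqNullSafe was added, and checking `spark.version` in the notebook would show that. A missing import would not produce this particular error.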