I have a broadcast a dicitonary that I would like to use for mapping column values in my DataFrame
. Let's say I call withColumn()
method for that.
I can only get it to work with a UDF, but not directly:
sc = SparkContext()
ss = SparkSession(sc)
df = ss.createDataFrame( [ "a", "b" ], StringType() ).toDF("key")
# +---+
# |key|
# +---+
# | a|
# | b|
# +---+
thedict={"a":"A","b":"B","c":"C"}
thedict_bc=sc.broadcast(thedict)
Referencing with a literal or using UDF works fine:
df.withColumn('upper',lit(thedict_bc.value.get('c',"--"))).show()
# +---+-----+
# |key|upper|
# +---+-----+
# | a| C|
# | b| C|
# +---+-----+
df.withColumn('upper',udf(lambda x : thedict_bc.value.get(x,"--"), StringType())('key')).show()
# +---+-----+
# |key|upper|
# +---+-----+
# | a| A|
# | b| B|
# +---+-----+
However, accessing the dictionary directly from the command doesn't:
df.withColumn('upper',lit(thedict_bc.value.get(col('key'),"--"))).show()
# +---+-----+
# |key|upper|
# +---+-----+
# | a| --|
# | b| --|
# +---+-----+
df.withColumn('upper',lit(thedict_bc.value.get(df.key,"--"))).show()
# +---+-----+
# |key|upper|
# +---+-----+
# | a| --|
# | b| --|
# +---+-----+
df.withColumn('upper',lit(thedict_bc.value.get(df.key.cast("string"),"--"))).show()
# +---+-----+
# |key|upper|
# +---+-----+
# | a| --|
# | b| --|
# +---+-----+
Am I missing something obvious?