I have a dictionary of items that I want to use for substitutions in PySpark, like this:

swap = {'A': 0.07677341668184234,
        pd.NA: 0.1497896460766734,
        'B': 0.07186667210628232}

Note the pandas.NA object used as one of the keys (it prints as <NA> when the dictionary is displayed).
I also have a DataFrame (set up from a pandas table) with various values that may or may not be keys in the "swap" dictionary above:
index  column
    1  C
    2  B
    3  <NA>
    4  A
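For reproducibility, here is a minimal sketch of how an equivalent frame can be constructed directly in Spark (assuming an active SparkSession; the missing entry that pandas prints as <NA> comes through as a plain null once in Spark):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The third row's value is missing, matching the <NA> in the pandas view
df = spark.createDataFrame(
    [(1, "C"), (2, "B"), (3, None), (4, "A")],
    ["index", "column"],
)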
Based on code I found on Stack Overflow here: Pyspark: Replacing value in a column by searching a dictionary, I have been using the following function to swap out the values in the column above:
from itertools import chain
from pyspark.sql import Column
from pyspark.sql.functions import col, create_map, isnull, lit, when

def recode(col_name, map_dict, default=None):
    if not isinstance(col_name, Column):  # allows either a column-name string or a Column instance to be passed
        col_name = col(col_name)
    mapping_expr = create_map([lit(x) for x in chain(*map_dict.items())])
    if default is None:
        return mapping_expr.getItem(col_name)
    else:
        return when(~isnull(mapping_expr.getItem(col_name)), mapping_expr.getItem(col_name)).otherwise(default)
and the following command:

data = df.withColumn("column", recode("column", swap, default=0))
where 'df' would be the example dataframe above. The expected output should be:
index  column
    1  0
    2  0.07186667210628232
    3  0.1497896460766734
    4  0.07677341668184234
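As I understand it, recode flattens the dictionary into alternating key/value literals for create_map, so every key (pd.NA included) is passed through lit(). A sketch of what that expansion looks like with the swap dict above:

from itertools import chain
from pyspark.sql.functions import create_map, lit

# chain(*swap.items()) yields 'A', 0.0767..., pd.NA, 0.1497..., 'B', 0.0718...,
# so the map is effectively create_map(lit('A'), lit(0.0767...), lit(pd.NA), ...)
mapping_expr = create_map([lit(x) for x in chain(*swap.items())])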
However I get the error:
AttributeError: 'NAType' object has no attribute '_get_object_id'
This error is caused by the pandas.NA object used as a key in the "swap" dictionary. How can I get this code to work as expected without crashing?
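If it helps, the failure seems reproducible in isolation, independent of my DataFrame. A minimal sketch (assuming pandas is imported as pd and an active SparkSession, since lit() needs one):

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()
lit(pd.NA)  # AttributeError: 'NAType' object has no attribute '_get_object_id'

So it appears to be lit() choking on the NAType key itself rather than anything in the data.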