Pyspark: Replacing value in a column by searching a dictionary with values

Question

I have a situation where I have a dictionary of items in PySpark like this:

swap={'A': 0.07677341668184234,
 <NA>: 0.1497896460766734,
 'B': 0.07186667210628232}

Note the "pandas.NA" object defined as one of the keys.

I also have a pandas table set up with various values that may or may not be in the "swap" dictionary above:

index  column
1      C
2      B
3      <NA>
4      A

Based on code from another Stack Overflow question, "Pyspark: Replacing value in a column by searching a dictionary", I have been using the following function to recode the column above:

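from itertools import chain
from pyspark.sql import Column
from pyspark.sql.functions import col, create_map, isnull, lit, when

def recode(col_name, map_dict, default=None):
    if not isinstance(col_name, Column): # Allows either column name string or column instance to be passed
        col_name = col(col_name)
    # Flatten the dict into key, value, key, value, ... and build a Spark map literal from it
    mapping_expr = create_map([lit(x) for x in chain(*map_dict.items())])
    if default is None:
        return mapping_expr.getItem(col_name)
    else:
        return when(~isnull(mapping_expr.getItem(col_name)), mapping_expr.getItem(col_name)).otherwise(default)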

and the following command:

data=df.withColumn("column", recode('column', swap, default=0))

where 'df' would be the example dataframe above. The expected output should be:

index  column
1      0
2      0.07186667210628232
3      0.1497896460766734
4      0.07677341668184234

However I get the error:

AttributeError: 'NAType' object has no attribute '_get_object_id'

This error is caused by the "pandas.NA" object in the "swap" dictionary. How can I get this code to work as expected and stop crashing?
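
To see where the key ends up, here is a minimal plain-Python sketch of the flattening step inside "recode". The MISSING object is a hypothetical stand-in for "pandas.NA" (an assumption, used only so the snippet runs without pandas or Spark). It shows that the missing-value key itself is one of the items handed to lit(), which is presumably where the conversion to a Spark literal fails:

```python
from itertools import chain

MISSING = object()  # hypothetical stand-in for pandas.NA

swap = {'A': 0.07677341668184234,
        MISSING: 0.1497896460766734,
        'B': 0.07186667210628232}

# recode() flattens the dictionary into key, value, key, value, ...
flattened = list(chain(*swap.items()))

# Every element of this list is passed to lit(), including the missing-value key.
positions = [i for i, item in enumerate(flattened) if item is MISSING]
print(positions)
```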

score 0 · Answer 1 · answered Nov 19 '21 at 17:41

So, I have a partial answer to the question:

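from itertools import chain
import pandas as pd
from pyspark.sql import Column
from pyspark.sql.functions import col, create_map, isnull, lit, when

def recode(col_name, map_dict, default=None):
    if not isinstance(col_name, Column): # Allows either column name string or column instance to be passed
        col_name = col(col_name)

    if pd.NA in map_dict:
        col_name = when(~isnull(col_name), col_name).otherwise(map_dict[pd.NA]) # Substitute nulls before the lookup
        map_dict = {k: v for k, v in map_dict.items() if k is not pd.NA} # Drop the NA key without mutating the caller's dict

    mapping_expr = create_map([lit(x) for x in chain(*map_dict.items())])
    if default is None:
        return mapping_expr.getItem(col_name)
    else:
        return when(~isnull(mapping_expr.getItem(col_name)), mapping_expr.getItem(col_name)).otherwise(default)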

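The three lines starting with "if pd.NA in map_dict:" prevent the crash and will work most of the time. They fail, however, whenever the replacement for "pandas.NA" is itself a key of "map_dict", because the substituted value is then looked up in the map a second time. For example: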

map_dict={'A': 0.07677341668184234,
 <NA>: 'A',
 'B': 0.07186667210628232}

This will result in "pandas.NA" values being replaced by "0.07677341668184234" instead of "A" as desired. A more elegant solution would not suffer from this problem.
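
One possible way around the double lookup, as a sketch rather than a tested solution: build the map from the regular keys only, and decide the null case by testing the original column instead of substituting into it first. The helper name split_null_key and the null_key parameter are hypothetical, introduced here for illustration. The sketch also assumes that missing values arrive in the Spark column as nulls.

```python
from itertools import chain


def split_null_key(map_dict, null_key):
    """Return (mapping without null_key, has_null, null_value); map_dict is not mutated."""
    # Identity comparison, because pd.NA == pd.NA does not evaluate to True.
    regular = {k: v for k, v in map_dict.items() if k is not null_key}
    has_null = len(regular) != len(map_dict)
    null_value = next((v for k, v in map_dict.items() if k is null_key), None)
    return regular, has_null, null_value


def recode(col_name, map_dict, default=None, null_key=None):
    # Imported here so split_null_key stays usable without Spark installed.
    from pyspark.sql import Column
    from pyspark.sql.functions import coalesce, col, create_map, lit, when

    source = col_name if isinstance(col_name, Column) else col(col_name)
    regular, has_null, null_value = split_null_key(map_dict, null_key)

    # Look up non-null values in a map literal built from the regular keys only.
    if regular:
        result = create_map([lit(x) for x in chain(*regular.items())]).getItem(source)
    else:
        result = lit(None)
    if default is not None:
        result = coalesce(result, lit(default))
    # Test the original column for null, so the replacement is never looked up again.
    if has_null:
        result = when(source.isNull(), lit(null_value)).otherwise(result)
    return result
```

It would be called as recode('column', map_dict, default=0, null_key=pd.NA). Note that a dictionary mixing string and numeric replacements, as in the example above, relies on Spark coercing the output column to a common type, so an explicit cast may be needed.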
