Replacing column values by dict pyspark

Question

I have a dictionary like this

d = {"animal": ["cat", "dog", "turtle"], "fruit" : ["banana", "apple"]}

and a df:

+-----------+
|some_column|
+-----------+
|     banana|
|        cat|
|      apple|
|      other|
|       null|
+-----------+

I'd like to get this as output:

+-----------+
|some_column|
+-----------+
|      fruit|
|     animal|
|      fruit|
|      other|
|       null|
+-----------+

I know that if I had a dictionary like this

{"apple" : "fruit", "banana": "fruit", [···]}

I could use df.na.replace, and of course I could work through my given dictionary and convert it to that form.

But is there a way of getting my desired output without changing the dictionary?

How do you transform 'other' to 'other' and 'null' to 'null'? If a value in the original dataframe is not a member of any of the lists in the dict, does it just retain its original value? What if there is a value of 'chocolate' in the input dataframe; does it remain 'chocolate' in the output dataframe? - bici.sancta, Feb 16 '23 at 16:47

score 1 · Answer 1 · answered Feb 17 '23 at 11:55

Create a dataframe from the dictionary and join the dataframes.

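from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

d = {"animal": ["cat", "dog", "turtle"], "fruit" : ["banana", "apple"]}

# One row holding the dict as a map column, exploded into (key, value) rows
df = spark.createDataFrame([[d]], ['data'])
df = df.select(f.explode('data'))
df.show()
df.printSchema()

data = ['banana', 'cat', 'apple', 'other', None]
df2 = spark.createDataFrame(data, StringType()).toDF('some_column')
df2.show()

# Left join on list membership; unmatched values and nulls keep their original value
df2.join(df, f.array_contains(f.col('value'), f.col('some_column')), 'left') \
   .select(f.coalesce('key', 'some_column').alias('some_column')) \
   .show()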

+------+------------------+
|   key|             value|
+------+------------------+
|animal|[cat, dog, turtle]|
| fruit|   [banana, apple]|
+------+------------------+

root
 |-- key: string (nullable = false)
 |-- value: array (nullable = true)
 |    |-- element: string (containsNull = true)

+-----------+
|some_column|
+-----------+
|     banana|
|        cat|
|      apple|
|      other|
|       null|
+-----------+

+-----------+
|some_column|
+-----------+
|      fruit|
|     animal|
|      fruit|
|      other|
|       null|
+-----------+
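
One thing to watch with the join approach: if a value were listed under more than one key, the left join would match one row per key and that input row would appear several times in the output. The dictionary in the question has no such overlap, but a small pre-check can confirm it. This is only a sketch; `find_ambiguous_values` is a hypothetical helper name, not part of PySpark.

```python
from collections import defaultdict


def find_ambiguous_values(groups):
    """Return {value: [keys...]} for values listed under more than one key.

    Hypothetical helper: `groups` is a dict of the form {key: [values...]}.
    """
    owners = defaultdict(set)
    for key, values in groups.items():
        for value in values:
            owners[value].add(key)
    # Keep only values claimed by at least two distinct keys
    return {value: sorted(keys) for value, keys in owners.items() if len(keys) > 1}


d = {"animal": ["cat", "dog", "turtle"], "fruit": ["banana", "apple"]}
ambiguous = find_ambiguous_values(d)  # empty dict for the question's data
```

If the result is not empty, you would have to decide which key should win before joining.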

score 0 · Answer 2 · answered Feb 16 '23 at 17:06

import pandas as pd

lx = {"animal": ["cat", "dog", "turtle"], "fruit" : ["banana", "apple"]}
df = pd.DataFrame({'input': ['banana', 'cat', 'apple', 'other', 'null']})
ls_input = df['input'].to_list()

# invert dict .. see https://stackoverflow.com/questions/483666/reverse-invert-a-dictionary-mapping
lx_inv = {vi: k  for k, v in lx.items() for vi in v}

# look up each value; values missing from the inverted dict are kept unchanged
y = [lx_inv.get(x, x) for x in ls_input]

df2 = pd.DataFrame(data=y, columns=['output'])

This creates an inverted dictionary. I'm not sure what exactly you mean by 'not changing the dictionary'; this method creates a new dict for making the comparisons. There are probably also some nuances around duplicates (can a value belong to two keys in the original dict?) and missing/undefined cases, but you need to specify which cases are possible and what the desired outcome is for each.
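
Since the question is about PySpark rather than pandas, the same inversion could be combined with the `df.na.replace` call the asker already mentions. The following is a minimal sketch under that assumption; `invert_groups` and `replace_with_groups` are hypothetical helper names, and `df` is assumed to be a PySpark DataFrame with a string column.

```python
def invert_groups(groups):
    """Turn {key: [values...]} into {value: key} without modifying `groups`."""
    return {value: key for key, values in groups.items() for value in values}


def replace_with_groups(df, column, groups):
    """Hypothetical helper: replace the values of `column` by their group key.

    `df` is assumed to be a PySpark DataFrame and `column` a string column.
    """
    mapping = invert_groups(groups)
    # A dict passed to na.replace maps old values to new ones; values that
    # are not keys of the dict, and nulls, are left as they are.
    return df.na.replace(mapping, subset=[column])


d = {"animal": ["cat", "dog", "turtle"], "fruit": ["banana", "apple"]}
mapping = invert_groups(d)  # the original dict `d` is not changed
```

With the dataframe from the question this would be called as `replace_with_groups(df, "some_column", d)`. If a value appears under two keys, the later key silently wins in the inverted dict.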
