
Does anyone know how to implement an equivalent of defaultdict in PySpark? So far I have:

import pyspark.sql.functions as F
import pandas as pd
from itertools import chain
from collections import defaultdict

example_dict = defaultdict(
                    lambda: "Unmapped",
                    {"zero": 0,
                     "one": 1,
                     "two": 2})

mapping = F.create_map([F.lit(x) for x in chain(*example_dict.items())])

d = {'col1': [0, 1, 2, 3]}
df = pd.DataFrame(data=d)

dfs = spark.createDataFrame(df).withColumn("map", mapping.getItem(F.col("col1")))
dfs.show()

+----+-----+
|col1|  map|
+----+-----+ 
|   0| null|
|   1|  one|
|   2|  two|
|   3|three|
+----+-----+

Apparently the default value is not applied. However, if I apply the same mapping to a pandas DataFrame, it works fine:

df["map"] = df["col1"].map(example_dict)

    col1    map
0   0       Unmapped
1   1       one
2   2       two
3   3       three

I am not sure whether this is because PySpark doesn't support defaultdict or because I made a mistake in the implementation.
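For what it's worth, as far as I can tell the map expression is built once from whatever pairs the dict already contains, so the `default_factory` never gets a chance to fire on the Spark side:

    # What create_map actually receives: the existing key/value pairs, flattened.
    # The defaultdict's default_factory is never consulted here.
    list(chain(*example_dict.items()))
    # ['zero', 0, 'one', 1, 'two', 2]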

Blazej Kowalski
    Don't know if this helps your case, but couldn't you just replace your nulls with 'Unmapped' using a case statement? – murtihash Apr 02 '20 at 22:04
  •
    Looks like the keys and values are backwards in your `example_dict`. You can either [fill nulls](https://stackoverflow.com/questions/45065636/pyspark-how-to-fillna-values-in-dataframe-for-specific-columns) (`dfs.fillna("Unmapped", subset="map")`) or use [`when`](https://stackoverflow.com/questions/39048229/spark-equivalent-of-if-then-else) to do conditional replace. – pault Apr 02 '20 at 22:06
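
A minimal sketch of what these comments suggest, assuming the intended mapping goes from the integers in `col1` to their names (i.e. keys and values flipped relative to `example_dict`), with `coalesce` or `fillna` supplying the `defaultdict`-style fallback; the dict name and values here are illustrative:

    import pyspark.sql.functions as F
    from itertools import chain
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Plain dict, flipped so the integer keys match the values in col1.
    int_to_name = {0: "zero", 1: "one", 2: "two"}

    mapping = F.create_map([F.lit(x) for x in chain(*int_to_name.items())])

    dfs = spark.createDataFrame([(0,), (1,), (2,), (3,)], ["col1"])

    # Option 1: coalesce supplies the default for keys missing from the map.
    dfs.withColumn(
        "map", F.coalesce(mapping.getItem(F.col("col1")), F.lit("Unmapped"))
    ).show()

    # Option 2: do the lookup first, then fill the resulting nulls.
    dfs.withColumn("map", mapping.getItem(F.col("col1"))).fillna(
        "Unmapped", subset="map"
    ).show()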

0 Answers