
I have a dataframe with a column (e.g. features) that is in dictionary format, as shown below. I'd like to convert each key in the dict to a column. All the rows have the same dict keys, but different values. I found some solutions on SO that can convert a dict to columns, but only when the keys of the dict are known; in my case they are unknown. Any suggestions on how to achieve this?

+-------+-----------+------------------------------------------------------------------------+
|id1    |id2        |features                                                                |
+-------+-----------+------------------------------------------------------------------------+
|1341205|a232523    |{"attr1.feature1": 0.25, "attr1.feature2": 0.0, "attr2.feature1": -0.43}|
|553654 |a325933    |{"attr1.feature1": 0.3, "attr1.feature2": 0.70, "attr2.feature1": 0.11} |
|573786 |a9923823   |{"attr1.feature1": -0.1, "attr1.feature2": 0.20, "attr2.feature1": 0.12}|
+-------+-----------+------------------------------------------------------------------------+

Expected dataframe:

+-------+-----------+---------------+---------------+----------------+
|id1    |id2        |attr1.feature1 |attr1.feature2 |attr2.feature1  |
+-------+-----------+---------------+---------------+----------------+
|1341205|a232523    |0.25           |0.0            |-0.43           |
|553654 |a325933    |0.3            |0.70           |0.11            |
|573786 |a9923823   |-0.1           |0.20           |0.12            |
+-------+-----------+---------------+---------------+----------------+
  • What did you try so far? If you know how to convert a dict to columns if you know the keys, you should recover the column names within the values, probably a udf will do the job – Christophe Oct 25 '21 at 08:24
  • can you do a `df.printSchema()`? – Steven Oct 25 '21 at 08:27
  • I have tried the method mentioned in the first answer of: https://stackoverflow.com/questions/53072138/splitting-a-dictionary-in-a-pyspark-dataframe-into-individual-columns. It works by specifying the keys list. But the keys are dynamic, hence in the real application, I won't be able to know it beforehand. – MarieS Oct 25 '21 at 08:35

1 Answer

For dynamic keys, you can parse the column into a MapType with the from_json function.
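If you want to reproduce the example, here is a minimal sketch of the input, assuming the features column arrives as a JSON string (for instance after reading from a CSV or JSON source):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# sample data copied from the question; ids are kept as strings
df = spark.createDataFrame(
    [
        ("1341205", "a232523", '{"attr1.feature1": 0.25, "attr1.feature2": 0.0, "attr2.feature1": -0.43}'),
        ("553654", "a325933", '{"attr1.feature1": 0.3, "attr1.feature2": 0.70, "attr2.feature1": 0.11}'),
        ("573786", "a9923823", '{"attr1.feature1": -0.1, "attr1.feature2": 0.20, "attr2.feature1": 0.12}'),
    ],
    ["id1", "id2", "features"],
)

With that in place, parse the JSON string into a map: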

from pyspark.sql import functions as F
from pyspark.sql import types as T

# assuming df is your dataframe

# the values here are floats; adjust the value type to match your data
schema = T.MapType(T.StringType(), T.FloatType())

df = df.withColumn("features", F.from_json(F.col("features"), schema=schema))

df.show(truncate=False)
+-------+--------+------------------------------------------------------------------------+
|id1    |id2     |features                                                                |
+-------+--------+------------------------------------------------------------------------+
|1341205|a232523 |[attr1.feature1 -> 0.25, attr1.feature2 -> 0.0, attr2.feature1 -> -0.43]|
|553654 |a325933 |[attr1.feature1 -> 0.3, attr1.feature2 -> 0.7, attr2.feature1 -> 0.11]  |
|573786 |a9923823|[attr1.feature1 -> -0.1, attr1.feature2 -> 0.2, attr2.feature1 -> 0.12] |
+-------+--------+------------------------------------------------------------------------+

df.printSchema()
root
 |-- id1: string (nullable = true)
 |-- id2: string (nullable = true)
 |-- features: map (nullable = true)
 |    |-- key: string
 |    |-- value: float (valueContainsNull = true)

Technically, once features is a MapType, you can transform it directly into the representation you want, but you would have to know the keys beforehand. You can retrieve the unknown keys by performing a first action:

# assuming that each row has the same keys

keys = df.select(F.map_keys("features").alias("keys")).first().keys

df.select(
    "id1",
    "id2",
    *(F.col("features").getItem(col).alias(col) for col in keys),
).show(truncate=False)
+-------+--------+--------------+--------------+--------------+                 
|id1    |id2     |attr1.feature1|attr1.feature2|attr2.feature1|
+-------+--------+--------------+--------------+--------------+
|1341205|a232523 |0.25          |0.0           |-0.43         |
|553654 |a325933 |0.3           |0.7           |0.11          |
|573786 |a9923823|-0.1          |0.2           |0.12          |
+-------+--------+--------------+--------------+--------------+
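Note that first() only inspects a single row, which is fine here since the question states all rows share the same keys. If the keys could vary between rows (an assumption beyond the question), a sketch that collects the distinct union of all keys instead:

# one extra Spark job, but it does not depend on the first row being complete
distinct_keys = (
    df.select(F.explode(F.map_keys("features")).alias("key"))
    .distinct()
    .collect()
)
keys = [row.key for row in distinct_keys]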

You can also explode the map into one row per key, for example; note that the output shape differs from what you expected:

df.select(
    "id1",
    "id2",
    F.explode("features").alias("feature", "value"),
).show()

+-------+--------+--------------+-----+                                         
|    id1|     id2|       feature|value|
+-------+--------+--------------+-----+
|1341205|a232523 |attr1.feature1| 0.25|
|1341205|a232523 |attr1.feature2|  0.0|
|1341205|a232523 |attr2.feature1|-0.43|
|553654 |a325933 |attr1.feature1|  0.3|
|553654 |a325933 |attr1.feature2|  0.7|
|553654 |a325933 |attr2.feature1| 0.11|
|573786 |a9923823|attr1.feature1| -0.1|
|573786 |a9923823|attr1.feature2|  0.2|
|573786 |a9923823|attr2.feature1| 0.12|
+-------+--------+--------------+-----+
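If you would rather not collect the keys to the driver at all, a possible variant (not in the original answer) is to pivot the exploded form back to wide format, at the cost of a shuffle:

df.select(
    "id1",
    "id2",
    F.explode("features").alias("feature", "value"),
).groupBy("id1", "id2").pivot("feature").agg(F.first("value")).show(truncate=False)

pivot determines the distinct keys internally, so this reproduces the expected dataframe without listing the keys yourself.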