
I'm trying to replicate the solution to this question in PySpark (Spark < 2.3, so no map_keys): How to get keys and values from MapType column in SparkSQL DataFrame. Below is my code (same df as in the linked question):

import pyspark.sql.functions as F

distinctKeys = df\
  .select(F.explode("alpha"))\
  .select("key")\
  .distinct()\
  .rdd

df.select("id", distinctKeys.map(lambda x: "alpha".getItem(x).alias(x))

However, this code gives the error: AttributeError: 'PipelinedRDD' object has no attribute '_get_object_id'. Any thoughts on how to fix it?
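For context on the traceback: DataFrame.select only accepts Column expressions or column-name strings, and distinctKeys.map(...) returns an RDD, so py4j fails when it tries to treat that RDD as a column (hence the _get_object_id message). The keys have to come back to the driver as plain Python strings before they can be used in select; a minimal sketch, assuming the same df and alpha column:

import pyspark.sql.functions as F

keys_rdd = df.select(F.explode("alpha")).select("key").distinct().rdd

# keys_rdd is an RDD of Row objects; select() cannot take it as an argument.
# The keys first have to be pulled back to the driver as plain strings:
keys = [row.key for row in keys_rdd.collect()]
print(keys)  # e.g. ['a', 'b', 'c'], depending on the data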

koiralo
Egodym

1 Answer


Try creating distinctKeys as a plain Python list of strings, then use a list comprehension to put each key into its own column:

import pyspark.sql.functions as F

# generate a list of distinct keys from the MapType column
distinctKeys = df.select(F.explode("alpha")).agg(F.collect_set("key").alias('keys')).first().keys
# or use your existing method
# distinctKeys = [ d.key for d in df.select(F.explode("alpha")).select("key").distinct().collect() ]

df_new = df.select("id", *[ F.col("alpha")[k].alias(k) for k in distinctKeys ])
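As a quick check of the pattern, here is how it plays out end to end on a toy DataFrame; the id/alpha column names match the question, but the sample rows are made up:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# made-up sample data: an id plus a MapType column named "alpha"
df = spark.createDataFrame(
    [(1, {"a": 10, "b": 20}), (2, {"a": 30, "c": 40})],
    ["id", "alpha"],
)

# collect the distinct map keys to the driver as plain strings
distinctKeys = (df.select(F.explode("alpha"))
                  .agg(F.collect_set("key").alias("keys"))
                  .first().keys)

# one output column per key; rows whose map lacks that key get null
df_new = df.select("id", *[F.col("alpha")[k].alias(k) for k in distinctKeys])
df_new.show()
# shows id plus columns a, b, c (column order depends on collect_set),
# with null where a given id's map does not contain that key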
jxc