I've been searching around and haven't figured out a way to restructure a dataframe's column into new columns added dynamically based on the array contents. I'm new to python, so I might be searching with the wrong terms, which may be why I haven't found a clear example yet. Please let me know if this is a duplicate and link to it. I think I just need to be pointed in the right direction.
Ok, the details.
The environment is pyspark 2.3.2 and python 2.7.
The sample column contains two arrays that are correlated to each other 1-to-1. I would like to create a column for each value in the titles array and put the corresponding name (from the person array) in the respective column.
I cobbled together an example to focus on my problem with restructuring the dataframe.
import json
from pyspark.sql import functions as f

# Each record has a location and a sample struct holding two parallel arrays.
data = [
    {"sample": {"titles": ["Engineer", "Designer", "Manager"], "person": ["Mary", "Charlie", "Mac"]}, "location": "loc a"},
    {"sample": {"titles": ["Engineer", "Owner"], "person": ["Tom", "Sue"]}, "location": "loc b"},
    {"sample": {"titles": ["Engineer", "Designer"], "person": ["Jane", "Bill"]}, "location": "loc a"},
]

# Serialize the list as a single JSON array string; spark.read.json produces
# one row per element of that array.
jsonRDD = sc.parallelize([json.dumps(data)])
df = spark.read.json(jsonRDD)
This is the schema of my dataframe:
In [4]: df.printSchema()
root
|-- location: string (nullable = true)
|-- sample: struct (nullable = true)
| |-- person: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- titles: array (nullable = true)
| | |-- element: string (containsNull = true)
My dataframe data:
In [5]: df.show(truncate=False)
+--------+-----------------------------------------------------+
|location|sample |
+--------+-----------------------------------------------------+
|loc a |[[Mary, Charlie, Mac], [Engineer, Designer, Manager]]|
|loc b |[[Sue, Tom], [Owner, Engineer]] |
|loc a |[[Jane, Bill], [Engineer, Designer]] |
+--------+-----------------------------------------------------+
And what I would like my dataframe to look like:
+--------+-----------------------------------------------------+------------+-----------+---------+---------+
|location|sample                                               |Engineer    |Designer   |Manager  |Owner    |
+--------+-----------------------------------------------------+------------+-----------+---------+---------+
|loc a |[[Mary, Charlie, Mac], [Engineer, Designer, Manager]]|Mary |Charlie |Mac | |
|loc b |[[Sue, Tom], [Owner, Engineer]] |Tom | | |Sue |
|loc a |[[Jane, Bill], [Engineer, Designer]] |Jane |Bill | | |
+--------+-----------------------------------------------------+------------+-----------+---------+---------+
I've tried using the explode function, but I only end up with more records, each still containing the array field. There are some examples on Stack Overflow, but they use static column names. In this dataset the titles can appear in any order, and new titles can be added later.
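For reference, this is roughly what my explode attempt looked like (a minimal sketch; the alias "title" is just my own label). Exploding sample.titles gives one row per title, but the person array is still carried along whole, so I get more rows instead of new columns.

from pyspark.sql import functions as f

# Explode the titles array: one output row per title, but the sample struct
# (including the person array) is repeated on every row.
exploded = df.select("location", "sample", f.explode("sample.titles").alias("title"))
exploded.show(truncate=False)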