
I've been searching around and haven't found a way to dynamically restructure a dataframe column into new columns based on the array contents. I'm new to Python, so I may be searching on the wrong terms, which would explain why I haven't found a clear example yet. Please let me know if this is a duplicate and link to it; I think I just need to be pointed in the right direction.

Ok, the details.

The environment is pyspark 2.3.2 and python 2.7

The sample column contains 2 arrays, which are correlated to each other 1-to-1. I would like to create a column for each value in the titles array and put the corresponding name (from the person array) in the respective column.

I cobbled together an example to focus on my problem of reshaping the dataframe.

import json
from pyspark.sql.types import ArrayType, StructType, StructField, StringType
from pyspark.sql import functions as f

input = (
    {"sample": {"titles": ["Engineer", "Designer", "Manager"],
                "person": ["Mary", "Charlie", "Mac"]},
     "location": "loc a"},
    {"sample": {"titles": ["Engineer", "Owner"],
                "person": ["Tom", "Sue"]},
     "location": "loc b"},
    {"sample": {"titles": ["Engineer", "Designer"],
                "person": ["Jane", "Bill"]},
     "location": "loc a"},
)

a = [json.dumps(input)]
jsonRDD = sc.parallelize(a)
df = spark.read.json(jsonRDD)

This is the schema of my dataframe:

In [4]: df.printSchema()
root
 |-- location: string (nullable = true)
 |-- sample: struct (nullable = true)
 |    |-- person: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- titles: array (nullable = true)
 |    |    |-- element: string (containsNull = true)

My dataframe data:

In [5]: df.show(truncate=False)
+--------+-----------------------------------------------------+
|location|sample                                               |
+--------+-----------------------------------------------------+
|loc a   |[[Mary, Charlie, Mac], [Engineer, Designer, Manager]]|
|loc b   |[[Sue, Tom], [Owner, Engineer]]                      |
|loc a   |[[Jane, Bill], [Engineer, Designer]]                 |
+--------+-----------------------------------------------------+

And what I would like my dataframe to look like:

+--------+-----------------------------------------------------+------------+-----------+---------+---------+
|location|sample                                               |Engineer    |Designer   |Manager  | Owner   |
+--------+-----------------------------------------------------+------------+-----------+---------+---------+
|loc a   |[[Mary, Charlie, Mac], [Engineer, Designer, Manager]]|Mary        |Charlie    |Mac      |         |
|loc b   |[[Sue, Tom], [Owner, Engineer]]                      |Tom         |           |         |Sue      |
|loc a   |[[Jane, Bill], [Engineer, Designer]]                 |Jane        |Bill       |         |         |
+--------+-----------------------------------------------------+------------+-----------+---------+---------+

I've tried the explode function, but it only produces more records, each still containing the array field. There are examples on Stack Overflow, but they use static column names. This dataset can have the titles in any order, and new titles can be added later.

deanw

2 Answers

  1. Without explode

  2. With explode

    • Add a unique id using monotonically_increasing_id.
    • Use one of the methods shown in Pyspark: Split multiple array columns into rows to explode both arrays together, or explode the map created with the first method.
    • pivot the result, grouping by the added id and any other fields you want to preserve; pivot by title, taking first(person).
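In plain-Python terms (hypothetical row values, outside Spark), the explode-then-pivot reshaping amounts to the following; the row ids and values here are made up for illustration:

```python
# One record per (row id, location, title, person) tuple,
# the shape that exploding both arrays together would produce
exploded = [
    (0, "loc a", "Engineer", "Mary"),
    (0, "loc a", "Designer", "Charlie"),
    (1, "loc b", "Owner", "Sue"),
]

# Pivot: group by row id, turn each title into a "column" holding the person
pivoted = {}
for row_id, location, title, person in exploded:
    pivoted.setdefault(row_id, {"location": location})[title] = person

# pivoted[0] == {"location": "loc a", "Engineer": "Mary", "Designer": "Charlie"}
```

Spark's groupBy(...).pivot("title").agg(f.first("person")) does the same reshaping, with missing titles coming back as null.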
  • Thank you for the quick response. I like the first approach (without explode); it looks cleaner and more straightforward. The dictionary that gets created is backwards, though — the key should be the titles. I changed the udf zip statement to read "return dict(zip(x[1],x[0])) if x else None". Is this proper in Python, or should I be using a different approach? – deanw Nov 03 '18 at 22:14

@user10601094 helped me get this question answered. I'm posting the full solution below to help anyone else who might have a similar question.

I'm not very fluent in Python, so please feel free to suggest better approaches.

In [1]: import json
   ...: from pyspark.sql import functions as f
   ...: 

In [2]: # define a sample data set
   ...: input = { "sample": {    "titles": ["Engineer", "Designer", "Manager"],    "person": ["Mary", "Charlie", "Mac"]  },  "location": "loc a"},{ "sample": {    "titles": ["Engineer", "Owner"],
   ...:     "person": ["Tom", "Sue"]  },  "location": "loc b"},{ "sample": {    "titles": ["Engineer", "Designer"],    "person": ["Jane", "Bill"]  },  "location": "loc a"}

In [3]: # create a dataframe with the sample json data
   ...: a = [json.dumps(input)]
   ...: jsonRDD = sc.parallelize(a)
   ...: df = spark.read.json(jsonRDD)
   ...: 

In [4]: # Change the array in the sample column to a dictionary
   ...: # swap the columns so the titles are the key
   ...: 
   ...: # UDF to convert 2 arrays into a map
   ...: @f.udf("map<string,string>")
   ...: def as_dict(x):
   ...:     return dict(zip(x[1],x[0])) if x else None
   ...: 
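As a plain-Python sanity check of the swapped zip (sample values only, outside Spark):

```python
# The struct fields arrive alphabetically, so inside the udf
# x[0] is the person array and x[1] is the titles array
x = (["Mary", "Charlie", "Mac"], ["Engineer", "Designer", "Manager"])

# zip(x[1], x[0]) keys the map by title, so titles become column names later
as_map = dict(zip(x[1], x[0])) if x else None
# {'Engineer': 'Mary', 'Designer': 'Charlie', 'Manager': 'Mac'}
```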

In [5]: # create a new dataframe based on the original dataframe
   ...: dfmap = df.withColumn("sample", as_dict("sample"))

In [6]: # Convert sample column to be title columns based on the map
   ...: 
   ...: # get the columns names, stored in the keys
   ...: keys = (dfmap
   ...:     .select(f.explode("sample"))
   ...:     .select("key")
   ...:     .distinct()
   ...:     .rdd.flatMap(lambda x: x)
   ...:     .collect())

In [7]: # create a list of column expressions, one per title key
   ...: exprs = [f.col("sample").getItem(k).alias(k) for k in keys]
   ...: 

In [8]: dfmap.select(dfmap.location, *exprs).show()
+--------+--------+--------+-------+-----+
|location|Designer|Engineer|Manager|Owner|
+--------+--------+--------+-------+-----+
|   loc a| Charlie|    Mary|    Mac| null|
|   loc b|    null|     Tom|   null|  Sue|
|   loc a|    Bill|    Jane|   null| null|
+--------+--------+--------+-------+-----+
deanw