I use Python 3.6 and PySpark 2.3.0. In the following example I have only two entries in item, but it can also contain more fields such as first_name, last_name, and city.
I have a data frame with the following schema:
|-- email: string (nullable = true)
|-- item: struct (nullable = true)
|    |-- item: array (nullable = true)
|    |    |-- element: struct (containsNull = true)
|    |    |    |-- data: string (nullable = true)
|    |    |    |-- fieldid: string (nullable = true)
|    |    |    |-- fieldname: string (nullable = true)
|    |    |    |-- fieldtype: string (nullable = true)
This is my output:
+-----+-----------------------------------------------------------------------------------------+
|email|item                                                                                     |
+-----+-----------------------------------------------------------------------------------------+
|x    |[[[Gmail, 32, Email Client, dropdown], [Device uses Proxy Server, 33, Device, dropdown]]]|
|y    |[[[IE, 32, Email Client, dropdown], [Personal computer, 33, Device, dropdown]]]          |
+-----+-----------------------------------------------------------------------------------------+
I want to transform this data frame to:
+-----+------------+------------------------+
|email|Email Client|Device                  |
+-----+------------+------------------------+
|x    |Gmail       |Device uses Proxy Server|
|y    |IE          |Personal computer       |
+-----+------------+------------------------+
I do some transformations:
df = df.withColumn('item', df.item.item)
df = df.withColumn('column_names', df.item.fieldname)
df = df.withColumn('column_values', df.item.data)
And now my output is:
+-----+----------------------+---------------------------------+
|email|column_names          |column_values                    |
+-----+----------------------+---------------------------------+
|x    |[Email Client, Device]|[Gmail, Device uses Proxy Server]|
|y    |[Email Client, Device]|[IE, Personal computer]          |
+-----+----------------------+---------------------------------+
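From here, I am looking for a way to zip these two array columns, so that each entry in column_names becomes its own column holding the matching entry from column_values.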
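To make the goal concrete, here is a minimal plain-Python sketch (not PySpark) of the per-row zip I have in mind. The helper zip_row and the sample rows list are hypothetical names used only for illustration. The sketch assumes column_names and column_values always have the same length and the same order within a row.

```python
def zip_row(email, column_names, column_values):
    """Pair each field name with its value and merge the pairs into one flat record."""
    record = {"email": email}
    # zip() pairs the i-th name with the i-th value; if the two lists differ
    # in length, the extra entries of the longer one are silently dropped.
    record.update(zip(column_names, column_values))
    return record


# Sample rows mirroring the intermediate data frame shown above.
rows = [
    ("x", ["Email Client", "Device"], ["Gmail", "Device uses Proxy Server"]),
    ("y", ["Email Client", "Device"], ["IE", "Personal computer"]),
]

records = [zip_row(*row) for row in rows]
for record in records:
    print(record)
```

Each resulting record corresponds to one row of the desired table. In PySpark, the same per-row logic could presumably be wrapped in a UDF that returns a map, but I have not found the right way to do that yet.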