Is there any way to convert flatten Dataframe to nested Dataframe using Pyspark?

Question

I have the following dataframe with the schema:

+------+--------+--------+----------+----------+-------+----------+------+--------------+-------+
|emp_id|emp_name|job_name|manager_id| hire_date| salary|commission|dep_id|increment_date|country|
+------+--------+--------+----------+----------+-------+----------+------+--------------+-------+
| 64989|  ADELYN|SALESMAN|     66928|1991-02-20|1700.00|    400.00|  3001|    2000-02-20| France|
| 64999|     Raj|SALESMAN|     66928|1991-02-20|1700.00|    400.00|  3001|    2000-02-20|    Ind|
+------+--------+--------+----------+----------+-------+----------+------+--------------+-------+

root
 |-- emp_id: string (nullable = true)
 |-- emp_name: string (nullable = true)
 |-- job_name: string (nullable = true)
 |-- manager_id: string (nullable = true)
 |-- hire_date: string (nullable = true)
 |-- salary: string (nullable = true)
 |-- commission: string (nullable = true)
 |-- dep_id: string (nullable = true)
 |-- increment_date: string (nullable = true)
 |-- country: string (nullable = true)

I need to convert this to a nested Dataframe using Pyspark -

+--------------------+-------------+----------+-------+------------+-----------+
|         emp_details|incrementDate|commission|country|    hireDate| reports_to|
+--------------------+-------------+----------+-------+------------+-----------+
|{{64989}, ADELYN,...|   2000-02-20|    400.00| France|{1991-02-20}|{[{66928}]}|
|{{64999}, Raj, SA...|   2000-02-20|    400.00|    Ind|{1991-02-20}|{[{66928}]}|
+--------------------+-------------+----------+-------+------------+-----------+

with following schema;

root
 |-- emp_details: struct (nullable = true)
 |    |-- id: struct (nullable = true)
 |    |    |-- id: string (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- position: string (nullable = true)
 |    |-- depId: struct (nullable = true)
 |    |    |-- dep_id: string (nullable = true)
 |-- incrementDate: string (nullable = true)
 |-- commission: string (nullable = true)
 |-- country: string (nullable = true)
 |-- hireDate: struct (nullable = true)
 |    |-- hire_date: string (nullable = true)
 |-- reports_to: struct (nullable = true)
 |    |-- reporting: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- manager_id: string (nullable = true)

I tried with the help of - https://kb.databricks.com/en_US/scala/convert-flat-df-to-nested-json but it is written in Scala that I am unable to understand. I need help to rewrite the code as an apprentice in Pyspark

Ronak Jain · Answer 1 · 2023-03-22T19:07:23.687

This would work:

df=df\
 .withColumn("emp_details", F.struct(F.struct(F.col("emp_id").alias("id")).alias("id"),F.col("emp_name").alias("name"), F.col("job_name").alias("position"), F.struct("dep_id").alias("depId")))\
 .withColumn("hireDate", F.struct("hire_date"))\
 .withColumn("reports_to", F.struct(F.array(F.struct("manager_id")).alias("reporting")))\
 .selectExpr("emp_details","increment_date as incrementDate","commission","country","hireDate","reports_to")

df.show(truncate=False)
df.printSchema()

Input:

+------+--------+--------+----------+----------+-------+----------+------+--------------+-------+
|emp_id|emp_name|job_name|manager_id| hire_date| salary|commission|dep_id|increment_date|country|
+------+--------+--------+----------+----------+-------+----------+------+--------------+-------+
| 64989|  ADELYN|SALESMAN|     66928|1991-02-20|1700.00|    400.00|  3001|    2000-02-20| France|
| 64999|     Raj|SALESMAN|     66928|1991-02-20|1700.00|    400.00|  3001|    2000-02-20|    Ind|
+------+--------+--------+----------+----------+-------+----------+------+--------------+-------+

Output:

    +-----------------------------------+-------------+----------+-------+------------+-----------+
    |emp_details                        |incrementDate|commission|country|hireDate    |reports_to |
    +-----------------------------------+-------------+----------+-------+------------+-----------+
    |{{64989}, ADELYN, SALESMAN, {3001}}|2000-02-20   |400.00    |France |{1991-02-20}|{[{66928}]}|
    |{{64999}, Raj, SALESMAN, {3001}}   |2000-02-20   |400.00    |Ind    |{1991-02-20}|{[{66928}]}|
    +-----------------------------------+-------------+----------+-------+------------+-----------+

Output Schema:

root
 |-- emp_details: struct (nullable = false)
 |    |-- id: struct (nullable = false)
 |    |    |-- id: string (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- position: string (nullable = true)
 |    |-- depId: struct (nullable = false)
 |    |    |-- dep_id: string (nullable = true)
 |-- incrementDate: string (nullable = true)
 |-- commission: string (nullable = true)
 |-- country: string (nullable = true)
 |-- hireDate: struct (nullable = false)
 |    |-- hire_date: string (nullable = true)
 |-- reports_to: struct (nullable = false)
 |    |-- reporting: array (nullable = false)
 |    |    |-- element: struct (containsNull = false)
 |    |    |    |-- manager_id: string (nullable = true)

Is there any way to convert flatten Dataframe to nested Dataframe using Pyspark?

1 Answers1