I have the following DataFrame with this schema:
+------+--------+--------+----------+----------+-------+----------+------+--------------+-------+
|emp_id|emp_name|job_name|manager_id| hire_date| salary|commission|dep_id|increment_date|country|
+------+--------+--------+----------+----------+-------+----------+------+--------------+-------+
| 64989| ADELYN|SALESMAN| 66928|1991-02-20|1700.00| 400.00| 3001| 2000-02-20| France|
| 64999| Raj|SALESMAN| 66928|1991-02-20|1700.00| 400.00| 3001| 2000-02-20| Ind|
+------+--------+--------+----------+----------+-------+----------+------+--------------+-------+
root
|-- emp_id: string (nullable = true)
|-- emp_name: string (nullable = true)
|-- job_name: string (nullable = true)
|-- manager_id: string (nullable = true)
|-- hire_date: string (nullable = true)
|-- salary: string (nullable = true)
|-- commission: string (nullable = true)
|-- dep_id: string (nullable = true)
|-- increment_date: string (nullable = true)
|-- country: string (nullable = true)
I need to convert this to a nested DataFrame using PySpark:
+--------------------+-------------+----------+-------+------------+-----------+
| emp_details|incrementDate|commission|country| hireDate| reports_to|
+--------------------+-------------+----------+-------+------------+-----------+
|{{64989}, ADELYN,...| 2000-02-20| 400.00| France|{1991-02-20}|{[{66928}]}|
|{{64999}, Raj, SA...| 2000-02-20| 400.00| Ind|{1991-02-20}|{[{66928}]}|
+--------------------+-------------+----------+-------+------------+-----------+
with the following schema:
root
|-- emp_details: struct (nullable = true)
| |-- id: struct (nullable = true)
| | |-- id: string (nullable = true)
| |-- name: string (nullable = true)
| |-- position: string (nullable = true)
| |-- depId: struct (nullable = true)
| | |-- dep_id: string (nullable = true)
|-- incrementDate: string (nullable = true)
|-- commission: string (nullable = true)
|-- country: string (nullable = true)
|-- hireDate: struct (nullable = true)
| |-- hire_date: string (nullable = true)
|-- reports_to: struct (nullable = true)
| |-- reporting: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- manager_id: string (nullable = true)
I tried following https://kb.databricks.com/en_US/scala/convert-flat-df-to-nested-json, but it is written in Scala, which I am unable to understand. I am a PySpark beginner and need help rewriting that code in PySpark.