
I'm looking for a way to pivot an input dataset with the structure below in Hive or PySpark. The input contains more than half a billion records, and for each emp_id there are up to 8 rows and 5 columns possible, so I will end up with 40 columns. I referred to this link, but there the pivoted output column already exists in the dataset; in mine it doesn't. I also tried this link, but the SQL becomes very large (not that it matters much). Is there a better way to do this where the resultant pivoted columns are concatenated with the rank?

input

emp_id,  dept_id,   dept_name, rank
1001,   101,        sales,      1
1001,   102,        marketing,  2
1002,   101,        sales,      1
1002,   102,        marketing,  2

expected output

emp_id,     dept_id_1, dept_name_1, dept_id_2, dept_name_2
1001,       101,        sales,      102,        marketing
1002,       101,        sales,      102,        marketing

1 Answer


You can apply multiple aggregations after pivoting; aliasing each aggregation lets you control the generated column names, like so:

import pyspark.sql.functions as F

(df
    .groupBy('emp_id')
    .pivot('rank')  # one set of aggregated columns per distinct rank value
    .agg(
        F.first('dept_id').alias('dept_id'),
        F.first('dept_name').alias('dept_name')
    )
    .show()
)

# Output
# +------+---------+-----------+---------+-----------+
# |emp_id|1_dept_id|1_dept_name|2_dept_id|2_dept_name|
# +------+---------+-----------+---------+-----------+
# |  1002|      101|      sales|      102|  marketing|
# |  1001|      101|      sales|      102|  marketing|
# +------+---------+-----------+---------+-----------+