How can I use PySpark or pandas to achieve the transformation below? Thanks a lot. The source file is a CSV with the following info:
(screenshot of the source data)
expected:
(screenshot of the expected output)
I assume that your source CSV file is as follows:
 ,Group_1,Group_2
 ,Sub_Group_1,Sub_Group_1
Maturity,7/31/20,7/31/20
0.1,0.2,0.3
0.2,0.3,0.4
Note that the first field of each of the two top header rows contains a single space (not an empty string). I read the file with:
df = pd.read_csv('Input.csv', header=[0, 1, 2])
and its content is:
Group_1 Group_2
Sub_Group_1 Sub_Group_1
Maturity 7/31/20 7/31/20
0 0.1 0.2 0.3
1 0.2 0.3 0.4
(you should have included this text in your question instead of a picture).
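Before melting, it may help to inspect the resulting column MultiIndex: the first column appears as a tuple whose two top levels are single spaces (the output below assumes the sample file from this answer):
print(df.columns.tolist())
# [(' ', ' ', 'Maturity'),
#  ('Group_1', 'Sub_Group_1', '7/31/20'),
#  ('Group_2', 'Sub_Group_1', '7/31/20')]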
The first step is to melt the DataFrame:
result = pd.melt(df, id_vars=[(' ', ' ', 'Maturity')],
var_name=['Group_Name', 'Sub_Group_Name', 'Date'])
Note that in the code above I specified id_vars as a list containing a single tuple, whose two top-level elements are spaces (this is why I made the assumption about the initial spaces in the source file).
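A caveat in case your real file differs from my assumption: if those first two header cells are truly empty (nothing before the first comma), pandas fills the missing levels with placeholder names, so the tuple in id_vars has to match them instead. A sketch under that alternative assumption:
# With empty (not space) header cells, pandas auto-names the missing
# levels, so the first column must be referenced by those placeholders:
result = pd.melt(df, id_vars=[('Unnamed: 0_level_0', 'Unnamed: 0_level_1', 'Maturity')],
                 var_name=['Group_Name', 'Sub_Group_Name', 'Date'])
Either way, the rename step below gives the first column a clean name.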
The result of the melt is:
( , , Maturity) Group_Name Sub_Group_Name Date value
0 0.1 Group_1 Sub_Group_1 7/31/20 0.2
1 0.2 Group_1 Sub_Group_1 7/31/20 0.3
2 0.1 Group_2 Sub_Group_1 7/31/20 0.3
3 0.2 Group_2 Sub_Group_1 7/31/20 0.4
The only thing left to do is rename the first column:
result.rename(columns={result.columns[0]: 'Maturity'}, inplace=True)
Now, when you print the result, you will get:
Maturity Group_Name Sub_Group_Name Date value
0 0.1 Group_1 Sub_Group_1 7/31/20 0.2
1 0.2 Group_1 Sub_Group_1 7/31/20 0.3
2 0.1 Group_2 Sub_Group_1 7/31/20 0.3
3 0.2 Group_2 Sub_Group_1 7/31/20 0.4
The row ordering is a bit different, but this is just how melt works. If that detail matters to you, reorder the rows according to your needs.
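For instance, to group rows by Maturity first (a sketch only; the sort keys are my guess at the order you want, so adjust them as needed):
# Sort by Maturity, then by group name, and rebuild the index:
result = result.sort_values(['Maturity', 'Group_Name'], ignore_index=True)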