Spark/Python Dataframe - Transpose rows to columns

Question

I have a sample data as shown below

ProductionDate	CostCenterKey	AEMainCategoryKey	AELossMagnitude
1/1/2022	100030	OAE	84.94871412
1/1/2022	100030	Overdeclared	-0.010897228
1/1/2022	100030	UL	15.06218311
1/1/2022	100040	OAE	49.99603575
1/1/2022	100040	UL	50.00001425
1/1/2022	100040	Undeclared	0.003950003
1/2/2022	100030	OAE	71.58823183
1/2/2022	100030	UL	28.36946736
1/2/2022	100030	Undeclared	0.042300804
1/2/2022	100040	OAE	49.99702425
1/2/2022	100040	UL	50.00002575
1/2/2022	100040	Undeclared	0.002950002

I need to transpose AeMaincategoryKey column and need the output in the below form:

There should be 1 row for the combination of ProductionDate and CostCenterKey

ProductionDate	CostCenterKey	OAE	Overdeclared	UL	Undeclared
1/1/2022	100030	84.94871412	-0.010897228	15.0621831	0
1/1/2022	100040	49.99603575	0	50.0000143	0.00395
1/2/2022	100030	71.58823183	0	28.3694674	0.0423008
1/2/2022	100040	49.99702425	0	50.0000258	0.00295

I am writing the below code but its not resulting the desired output.

from pyspark.sql import SparkSessionimport pandas as pd

##creating a Spark 
Dataframespark_df = sqlContext.sql("select * from hive_metastore.asseteffectiveness.asset_effectiveness_maincat where productiondate  in ('2022-01-01','2022-01-02') and costcenterkey in (100030,100040)")
##Converting to Spark Dataframepandas_df = spark_df.toPandas()
pandas_df.pivot_table(index=['ProductionDate','CostCenterKey'], columns=['AEMainCategoryKey'], values='AELossMagnitude',   fill_value=0)
display(pandas_df)

Does this answer your question? [Pivot String column on Pyspark Dataframe](https://stackoverflow.com/questions/37486910/pivot-string-column-on-pyspark-dataframe) — Chris, Nov 07 '22 at 16:04
I tried the way suggested in the link but it result the below error:'Series' objects are mutable, thus they cannot be hashed. Also i dont want avg of the 'AELossMagnitude'.But it should be the exact value for combination of ProductionDate and CostCenterKey — Faizan Arefin, Nov 07 '22 at 16:24
If you pay attention to the question wizard, it indicates not to post pictures of data. There's no way for me to copy your sample data and try it on my end, otherwise I would go into pyspark and find you an answer. If you can paste your actual data in and format it as code using the formatting bar, I will take a look at it. — Chris, Nov 07 '22 at 16:50
Hi Chris Just pasted the table data.Now you can copy i guess.Meanwhile i will try the below code and let you know. — Faizan Arefin, Nov 07 '22 at 17:18

Whoeza · Accepted Answer · 2022-11-07T17:24:28.107

From original format

   ProductionDate CostCenterKey AEMainCategoryKey  AELossMagnitude
0          1/1/22        100030               OAE        84.948714
1          1/1/22        100030      Overdeclared        -0.010897
2          1/1/22        100030                UL        15.062183
3          1/1/22        100040               OAE        49.996036
4          1/1/22        100040                UL        50.000014
5          1/1/22        100040        Undeclared         0.003950
6          1/2/22        100030               OAE        71.588232
7          1/2/22        100030                UL        28.369467
8          1/2/22        100030        Undeclared         0.042301
9          1/2/22        100040               OAE        49.997024
10         1/2/22        100040                UL        50.000026
11         1/2/22        100040        Undeclared         0.002950

Which I recreated in my code with this:

df = pd.DataFrame({'ProductionDate': ['1/1/22', '1/1/22', '1/1/22', '1/1/22',
                                      '1/1/22', '1/1/22', '1/2/22', '1/2/22',
                                      '1/2/22', '1/2/22', '1/2/22', '1/2/22'],
                   'CostCenterKey': ['100030', '100030', '100030',
                                     '100040', '100040', '100040',
                                     '100030', '100030', '100030',
                                     '100040', '100040', '100040'],
                   'AEMainCategoryKey': ['OAE', 'Overdeclared', 'UL',
                                         'OAE', 'UL', 'Undeclared',
                                         'OAE', 'UL', 'Undeclared',
                                         'OAE', 'UL', 'Undeclared',],
                   'AELossMagnitude': [84.94871412, -0.010897228, 15.06218311,
                                       49.99603575, 50.00001425, 0.003950003,
                                       71.58823183, 28.36946736, 0.042300804,
                                       49.99702425, 50.00002575, 0.002950002]})

I have two versions.

Version 1

df3 = df.groupby(['CostCenterKey', 'ProductionDate']).first().unstack(
    'ProductionDate').reset_index().dropna(axis='columns')
print(df3)

Result of version 1

               CostCenterKey AEMainCategoryKey  ... AELossMagnitude           
ProductionDate                          1/1/22  ...          1/1/22     1/2/22
0                     100030               OAE  ...       84.948714  71.588232
1                     100040               OAE  ...       49.996036  49.997024

Version 2

df2 = df.pivot(index=['CostCenterKey', 'ProductionDate'],
               columns=['AEMainCategoryKey'], values=[
        'AELossMagnitude']).reset_index().dropna(axis='columns')
print(df2)

Result of version 2

                  CostCenterKey ProductionDate AELossMagnitude           
AEMainCategoryKey                                          OAE         UL
0                        100030         1/1/22       84.948714  15.062183
1                        100030         1/2/22       71.588232  28.369467
2                        100040         1/1/22       49.996036  50.000014
3                        100040         1/2/22       49.997024  50.000026

This is not working as expected.ProductionDate and CostCenterKey should show as it is. Values of AEMainCategoryKey should show as column header and AELossMagnitude below it.Basically transpose of AEMainCategoryKey and .AELossMagnitude — Faizan Arefin, Nov 07 '22 at 17:27
CostCenterKey and ProductionDate are there, but the difference is that I wrote them in reverse order (my mistake). — Whoeza, Nov 07 '22 at 17:55
You can see the AEMainCategoryKey headers (OAE, UL, ...) and AELossMagnitude (values of your new columns OAE, UL, ...) — Whoeza, Nov 07 '22 at 17:56
I mean, the data seems to be there, but maybe it's not in the table format you wanted. :/ — Whoeza, Nov 07 '22 at 17:57
As a tip, if you build an index with the four columns names, then you might be able to do it with unstack(). https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html#reshaping-by-stacking-and-unstacking — Whoeza, Nov 07 '22 at 18:32
Glad to hear it, @FaizanArefin . Can you up my answer by +1 point to help me look smarter on my profile? :-)) — Whoeza, Nov 08 '22 at 17:50

Spark/Python Dataframe - Transpose rows to columns

1 Answers1