0

I have a sample data as shown below


ProductionDate CostCenterKey AEMainCategoryKey AELossMagnitude
1/1/2022 100030 OAE 84.94871412
1/1/2022 100030 Overdeclared -0.010897228
1/1/2022 100030 UL 15.06218311
1/1/2022 100040 OAE 49.99603575
1/1/2022 100040 UL 50.00001425
1/1/2022 100040 Undeclared 0.003950003
1/2/2022 100030 OAE 71.58823183
1/2/2022 100030 UL 28.36946736
1/2/2022 100030 Undeclared 0.042300804
1/2/2022 100040 OAE 49.99702425
1/2/2022 100040 UL 50.00002575
1/2/2022 100040 Undeclared 0.002950002

I need to transpose AeMaincategoryKey column and need the output in the below form:

There should be 1 row for the combination of ProductionDate and CostCenterKey

ProductionDate CostCenterKey OAE Overdeclared UL Undeclared
1/1/2022 100030 84.94871412 -0.010897228 15.0621831 0
1/1/2022 100040 49.99603575 0 50.0000143 0.00395
1/2/2022 100030 71.58823183 0 28.3694674 0.0423008
1/2/2022 100040 49.99702425 0 50.0000258 0.00295

I am writing the below code but its not resulting the desired output.

from pyspark.sql import SparkSessionimport pandas as pd

##creating a Spark 
Dataframespark_df = sqlContext.sql("select * from hive_metastore.asseteffectiveness.asset_effectiveness_maincat where productiondate  in ('2022-01-01','2022-01-02') and costcenterkey in (100030,100040)")
##Converting to Spark Dataframepandas_df = spark_df.toPandas()
pandas_df.pivot_table(index=['ProductionDate','CostCenterKey'], columns=['AEMainCategoryKey'], values='AELossMagnitude',   fill_value=0)
display(pandas_df)
  • Does this answer your question? [Pivot String column on Pyspark Dataframe](https://stackoverflow.com/questions/37486910/pivot-string-column-on-pyspark-dataframe) – Chris Nov 07 '22 at 16:04
  • I tried the way suggested in the link but it result the below error:'Series' objects are mutable, thus they cannot be hashed. Also i dont want avg of the 'AELossMagnitude'.But it should be the exact value for combination of ProductionDate and CostCenterKey – Faizan Arefin Nov 07 '22 at 16:24
  • If you pay attention to the question wizard, it indicates not to post pictures of data. There's no way for me to copy your sample data and try it on my end, otherwise I would go into pyspark and find you an answer. If you can paste your actual data in and format it as code using the formatting bar, I will take a look at it. – Chris Nov 07 '22 at 16:50
  • Hi Chris Just pasted the table data.Now you can copy i guess.Meanwhile i will try the below code and let you know. – Faizan Arefin Nov 07 '22 at 17:18

1 Answers1

0

From original format

   ProductionDate CostCenterKey AEMainCategoryKey  AELossMagnitude
0          1/1/22        100030               OAE        84.948714
1          1/1/22        100030      Overdeclared        -0.010897
2          1/1/22        100030                UL        15.062183
3          1/1/22        100040               OAE        49.996036
4          1/1/22        100040                UL        50.000014
5          1/1/22        100040        Undeclared         0.003950
6          1/2/22        100030               OAE        71.588232
7          1/2/22        100030                UL        28.369467
8          1/2/22        100030        Undeclared         0.042301
9          1/2/22        100040               OAE        49.997024
10         1/2/22        100040                UL        50.000026
11         1/2/22        100040        Undeclared         0.002950

Which I recreated in my code with this:

df = pd.DataFrame({'ProductionDate': ['1/1/22', '1/1/22', '1/1/22', '1/1/22',
                                      '1/1/22', '1/1/22', '1/2/22', '1/2/22',
                                      '1/2/22', '1/2/22', '1/2/22', '1/2/22'],
                   'CostCenterKey': ['100030', '100030', '100030',
                                     '100040', '100040', '100040',
                                     '100030', '100030', '100030',
                                     '100040', '100040', '100040'],
                   'AEMainCategoryKey': ['OAE', 'Overdeclared', 'UL',
                                         'OAE', 'UL', 'Undeclared',
                                         'OAE', 'UL', 'Undeclared',
                                         'OAE', 'UL', 'Undeclared',],
                   'AELossMagnitude': [84.94871412, -0.010897228, 15.06218311,
                                       49.99603575, 50.00001425, 0.003950003,
                                       71.58823183, 28.36946736, 0.042300804,
                                       49.99702425, 50.00002575, 0.002950002]})

I have two versions.

Version 1

df3 = df.groupby(['CostCenterKey', 'ProductionDate']).first().unstack(
    'ProductionDate').reset_index().dropna(axis='columns')
print(df3)

Result of version 1

               CostCenterKey AEMainCategoryKey  ... AELossMagnitude           
ProductionDate                          1/1/22  ...          1/1/22     1/2/22
0                     100030               OAE  ...       84.948714  71.588232
1                     100040               OAE  ...       49.996036  49.997024

Version 2

df2 = df.pivot(index=['CostCenterKey', 'ProductionDate'],
               columns=['AEMainCategoryKey'], values=[
        'AELossMagnitude']).reset_index().dropna(axis='columns')
print(df2)

Result of version 2

                  CostCenterKey ProductionDate AELossMagnitude           
AEMainCategoryKey                                          OAE         UL
0                        100030         1/1/22       84.948714  15.062183
1                        100030         1/2/22       71.588232  28.369467
2                        100040         1/1/22       49.996036  50.000014
3                        100040         1/2/22       49.997024  50.000026
Whoeza
  • 77
  • 7
  • This is not working as expected.ProductionDate and CostCenterKey should show as it is. Values of AEMainCategoryKey should show as column header and AELossMagnitude below it.Basically transpose of AEMainCategoryKey and .AELossMagnitude – Faizan Arefin Nov 07 '22 at 17:27
  • CostCenterKey and ProductionDate are there, but the difference is that I wrote them in reverse order (my mistake). – Whoeza Nov 07 '22 at 17:55
  • You can see the AEMainCategoryKey headers (OAE, UL, ...) and AELossMagnitude (values of your new columns OAE, UL, ...) – Whoeza Nov 07 '22 at 17:56
  • I mean, the data seems to be there, but maybe it's not in the table format you wanted. :/ – Whoeza Nov 07 '22 at 17:57
  • As a tip, if you build an index with the four columns names, then you might be able to do it with unstack(). https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html#reshaping-by-stacking-and-unstacking – Whoeza Nov 07 '22 at 18:32
  • Hi Whoeza The code that you provided worked so far. – Faizan Arefin Nov 07 '22 at 21:35
  • Glad to hear it, @FaizanArefin . Can you up my answer by +1 point to help me look smarter on my profile? :-)) – Whoeza Nov 08 '22 at 17:50