1

Input code:

import pandas as pd
import numpy as np

#Dummy df:
df = pd.DataFrame({'Name': ['John', 'Boby', 'Mina', 'Peter',
'Nicky','Peter','Mina','Peter'],
           'City': ['London','NY','LA','London','NY','HK','NY','HK'],

'Stage': ['Masters', 'Graduate', 'Graduate', 'Masters',
'Graduate','Masters','Graduate','Graduate'],
'Year':[2020,2019,2020,2019,2020,2019,2020,2020],
'Month':[202001,201902,202003,201904,202005,201902,202007,202012],
'Earnings': [27, 23, 21, 66, 24,22,34,65]})

df_pivot=pd.pivot_table(df,values = 'Earnings', index=
['Name','City','Stage'], columns = ['Year','Month'], aggfunc=np.sum,
fill_value=0, margins = True).sort_values('All', ascending=False)
print(df_pivot)

Output pivot table:

Year                    2019          2020                              
All
Month                 201902 201904 202001 202003 202005 202007 202012     
Name  City   Stage                                                         
All                       45     66     27     21     24     34     65  282
Peter London Masters       0     66      0      0      0      0      0   66
      HK     Graduate      0      0      0      0      0      0     65   65
Mina  NY     Graduate      0      0      0      0      0     34      0   34
John  London Masters       0      0     27      0      0      0      0   27
Nicky NY     Graduate      0      0      0      0     24      0      0   24
Boby  NY     Graduate     23      0      0      0      0      0      0   23
Peter HK     Masters      22      0      0      0      0      0      0   22
Mina  LA     Graduate      0      0      0     21      0      0      0   21

Desired output sorted firstly by first column, then within the group by second column and lastly within the group by 3rd column:

Year                    2019          2020                              All
Month                 201902 201904 202001 202003 202005 202007 202012     
Name  City   Stage                                                         
All                       45     66     27     21     24     34     65  282
Peter HK     Graduate      0      0      0      0      0      0     65   65
             Masters      22      0      0      0      0      0      0   22
      London Masters       0     66      0      0      0      0      0   66
Mina  NY     Graduate      0      0      0      0      0     34      0   34
      LA     Graduate      0      0      0     21      0      0      0   21
John  London Masters       0      0     27      0      0      0      0   27
Nicky NY     Graduate      0      0      0      0     24      0      0   24
Boby  NY     Graduate     23      0      0      0      0      0      0   23

Please note how Peter-HK is higher than Peter-London, because sum of Peter-HK (65+22) > sum of Peter-London (66).

In other words: First give me Name with biggest total, then within that name give me City with Biggest total, then within that Name and that City give me Stage with biggest total.

Thank you pawel

  • Not certain what the final result should look like. Have you tried to sort again after sorting by `'All'`. Like this: `df_pivot.sort_values('All', ascending=False).sort_index()` – piRSquared Feb 24 '21 at 15:17
  • Hi Pawel, could you show what the output should look like? – Alonso Ogueda Oliva Feb 24 '21 at 15:24
  • Hello and thank you for quick respond! The end result - I have attached as screen shot from excel. In short words I want to sort first column by "All", then second column by "All" and third column by "All". it would mean that for end result "Peter" is on top as All is (60+23), then for Peter in column City I want to have first HK as its value is 60 and then London with value 23. Does it make sense? Can you look on attached screen shoot, as I am unable to paste text, no idea why.. Thank you! – Paweł Poprawski Feb 24 '21 at 15:25
  • chain a sort index? `df_pivot.sort_values(by="All", ascending=False).sort_index()` ? – anky Feb 24 '21 at 15:27
  • I have updated post - in bottom I wrote the expected result. – Paweł Poprawski Feb 24 '21 at 15:30
  • @anky Can you check bottom of my post where I put expected result, I have tried your formula, but it only moves "All" row from top to bottom. Thank you – Paweł Poprawski Feb 24 '21 at 15:37
  • @Alonso Ogueda Olivia, I have updated bottom of my post with expected results. Thank you – Paweł Poprawski Feb 24 '21 at 15:40
  • @piRSquared, I have updated bottom of my post with expected results. Thank you – Paweł Poprawski Feb 24 '21 at 15:40
  • Is there a way to get the data by using a groupby then pivot the results – Golden Lion Feb 26 '21 at 00:30

2 Answers2

2

Edit after understanding the question even better.

You want to sort on maximal score obtained by a person (defined by Name). Then within that person you want to sort on the individual scores obtained by that person.

In your example, I can get the list with the desired sequence of Name in this way:

import pandas as pd
import numpy as np

#Dummy df:
df = pd.DataFrame({'Name': ['John', 'Boby', 'Mina', 'Peter', 
'Nicky','Peter','Mina','Peter'],
               'City': ['London','NY','LA','London','NY','HK','NY','HK'],

  'Stage': ['Masters', 'Graduate', 'Graduate', 'Masters', 
  'Graduate','Masters','Graduate','Graduate'],
  'Year':[2020,2019,2020,2019,2020,2019,2020,2020],
  'Month':[202001,201902,202003,201904,202005,201902,202007,202012],
  'Earnings': [27, 23, 21, 23, 24,22,34,65]})

# Make the pivot table
df_pivot=pd.pivot_table(df,values = 'Earnings', index= 
  ['Name','City','Stage'], columns = ['Year','Month'], aggfunc=np.sum, 
  fill_value=0, margins = True).sort_values('All', ascending=False)
print('Original table')
print(df_pivot)

def sort_groups(df, group_by_col, sort_by_col, F_asc):
    """Sort a dataframe by a certain level of the MultiIndex

    Args:
        df (pd.DataFrame): Dataframe to sort
        group_by_col (str): name of the index level to sort by
        sort_by_col (str): name of the value column to sort by
        F_asc (bool): Ascending sort - True/False

    Returns:
        pd.Dataframe: Dataframe sorted on given multiindex level
    """

    # Make a list of the desired index sequence based on the max value found in each group
    ind = df.groupby(by=group_by_col).max().sort_values(sort_by_col, ascending=F_asc).index.to_list()

    # Return re-indexed dataframe
    return df.reindex(ind, level=df.index.names.index(group_by_col))

# First level sorting: Name
df_pivot_1 = sort_groups(df_pivot, 'Name', 'All', False)
print('\nSort groups at name level:')
print(df_pivot_1)

# Second level sorting : City
#df_pivot_2 = df_pivot_1.groupby(by='Name').apply(lambda x : sort_groups(x, 'City', 'All', False))
df_pivot_2 =pd.concat([sort_groups(group, 'City', 'All', False) for index, group in df_pivot_1.groupby(by=['Name'])])
print('\nSort groups at city level:')
print(df_pivot_2)

# Third level sorting : Stage
df_pivot_3 = df_pivot_2.groupby(by = ['Name', 'City']).apply(lambda x : sort_groups(x, 'Stage', 'All', False))
print('\nSort groups at stage level:')
print(df_pivot_3)

This solution does not place the All row where you indicate it though. Is this very stringent for you?

regards,

Jan

  • Hi Jan, I tried your code, but it made only my "All" disappear, no sorting was applied. Can you check bottom of my post where I posted what I am expecting in the end. Thank you – Paweł Poprawski Feb 24 '21 at 15:33
  • Thanks for that, now it is clear for me what you intended to obtain. I think the edit of my original post reflects this. – Jan Alexander Feb 24 '21 at 21:38
  • thank you for your response, I have edited my post as with different values your proposal does not work. I will be grateful for help! – Paweł Poprawski Feb 25 '21 at 15:41
  • In the solution I posted yesterday, you could just move the `All` column manually I guess. – Jan Alexander Feb 26 '21 at 15:12
  • Thank you for your another response, it works better, but why I receive duplicated columns in this case? When I run same code as you posted I have twice column "name" in my table, if I expand sorting further (as I have df with more indexed columns, then all of them are duplicated). Can you run your code and you will see column name is duplicated? Thank you – Paweł Poprawski Feb 26 '21 at 15:42
  • That should not be the case of course. This looks more like a bug in pandas to me. I can avoid it using `df_pivot_2 =pd.concat([sort_groups(group, 'City', 'All', False) for index, group in df_pivot_1.groupby(by=['Name'])])` – Jan Alexander Feb 26 '21 at 18:51
  • Hello I did some tests with your code and it works on first look, but later not. Even if I change small thing in the initial table. Can you change : 'Earnings': [27, 23, 21, 23, 24,22,34,65]}) to 'Earnings': [27, 23, 21, 66, 24,22,34,65]}) You will see that Peter-London has "1st place" with value 66, but actually Peter-HK should be first with total value higher (65+22=87). Would you be able to provide the feedback? Apologies for the delay I had holidays. Thank you for your help! – Paweł Poprawski Mar 10 '21 at 16:24
  • I am sorry, but I do not understand what you mean. I do not understand why those two master scores should be summed. In the question you ask, you clearly indicate the scores have to be grouped by _city_ first and then by _stage_. Two scores from a different city (London and HK in this case) are correctly separated by the code if I test this and then correctly sorted as was asked in the original question. If you wish to combine the scores at the same level, that changes the original question. You should start with defining a new pivot table, then you can sort it with the function I defined. – Jan Alexander Mar 13 '21 at 21:21
  • I am sorry if I create confusion, what I ideally need is to sort data column by column and keep the grouping. If we look at first column I want to have totals per name "calculated" and sorted descending order, therefore you agree it should be: Peter (score: 66+65+22, we don't care about other indexes), then rest of the names. Now we have first column sorted and we "fix" it, we go to second column. We need to sort Cities within group, so within "Peter" we again calculate totals for cities and we have London 66 and HK (22+65), therefore City HK should be in first place as total for HK is – Paweł Poprawski Mar 16 '21 at 12:17
  • higher than total for London (for filter Peter). Then we do the same for the rest names/cities. Now we go to the 3rd column and for Peter and city HK we sort "Stage" based on totals again, we know that Graduate has higher value than Masters. – Paweł Poprawski Mar 16 '21 at 12:19
  • I have updated my original post again. Thank you – Paweł Poprawski Mar 16 '21 at 12:37
0

here is an super clean way to combine a groupby with a pivot

  df = pd.DataFrame({'Name': ['John', 'Boby', 'Mina', 'Peter', 
  'Nicky','Peter','Mina','Peter'],
           'City': ['London','NY','LA','London','NY','HK','NY','HK'],

 'Stage': ['Masters', 'Graduate', 'Graduate', 'Masters', 
 'Graduate','Masters','Graduate','Graduate'],
 'Year':[2020,2019,2020,2019,2020,2019,2020,2020],
 'Month':[202001,201902,202003,201904,202005,201902,202007,202012],
 'Earnings': [27, 23, 21, 23, 24,22,34,65]})

grouped=df.groupby(['Name','City','Stage','Year','Month'])['Earnings'].sum()
#print(grouped)
grouped=grouped.reset_index(name='Sum')
fp=grouped.pivot(index=['Name','City','Stage'],columns=['Year','Month'],values='Sum').fillna(0)
fp['Totals'] = fp.sum(axis='columns')
fp["Rank"] = fp.groupby(['Name','City'])['Totals'].sum()

fp = fp.sort_values(by=['Name','Rank','City','Totals'],ascending=[False,False,False,False])

print(fp)
Golden Lion
  • 3,840
  • 2
  • 26
  • 35
  • OP specifically requested the `Name` column and `City` column to stay grouped. – Jan Alexander Feb 26 '21 at 15:13
  • if you sort by totals than that rule conflicts. How would you resolve the issue – Golden Lion Feb 26 '21 at 16:28
  • @Golden Lion, I can see in your output that for Peter, City "London" is higher than HK, but sum for HK is higher than one for London, so it should be higher... So first I want to sort by name, then for each name I want to sort city and then for each city under specific name I want to sort stage, – Paweł Poprawski Feb 26 '21 at 20:54
  • I added a sort Boolean list. you can change the sort order for each column. This should give you the results you want. – Golden Lion Mar 01 '21 at 22:06
  • @Golden Lion, I tried your code but it still doesn't work. Sorting for whole "Peter" works, but if you look at "Mina" Mina - LA with value 21 is higher than MINA - NY with value 34... Will appreciate your help. – Paweł Poprawski Mar 10 '21 at 16:32
  • The sort is working by alpha. What do you think is wrong? The sort categories are correct. ‘Name','City','Stage','Totals' – Golden Lion Mar 10 '21 at 16:42
  • Are you saying the sort should be names, total, city, stage? Change the sort order is the solution and the boolean ascend or descend value – Golden Lion Mar 10 '21 at 16:45
  • I have changed your code as I want all columns to be sorted in descending order, so: fp = fp.sort_values(by=['Name','City','Stage','Totals'],ascending=[False, False , False,False]) if I understand correctly. Once you do this, you can see "name" is sorted correctly, but then on column "City" it is not sorted correctly, as we take Peter-London with value 23 is higher than Peter-Hk with total value of (22+65), it should be other way around. First sort on Names, then Sort City by total values (not including Stage) within Name group. Then we sort Stage within Name/City. Does it make sense? – Paweł Poprawski Mar 10 '21 at 16:53
  • The expected result should be: Peter HK Graduate 65 / Peter HK Masters 22/ Peter London Masters 23/ – Paweł Poprawski Mar 10 '21 at 16:54
  • based on what you said the second sort will be total descending. see new vs old – Golden Lion Mar 10 '21 at 16:54
  • This is the pivot, so the groups will have to stay grouped. i.e. in NEW output from you I can see: Peter has 2x City HK and it's split, there should be only one HK and should be higher than London as HK value (65+22) is higher than London value (23). – Paweł Poprawski Mar 10 '21 at 17:01
  • I have to think about this problem. you have two things happening. one is a sort by name and city. another is a change in group sort if the totals are greater. your sorting the city group based on the subtotal of the city. – Golden Lion Mar 10 '21 at 17:03
  • you can take the pivot table results and group by name and city for the subtotals. Create a sort column and assign it to the pivot table dataframe then sort on the sort column – Golden Lion Mar 10 '21 at 17:10
  • Would you be able to write a code for me? This is just a sample data I myself have many more indexed columns, where I need this sorting. Unless it's easier way to do it? Maybe not with pandas pivot? – Paweł Poprawski Mar 10 '21 at 17:13
  • this shows you the subtotal by name and city. The code for using it is not done – Golden Lion Mar 10 '21 at 17:36
  • (see) https://stackoverflow.com/questions/33899369/ranking-order-per-group-in-pandas This is what needs to happen. Rank based on the subtotal of name and city. – Golden Lion Mar 10 '21 at 17:39
  • I used the group by subtotal sub as the rank for the sorting. Looks good – Golden Lion Mar 10 '21 at 18:18
  • post the different set of data to a new question. According to your previous comments this code meets the requirements. you can visually inspect rank numbers and see that the sorting by name an rank will correctly align the groupings by subtotal amounts – Golden Lion Mar 11 '21 at 12:15
  • @Golden Lion, I have updated original post where I changed one value and your code does not work for it. Thank you Pawel – Paweł Poprawski Mar 16 '21 at 12:38
  • It is sort rank as the subtotal of the city. 65+22 is greater than 66, therefore HK sorts before London. Works as designed – Golden Lion Mar 16 '21 at 16:37
  • ... @Golden Lion, there is glitch in your code as I said if I change any number it doesn't work. Please see with 'Earnings': [27, 223, 21, 66, 24,22,34,65] We give "223" to Boby and after I run your code, Boby is still at the bottom of the list, but he should be at the top. – Paweł Poprawski Mar 18 '21 at 15:11
  • What are the subtotal groupings? I thought it was user and city. The ranking is a subtotal amount – Golden Lion Mar 19 '21 at 17:26
  • I just realized that you may not be subtotaling on the ranking. Instead, your looking for the max value. Change the groupby sum() to -> max() and test your output. Now in each group it will find the max value as the rank. – Golden Lion Mar 20 '21 at 12:33
  • I added max(). closely study the ranking value. – Golden Lion Mar 20 '21 at 12:40
  • Your output is incorrect, please see my first post. Peter HK should be above Peter London... as sum of Peter HK (65+22) is higher than Peter London (66). Do you understand? – Paweł Poprawski Mar 23 '21 at 08:32
  • No I don't understand. Neither max nor sum on city give the results you desire. – Golden Lion Mar 23 '21 at 13:46
  • sum on city will work but only within the same name... If we take Peter, which city should be higher in the ranking overall? HK, as it has two values of 65+22, which is bigger than London with single value of 66. I don't know how I can explain this, this is as simple as it gets. – Paweł Poprawski Mar 25 '21 at 08:57
  • https://stackoverflow.com/questions/55349106/pandas-groupby-and-sort-max-values/66912392#66912392 you can use transform to sort the result of a groupby max(). transform will sort the multi-index output – Golden Lion Apr 01 '21 at 22:40