0

I have a data with >100k rows and I need to efficiently regroup it from the left DataFrame to the multiindexed right one which indices are sorted by the sum of values in the 3rd column and inside each index 2nd column values are sorted by values in the 3rd column. All sortings are descending.

I have no idea how to do it correctly and already spent whole day figuring it out.

 a   b  c           a sum b  c  %
foo one 1          foo 5 one 3 3/5
foo two 2                two 2 2/5
bar one 1    =>    baz 4 two 3 3/4
baz one 1                one 1 1/4
baz two 3          bar 3 six 2 2/3
foo one 2                one 1 1/3
bar six 2           

UPDATE: The code given by @jezrael works really good but it outputs in this way:

                  %
a   sum b   c      
foo 5   one 3  0.60
        two 2  0.40
        six NaN NaN
baz 4   two 3  0.75
        one 1  0.25
        six NaN NaN
bar 1   one 1  1.00
        two NaN NaN
        six NaN NaN

Is it possible to get rid of these strings with NaN?

UPDATE #2: I've found the problem which gives NaNs problem. This was caused by 'category' data type. How it affects the behavior of the code I don't know. Just pointing out the cause.

yol
  • 19
  • 6

1 Answers1

0

I believe need:

#aggregate sum by a, b columns
df = df.groupby(['a','b'], as_index=False)['c'].sum()
print (df)
     a    b  c
0  bar  one  1
1  baz  one  1
2  baz  two  3
3  foo  one  3
4  foo  two  2

#create new column by position with transform sum per a column
df.insert(1, 'sum', df.groupby('a')['c'].transform('sum'))
#division of columns
df['%'] = df['c'].div(df['sum'])
print (df)
     a  sum    b  c     %
0  bar    1  one  1  1.00
1  baz    4  one  1  0.25
2  baz    4  two  3  0.75
3  foo    5  one  3  0.60
4  foo    5  two  2  0.40

#sorting by multiple columns and create MultiIndex  
df = df.sort_values(['sum','c'], ascending=False).set_index(['a','sum','b', 'c'])
print (df)
                  %
a   sum b   c      
foo 5   one 3  0.60
        two 2  0.40
baz 4   two 3  0.75
        one 1  0.25
bar 1   one 1  1.00
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Thank you for such a fast reply. It seems to be working. Now I need to understand what's going on under the hood. Can you recommend any sources to read about such things? It seems like googling and reading docs didn't work well for me. Thank you again. – yol Jul 04 '18 at 09:22
  • @yol - Sure, in pandas docs are very nice [tutorials](http://pandas.pydata.org/pandas-docs/stable/tutorials.html), I like especially [modern pandas](http://pandas.pydata.org/pandas-docs/stable/tutorials.html#modern-pandas) – jezrael Jul 04 '18 at 09:25
  • The only problem I've found is that column 'b' has all possible occurrences of values (not only the ones with representation in 'c' column). – yol Jul 04 '18 at 09:34
  • @yol - Can you explain more? Maybe you can change data with expected output for better explanation. – jezrael Jul 04 '18 at 10:30
  • @yol - There is bad formating of comments, can you edite question? – jezrael Jul 04 '18 at 10:58
  • @yol - Not sure, if understand each other, I think [edit](https://stackoverflow.com/posts/51169517/edit) question. – jezrael Jul 04 '18 at 11:02
  • @yol - Interesting, what is your pandas version? I cannot simulate your problem. I add dataframes to answer, do you get same ouptut with data sample? – jezrael Jul 04 '18 at 11:07
  • Yes I understood you, I edited the question. I'm sorry, it feels like I'm actually trying to ask you to solve my task. I really appreciate your help. Pandas v.0.23 – yol Jul 04 '18 at 11:08
  • I tried to run your code right the way you edited your post and it seems like the problem occurs right after the first line of code. – yol Jul 04 '18 at 11:19
  • Waht is `print (df.to_dict())` ? – jezrael Jul 04 '18 at 11:22
  • What is your pandas version? – jezrael Jul 04 '18 at 11:24
  • 1
    I recreated exampled DataFrame and it works just fine while my real data still gives me these NaN. This means that your code works perfectly and the problem is with my data. Now I have to understand what it is. P.S. Output for working df: {'%': {('foo', 5, 'one', 3): 0.6, ...}} Output for real data is really big but seems to have the same pattern except for having these NaNs. P.P.S. Pandas v.0.23 as I mentioned before – yol Jul 04 '18 at 11:38
  • I've found the problem. If you're interested you can read the update#2 in my question. – yol Jul 04 '18 at 13:50
  • You can use `df = df.dropna(how='all')`, check also [this](https://stackoverflow.com/a/13434501) – jezrael Jul 04 '18 at 13:51
  • Thank you very much for support :) – yol Jul 04 '18 at 14:07
  • @yol - It is expected, check [this](http://pandas.pydata.org/pandas-docs/stable/categorical.html#operations). Just convert columns to `string` for prevent it. `cols = ['a','b']` and then `df[cols] = df[cols].astype(str)` – jezrael Jul 04 '18 at 14:10
  • 1
    Yes, I already did that. Actually I cancelled convertion to `category` which I was implementing to optimize memory usage. – yol Jul 04 '18 at 19:01