
How do I make this work?

df['sum_greater_then'] = df.groupby(['scan_number', 'raw_file']).sort_values('rank', ascending=False)['intensity'].cumsum()

I need to normalize the data. To do this, I first need to add up intensities based on their rank within a spectrum. One spectrum is the set of rows where 'scan_number' and 'raw_file' don't change.

intensity  rank  scan_number  raw_file
1.057      3     3006         01640a_BA5
4.03       1     3006         01640a_BA5
2.05       2     3006         01640a_BA5

These are the calculations that I need to do:

Rank 1: (4.03 + 2.05 + 1.057) / total = 7.137 / 7.137 = 1

Rank 2: (2.05 + 1.057) / total = 0.435

Rank 3: 1.057 / total = 0.148

There are other columns in the dataframe, but I don't need them for this specific calculation.

I have already ranked each intensity, calculated the total, and created a column for each of those values. I just need to add up the right intensities and divide them by the total to get the normalized value.
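For reference, the example above can be reproduced with a minimal dataframe like this (a sketch; the 'total' column mirrors the per-group total the question says already exists):

```python
import pandas as pd

# Example spectrum from the question: one ('scan_number', 'raw_file')
# group with three ranked intensities.
df = pd.DataFrame({
    'intensity': [1.057, 4.03, 2.05],
    'rank': [3, 1, 2],
    'scan_number': [3006, 3006, 3006],
    'raw_file': ['01640a_BA5'] * 3,
})

# Per-group total intensity (7.137 for this spectrum).
df['total'] = df.groupby(['scan_number', 'raw_file'])['intensity'].transform('sum')
```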

This doesn't work either:

df['sum_greater_then'] = df.groupby(['scan_number', 'raw_file']).apply(lambda x: x.sort_values('rank', ascending=False)['intensity'].cumsum())

Sara

3 Answers


This is a little clunky, but I believe it achieves what you want. First sort based on the rank, then group by and take the cumulative sum. Afterwards, transform the data to normalize by the total sum:

x = df.sort_values('rank', ascending=False).groupby(['scan_number', 'raw_file'])['intensity'].cumsum()
df['sum_greater_than'] = x
df['sum_greater_than'] = df.groupby(['scan_number', 'raw_file'])['sum_greater_than'].transform(lambda x: x / x.max())

Here's the final result:

   intensity  rank  scan_number    raw_file  sum_greater_than
0      1.057     3         3006  01640a_BA5          0.148101
1      4.030     1         3006  01640a_BA5          1.000000
2      2.050     2         3006  01640a_BA5          0.435337

I guess you can do this in one (long) line if you use a custom apply function:

df['sum_greater_than'] = df.sort_values('rank', ascending=False).groupby(['scan_number', 'raw_file'])['intensity'].apply(lambda x: x.cumsum() / x.sum())
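One caveat: depending on your pandas version, the `apply` one-liner may come back with the group keys prepended to the index (in pandas 2.0+ `group_keys` defaults to `True` for `groupby`), which breaks the column assignment. A cumsum/transform combination sidesteps the alignment issue entirely, since `groupby.cumsum` preserves the original index; a sketch using the example data from the question:

```python
import pandas as pd

df = pd.DataFrame({
    'intensity': [1.057, 4.03, 2.05],
    'rank': [3, 1, 2],
    'scan_number': [3006, 3006, 3006],
    'raw_file': ['01640a_BA5'] * 3,
})

keys = ['scan_number', 'raw_file']
# cumsum keeps the original index, so the sort only affects which rows
# are summed first, not how the result aligns back into df.
cum = df.sort_values('rank', ascending=False).groupby(keys)['intensity'].cumsum()
df['sum_greater_than'] = cum / df.groupby(keys)['intensity'].transform('sum')
```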
Tom
df = df.sort_values('rank', ascending=False)
df['sum_intensities'] = df.groupby(['scan_number', 'raw_file'])['intensity'].cumsum()

Pandas groupby does not support sort_values().

Step 1: Sort based on a column (here 'rank').

Step 2: Group by, take the cumulative sum, and save it in a new column.
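Note that this gives the running sums only; to get the normalized values the question asks for, you would still divide by each group's total. A self-contained sketch continuing these two steps on the question's example data:

```python
import pandas as pd

df = pd.DataFrame({
    'intensity': [1.057, 4.03, 2.05],
    'rank': [3, 1, 2],
    'scan_number': [3006, 3006, 3006],
    'raw_file': ['01640a_BA5'] * 3,
})

# Step 1: sort by rank, descending
df = df.sort_values('rank', ascending=False)
# Step 2: running sum within each spectrum
df['sum_intensities'] = df.groupby(['scan_number', 'raw_file'])['intensity'].cumsum()
# Step 3 (added): divide by the per-group total to normalize
df['normalized'] = df['sum_intensities'] / df.groupby(['scan_number', 'raw_file'])['intensity'].transform('sum')
```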


You don't need to use rank unless the calculation for rank is more complex than in this example. Just use sort_values then groupby:

Data set up:

import numpy as np
import pandas as pd

df = pd.DataFrame({'val': np.random.randint(low=1, high=100, size=10),
                   'name': ['a']*5 + ['b']*5})
df.sort_values(by=['name', 'val'], ascending=[True, True], ignore_index=True, inplace=True)

If you have small data:

# cumulative sum within each group
df['cumsum'] = df.groupby('name')['val'].cumsum()

# normalized cumulative sum (cumsum divided by the group total)
df['grsum'] = df.groupby('name')['val'].transform('sum')
df['norm_cumsum'] = df['cumsum'] / df['grsum']

df

If you have big data, using transform will slow things down a lot; the merge method instead makes it much faster:

# sub dataframe sum by group
dfsum = df.groupby('name')[['val']].sum().reset_index().rename(columns={'val':'sum'})

# merge
df = pd.merge(df, dfsum, how='left', on='name')

The rest is similar to the above.
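Putting the merge version together end to end (a sketch; the final division step and the `norm_cumsum` name are filled in here to match the small-data version):

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # for reproducibility of the random sample
df = pd.DataFrame({'val': np.random.randint(low=1, high=100, size=10),
                   'name': ['a']*5 + ['b']*5})
df.sort_values(by=['name', 'val'], ascending=[True, True], ignore_index=True, inplace=True)

# per-group running sum
df['cumsum'] = df.groupby('name')['val'].cumsum()

# per-group total via a merged lookup table instead of transform
dfsum = df.groupby('name')[['val']].sum().reset_index().rename(columns={'val': 'sum'})
df = pd.merge(df, dfsum, how='left', on='name')

# normalized running sum; the last row of each group is 1.0
df['norm_cumsum'] = df['cumsum'] / df['sum']
```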

PTQuoc