1

I have a DataFrame with 40 columns (columns 0 through 39) and I want to group them four at a time:

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.binomial(1, 0.2, (100, 40)))

new_df["0-3"] = df[0] + df[1] + df[2] + df[3]
new_df["4-7"] = df[4] + df[5] + df[6] + df[7]
...
new_df["36-39"] = df[36] + df[37] + df[38] + df[39]

Can I do this in a single statement (or in a better way than summing them separately)? The column names in the new DataFrame are not important.

ayhan
  • 70,170
  • 20
  • 182
  • 203

4 Answers4

2

You could select out the columns and sum on the row axis, like this.

df['0-3'] = df.loc[:, 0:3].sum(axis=1)

A couple things to note:

  1. Summing like this will ignore missing data while df[0] + df[1] ... propagates it. Pass skipna=False if you want that behavior.
  2. Not necessarily any performance benefit, may actually be a little slower.
chrisb
  • 49,833
  • 8
  • 70
  • 70
2

Here's another way to do it:

new_df = df.transpose()  
new_df['Group'] = new_df.index / 4  
new_df = new_df.groupby('Group').sum().transpose()

Note that the divide-by operation here is integer division, not floating-point division.

DalekSec
  • 151
  • 4
1

I don't know if it is the best way to go but I ended up using MultiIndex:

df.columns = pd.MultiIndex.from_product((range(10), range(4)))
new_df = df.groupby(level=0, axis=1).sum()

Update: Probably because of the index, this was faster than the alternatives. The same can be done with df.groupby(df.columns//4, axis=1).sum() faster if you take into account the time for constructing the index. However, the index change is a one time operation and I update the df and take the sum thousands of times so using a MultiIndex was faster for me.

ayhan
  • 70,170
  • 20
  • 182
  • 203
  • Please read: https://stackoverflow.com/questions/49970526/need-help-to-solve-the-unnamed-and-to-change-it-in-dataframe-in-pandas/49970566#comment86991353_49970566 – cs95 Apr 23 '18 at 18:20
0

Consider a list comprehension:

df = # your data
df_slices = [df.iloc[x:x+4] for x in range(10)]

Or more generally

df_slices = [df.iloc[x:x+4] for x in range(len(df.columns)/4)]
Alex
  • 2,154
  • 3
  • 26
  • 49