3

I am trying to take row means of every few columns. Here is a sample dataset.

d = {'2000-01': range(0,10), '2000-02': range(10,20), '2000-03': range(10,20),
     '2001-01': range(10,20), '2001-02':range(5,15), '2001-03':range(5,15)}
pd.DataFrame(data=d)

     2000-01  2000-02  2000-03  2001-01  2001-02  2001-03
0        0       10       10       10        5        5
1        1       11       11       11        6        6
2        2       12       12       12        7        7
3        3       13       13       13        8        8
4        4       14       14       14        9        9
5        5       15       15       15       10       10
6        6       16       16       16       11       11
7        7       17       17       17       12       12
8        8       18       18       18       13       13
9        9       19       19       19       14       14

I need to take row means of the first three columns and then the next three and so on in the complete dataset. I don't need the original columns in the new dataset. Here is my code. It works but with caveats (discussed below). I am searching for a cleaner, more elegant solution if possible. (New to Python/Pandas)

#Create empty list to store row means
d1 = []

#Run loop to find row means for every three columns
for i in np.arange(0, 6, 3):
    data1 = d.iloc[:,i:i+3]
    d1.append(data1.mean(axis=1))

#Create empty list to concat DFs later
dlist1 =[]

#Concat DFs
for j in range(0,len(d1)):
    dlist1.append(pd.Series(d1[j]).to_frame())
pd.concat(dlist1, axis = 1)

I get this output, which is correct:

          0          0
0   6.666667   6.666667
1   7.666667   7.666667
2   8.666667   8.666667
3   9.666667   9.666667
4  10.666667  10.666667
5  11.666667  11.666667
6  12.666667  12.666667
7  13.666667  13.666667
8  14.666667  14.666667
9  15.666667  15.666667

The columns names can easily be fixed, but the problem is that I need them in a specific format and I have 65 of these columns in the actual dataset. If you'll notice the column names in the original dataset, they are '2000-01'; '2000-02'; '2000-03'. The 1,2 and 3 are months of the year 2000, therefore column 1 of the new df should be '2000q1' , q1 being quarter 1. How do I loop over column names to create this for all my new columns? This seems significantly more challenging (at least to me!) than what's shown here. Thanks for your time!

EDIT: Ok this has been solved, quick shoutout to everyone who contributed!

2 Answers2

4

We have groupby for axis=1, here using the numpy array get the divisor

df=df.groupby(np.arange(df.shape[1])//3,axis=1).mean()
           0          1
0   6.666667   6.666667
1   7.666667   7.666667
2   8.666667   8.666667
3   9.666667   9.666667
4  10.666667  10.666667
5  11.666667  11.666667
6  12.666667  12.666667
7  13.666667  13.666667
8  14.666667  14.666667
9  15.666667  15.666667

#np.arange(df.shape[1])//3
#array([0, 0, 0, 1, 1, 1])

More common way

df.columns=pd.to_datetime(df.columns,format='%Y-%m').to_period('Q')
df=df.groupby(level=0,axis=1).mean()
      2000Q1     2001Q1
0   6.666667   6.666667
1   7.666667   7.666667
2   8.666667   8.666667
3   9.666667   9.666667
4  10.666667  10.666667
5  11.666667  11.666667
6  12.666667  12.666667
7  13.666667  13.666667
8  14.666667  14.666667
9  15.666667  15.666667
BENY
  • 317,841
  • 20
  • 164
  • 234
  • Nice! That works! Any thoughts on the column name situation? – Sankalp Sharma Jul 06 '20 at 00:24
  • @SankalpSharma share more common way to doing this – BENY Jul 06 '20 at 00:24
  • 1
    Nice one with `to_period` and then groupby. – Erfan Jul 06 '20 at 00:26
  • @Erfan you have the right idea - I actually found a similar approach in another thread. The problem is that at some point - my year will go from 2000 to 2001 and then to 2002 etc. What do I do about that? – Sankalp Sharma Jul 06 '20 at 00:28
  • @YOBEN_S How do I convert the 'Q' to small 'q' - I tried ```df.columns = df.columns.str.replace('Q','q')``` - but it's not working. Thoughts? Other than that the code works great! – Sankalp Sharma Jul 06 '20 at 01:07
  • @SankalpSharma df.columns = df.columns.astype(str).str.replace('Q','q') – BENY Jul 06 '20 at 01:20
  • 1
    Actually what ended up working was this: ```df.columns.to_series().astype(str).str.replace('Q','q')``` - the point being that the column names are a period-index and won't directly convert to ```str``` – Sankalp Sharma Jul 06 '20 at 01:24
2

Iterate with multiple of 3 and concat all the series:

df = (pd.concat([df.iloc[:, i:i+3].mean(1).rename(df.columns[i].split('-')[0]+'q1') 
      for i in range(0, df.shape[1], 3)], axis=1))

print(df)
      2000q1     2001q1
0   6.666667   6.666667
1   7.666667   7.666667
2   8.666667   8.666667
3   9.666667   9.666667
4  10.666667  10.666667
5  11.666667  11.666667
6  12.666667  12.666667
7  13.666667  13.666667
8  14.666667  14.666667
9  15.666667  15.666667
Space Impact
  • 13,085
  • 23
  • 48