2

I have a pandas DataFrame containing some values:

                        id  pair      value  subdir
taylor_1e3c_1s_56C  taylor  6_13  -0.398716    run1 
taylor_1e3c_1s_56C  taylor  6_13  -0.397820    run2 
taylor_1e3c_1s_56C  taylor  6_13  -0.397310    run3 
taylor_1e3c_1s_56C  taylor  6_13  -0.390520    run4 
taylor_1e3c_1s_56C  taylor  6_13  -0.377390    run5 
taylor_1e3c_1s_56C  taylor  8_11  -0.393604    run1
taylor_1e3c_1s_56C  taylor  8_11  -0.392899    run2
taylor_1e3c_1s_56C  taylor  8_11  -0.392473    run3
taylor_1e3c_1s_56C  taylor  8_11  -0.389959    run4
taylor_1e3c_1s_56C  taylor  8_11  -0.387946    run5

what I would like to do is to isolate the rows that have the same index, id, and pair, compute the mean over the value column, and put it all in a new dataframe. Because I have now effectively averaged over all the possible values of subdir, that column should also be removed. So the output should look something like this

                        id  pair      value
taylor_1e3c_1s_56C  taylor  6_13  -0.392351
taylor_1e3c_1s_56C  taylor  8_11  -0.391376

How should I do it in pandas?

Ferdinando Randisi
  • 4,068
  • 6
  • 32
  • 43

1 Answers1

3

Use syntactic sugar - groupby by Series and indices and aggregate mean:

df = df['value'].groupby([df.index, df['id'], df['pair']]).mean().reset_index(level=[1,2])
print (df)
                        id  pair     value
taylor_1e3c_1s_56C  taylor  6_13 -0.392351
taylor_1e3c_1s_56C  taylor  8_11 -0.391376

Classic solution - first reset_index for column from indices and then groupby by columns names and aggregate mean:

df = df.reset_index().groupby(['index','id','pair'])['value'].mean().reset_index(level=[1,2])
print (df)
                        id  pair     value
index                                     
taylor_1e3c_1s_56C  taylor  6_13 -0.392351
taylor_1e3c_1s_56C  taylor  8_11 -0.391376

Detail:

print (df.reset_index())
                index      id  pair     value subdir
0  taylor_1e3c_1s_56C  taylor  6_13 -0.398716   run1
1  taylor_1e3c_1s_56C  taylor  6_13 -0.397820   run2
2  taylor_1e3c_1s_56C  taylor  6_13 -0.397310   run3
3  taylor_1e3c_1s_56C  taylor  6_13 -0.390520   run4
4  taylor_1e3c_1s_56C  taylor  6_13 -0.377390   run5
5  taylor_1e3c_1s_56C  taylor  8_11 -0.393604   run1
6  taylor_1e3c_1s_56C  taylor  8_11 -0.392899   run2
7  taylor_1e3c_1s_56C  taylor  8_11 -0.392473   run3
8  taylor_1e3c_1s_56C  taylor  8_11 -0.389959   run4
9  taylor_1e3c_1s_56C  taylor  8_11 -0.387946   run5

After aggregate mean get MultiIndex with 3 levels:

print (df.reset_index().groupby(['index','id','pair'])['value'].mean())
index               id      pair
taylor_1e3c_1s_56C  taylor  6_13   -0.392351
                            8_11   -0.391376
Name: value, dtype: float64

So is necessesary reset_index for convert second ant third level to columns:

print (df.reset_index()
        .groupby(['index','id','pair'])['value']
        .mean()
        .reset_index(level=[1,2]))
                        id  pair     value
index                                     
taylor_1e3c_1s_56C  taylor  6_13 -0.392351
taylor_1e3c_1s_56C  taylor  8_11 -0.391376
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Thanks much! I figured out most of it by looking the different concepts up, but why do I need to reset the index? And what does it mean to resect it at level 1,2? – Ferdinando Randisi Feb 07 '18 at 13:58
  • I edit answer, thank you for comment. If something unclear, give me know. – jezrael Feb 07 '18 at 14:01