
I have a pandas DataFrame containing some values:

                    id  pair      value  subdir
taylor_1e3c_1s_56C  taylor  6_13  -0.398716    run1 
taylor_1e3c_1s_56C  taylor  6_13  -0.397820    run2 
taylor_1e3c_1s_56C  taylor  6_13  -0.397310    run3 
taylor_1e3c_1s_56C  taylor  6_13  -0.390520    run4 
taylor_1e3c_1s_56C  taylor  6_13  -0.377390    run5 
taylor_1e3c_1s_56C  taylor  8_11  -0.393604    run1
taylor_1e3c_1s_56C  taylor  8_11  -0.392899    run2
taylor_1e3c_1s_56C  taylor  8_11  -0.392473    run3
taylor_1e3c_1s_56C  taylor  8_11  -0.389959    run4
taylor_1e3c_1s_56C  taylor  8_11  -0.387946    run5

What I would like to do is to isolate the rows that have the same index, id, and pair, compute the mean and the standard deviation over the value column, and put it all in a new DataFrame. Because I have then effectively averaged over all the possible values of subdir, that column should also be removed. So the output should look something like this:

                    id  pair      value     error
taylor_1e3c_1s_56C  taylor  6_13  -0.392351 0.013213
taylor_1e3c_1s_56C  taylor  8_11  -0.391376 0.016432

How should I do it in pandas?

A previous question showed me how to get just the mean, but it's not clear to me how to generalise this to also get the error on the mean (aka the standard deviation).

Thank you very much to everyone :)

  • The previous question provides more than enough information for you to solve this problem. Try googling "how to get the standard deviation of a column pandas." – pault Feb 18 '18 at 18:28
    Thanks, but it does not, otherwise I wouldn't be asking :) It's not clear to me how to generalise from adding one field to adding two, while I do know how to use std(). I've been programming in various languages for the past 10+ years and just started picking up pandas. If I can't figure it out, surely there must be several others that can't, though it might seem obvious to you. – Ferdinando Randisi Feb 19 '18 at 12:39

3 Answers


You could promote your index to a column and perform a single groupby:

import pandas as pd

df = pd.DataFrame([['taylor', '6_13', -0.398716, 'run1'],
                   ['taylor', '6_13', -0.397820, 'run2'], 
                   ['taylor', '8_11', -0.389959, 'run4'],
                   ['taylor', '8_11', -0.387946, 'run5']],
                  index=['taylor_1e3c_1s_56C', 'taylor_1e3c_1s_56C', 'taylor_1e3c_1s_56C', 'taylor_1e3c_1s_56C'],
                  columns=['id', 'pair', 'value', 'subdir'])

Original DataFrame

Promote index to column:

df['index'] = df.index

Index promoted to a column

Perform groupby operations:

new_df = df.groupby(['index', 'id', 'pair']).agg({'value': ['mean', 'std']})

Aggregated new DataFrame
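
To get exactly the layout asked for in the question (id and pair as regular columns plus value/error labels), a possible follow-up to the aggregation above (just a sketch, not part of the original answer) is to flatten the two-level value columns and move id and pair back out of the group index:

# Sketch: flatten the ('value', 'mean')/('value', 'std') columns to plain
# 'value'/'error' labels, then restore id and pair as ordinary columns,
# keeping the original string index.
new_df.columns = ['value', 'error']
new_df = new_df.reset_index(level=['id', 'pair'])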

jpp

To get the mean:

mean_df = df['value'].groupby(df['pair']).mean()

To get the standard deviation:

std_df = df['value'].groupby(df['pair']).std()

# Rename the Series to 'error' so it becomes the error column after concatenation
std_df = std_df.rename('error')

Concatenate the two Series into a new DataFrame:

new_df = pd.concat([mean_df, std_df], axis=1)

Hope it helps. Will put improved answer soon.
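
As a side note that is not part of this answer: on a newer pandas (0.25 or later), named aggregation can produce both columns, already labelled value and error, in a single groupby call:

# Named aggregation (pandas >= 0.25): one call that labels the mean 'value'
# and the standard deviation 'error'; like the steps above, this groups
# by 'pair' only.
new_df = df.groupby('pair')['value'].agg(value='mean', error='std')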

Rakmo
  • Thanks! This works, but it loses the index and id. Also the indices seem to be in different levels for some reason. Do you know how to fix this? – Ferdinando Randisi Feb 19 '18 at 13:04
  • The answer by @jp_data_analysis should work perfectly for your case. If not, let me know and I will improve my answer. – Rakmo Feb 19 '18 at 14:02

Here's a solution that I adapted from the answer to your previous question and this post on how to flatten a hierarchical index in columns.

# create dataframe
import pandas as pd
from io import StringIO

text = """                    id  pair      value  subdir
taylor_1e3c_1s_56C  taylor  6_13  -0.398716    run1 
taylor_1e3c_1s_56C  taylor  6_13  -0.397820    run2 
taylor_1e3c_1s_56C  taylor  6_13  -0.397310    run3 
taylor_1e3c_1s_56C  taylor  6_13  -0.390520    run4 
taylor_1e3c_1s_56C  taylor  6_13  -0.377390    run5 
taylor_1e3c_1s_56C  taylor  8_11  -0.393604    run1
taylor_1e3c_1s_56C  taylor  8_11  -0.392899    run2
taylor_1e3c_1s_56C  taylor  8_11  -0.392473    run3
taylor_1e3c_1s_56C  taylor  8_11  -0.389959    run4
taylor_1e3c_1s_56C  taylor  8_11  -0.387946    run5"""
df = pd.read_csv(StringIO(text), sep=r"\s+")

First do the aggregations using groupby() and agg() and then call reset_index():

df1 = df.groupby([df.index, df['id'], df['pair']]).agg({'value': ['mean', 'std']}).reset_index(level=[1,2])

Now rename columns:

df1.columns = ['id', 'pair', 'value', 'error']

Output:

#                        id  pair     value     error
#taylor_1e3c_1s_56C  taylor  6_13 -0.392351  0.008975
#taylor_1e3c_1s_56C  taylor  8_11 -0.391376  0.002359
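
If you would rather not hard-code the full column list, a small variation (a sketch in the spirit of the flattening approach linked above, not something from the original answer) derives the flat names from the column MultiIndex and then renames mean/std:

# Sketch: build flat column names from the two-level columns instead of
# hard-coding them, then rename the aggregates to the desired labels.
df1 = df.groupby([df.index, df['id'], df['pair']]).agg({'value': ['mean', 'std']}).reset_index(level=[1, 2])
df1.columns = [second or first for first, second in df1.columns]   # ('value', 'mean') -> 'mean', ('id', '') -> 'id'
df1 = df1.rename(columns={'mean': 'value', 'std': 'error'})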
pault