Given the following data frame:

In [8]:
df
Out[8]:
Experiment  SampleVol   Mass
0   A   1   11
1   A   1   12
2   A   2   20
3   A   2   17
4   A   2   21
5   A   3   28
6   A   3   29
7   A   4   35
8   A   4   38
9   A   4   35
10  B   1   12
11  B   1   11
12  B   2   22
13  B   2   24
14  B   3   30
15  B   3   33
16  B   4   37
17  B   4   42
18  C   1   8
19  C   1   7
20  C   2   17
21  C   2   19
22  C   3   29
23  C   3   30
24  C   3   31
25  C   4   41
26  C   4   44
27  C   4   42

I would like to run a correlation study on the data for each Experiment. Specifically, I want to calculate the correlation of 'SampleVol' with the mean of 'Mass' at each sample volume.

The groupby function gives me the mean mass for each group:

grp = df.groupby(['Experiment', 'SampleVol'])
grp.mean()

Out[17]:
                       Mass
Experiment  SampleVol   
A            1         11.500000
             2         19.333333
             3         28.500000
             4         36.000000
B            1         11.500000
             2         23.000000
             3         31.500000
             4         39.500000
C            1          7.500000
             2         18.000000
             3         30.000000
             4         42.333333

I understand that, for each experiment's data frame, I can use a numpy function to compute the correlation coefficient. My question is: how can I iterate over the data frames for each Experiment?

Here is an example of the desired output:

Out[18]:

Experiment  Slope   Intercept
A            0.91   0.01
B            1.1    0.02
C            0.95   0.03

Thank you very much.

ju.

1 Answer

You'll want to group on just the 'Experiment' column, rather than on the two columns as you have above. You can then iterate through the groups and run a simple linear regression on each group's values using the code below:

from scipy import stats
import pandas as pd
import numpy as np

# Group by a single string key so that each group name is a plain label
grp = df.groupby('Experiment')

output = pd.DataFrame(columns=['Slope', 'Intercept'])

for name, group in grp:
    # Simple linear regression of Mass on SampleVol within this experiment
    slope, intercept, r_value, p_value, std_err = stats.linregress(group['SampleVol'], group['Mass'])
    output.loc[name] = [slope, intercept]

print(output)
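
Note that the loop above fits each regression to the raw rows. If the fit should instead use the mean Mass per SampleVol, as the question describes, one possible sketch follows. It assumes the question's data, rebuilt inline so the snippet runs on its own; the names means, rows, and result are illustrative and not part of the original code. The two approaches can give different slopes when the sample volumes have unequal numbers of rows.

```python
import pandas as pd
from scipy import stats

# Sample data copied from the question
df = pd.DataFrame({
    'Experiment': list('A' * 10 + 'B' * 8 + 'C' * 10),
    'SampleVol': [1, 1, 2, 2, 2, 3, 3, 4, 4, 4,
                  1, 1, 2, 2, 3, 3, 4, 4,
                  1, 1, 2, 2, 3, 3, 3, 4, 4, 4],
    'Mass': [11, 12, 20, 17, 21, 28, 29, 35, 38, 35,
             12, 11, 22, 24, 30, 33, 37, 42,
             8, 7, 17, 19, 29, 30, 31, 41, 44, 42],
})

# Mean Mass per (Experiment, SampleVol), kept as ordinary columns
means = df.groupby(['Experiment', 'SampleVol'], as_index=False)['Mass'].mean()

rows = {}
for name, group in means.groupby('Experiment'):
    # Regress the mean Mass on SampleVol within this experiment
    fit = stats.linregress(group['SampleVol'], group['Mass'])
    rows[name] = {'Slope': fit.slope, 'Intercept': fit.intercept, 'R': fit.rvalue}

# One row per experiment, with the fit parameters and correlation coefficient
result = pd.DataFrame.from_dict(rows, orient='index')
result.index.name = 'Experiment'
print(result)
```

The R column is the correlation coefficient between SampleVol and the mean Mass, which is the quantity the question originally asked about.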

For those curious, this is how I generated the dummy data:

df = pd.DataFrame()
df['Experiment'] = np.array(pd.date_range('2018-01-01', periods=12, freq='6h').strftime('%a'))
df['SampleVol'] = np.random.uniform(1,5,12)
df['Mass'] = np.random.uniform(10,42,12)

constantstranger
moue