2

I have the following dataframe :

datas = [['RAC1','CD0287',1.52,1.40,1.45,1.51], ['RAC1','CD0695',2.08,1.40,1.45,1.51], ['RAC1','ADN103-1',2.01,1.40,1.45,1.51], ['RAC3','CD0258',1.91,1.38,1.43,1.45], ['RAC3','ADN103-3',1.66,1.38,1.43,1.45], ['RAC8','CD0558',1.32,1.42,1.48,1.53], ['RAC8','ADN103-8',2.89,1.42,1.48,1.53]]
labels = ['Plate', 'Sample', 'LogRatio', 'm1', 'm2', 'm3']
df = pd.DataFrame(data = datas, columns=labels)

Plate  Sample   LogRatio  m1    m2    m3    
RAC1   CD0287    1.52     1.40  1.45  1.51
RAC1   CD0695    2.08     1.40  1.45  1.51
RAC1   ADN103-1  2.01     1.40  1.45  1.51
RAC3   CD0258    1.91     1.38  1.43  1.45
RAC3   ADN103-3  1.66     1.38  1.43  1.45
RAC8   CD0558    1.32     1.42  1.48  1.53
RAC8   ADN103-8  2.89     1.42  1.48  1.53

I would like to add a new column to calcul the mean M of m1, m2, m3 AND the value LogRatio of ADN103 but I don't know how to add just a value by plate. What I want is :

df['M'] = (df['m1'] + df['m2'] + df['m3'] + LogRatio_ADN103_of_the_plate)/4

For example for the first line of my dataframe, the calcul is :

df['M'] = (1.40 + 1.45 + 1.51 + 2.01) / 4

Plate  Sample   LogRatio  m1    m2    m3     M     
RAC1   CD0287    1.52     1.40  1.45  1.51   1,5925
RAC1   CD0695    2.08     1.40  1.45  1.51
RAC1   ADN103-1  2.01     1.40  1.45  1.51
RAC3   CD0258    1.91     1.38  1.43  1.45
RAC3   ADN103-3  1.66     1.38  1.43  1.45
RAC8   CD0558    1.32     1.42  1.48  1.53
RAC8   ADN103-8  2.89     1.42  1.48  1.53

Because 2.01 is the LogRatio value of ADN103 on plate RAC1. I know how to get the ADN103 value for all plates :

expreg = "ADN103_RAC."
ADN103 = df[df['Sample'].str.contains(expreg, regex=True)]
logRatio_ADN103 = ADN103['Log Ratio']

I tried a transformation in a new column by selecting only ADN103 values but I can't get their LogRatio value, it just retrun a boolean

df['ADN103oftheplate'] = df.groupby('Plate')['Sample'].transform(lambda x: x.str.contains(expreg, regex=True))

I don't know if it's clear. I tried so many ways I'm totally lost now.

Thanks for any help.

Elysire
  • 693
  • 10
  • 23
  • If you include the actual code to make your original dataframe it would be better than just copying and pasting the numbers. It makes it easier for other people to work with. – Alex Dec 14 '16 at 00:50
  • My dataframe actually come from a csv file I read and it's not a small one but I created a code to have the same dataframe that is in my example. Thanks for the advice. – Elysire Dec 14 '16 at 09:03

1 Answers1

2

It's helpful to think about what the nature of your different columns is. In this case your "Plate" and "Sample" columns really seem more like index information. So first I turned the "Plate" column into the index to make it easier to slice the dataframe:

import pandas as pd
import numpy as np

datas = [['RAC1','CD0287',1.52,1.40,1.45,1.51], ['RAC1','CD0695',2.08,1.40,1.45,1.51], ['RAC1','ADN103-1',2.01,1.40,1.45,1.51], ['RAC3','CD0258',1.91,1.38,1.43,1.45], ['RAC3','ADN103-3',1.66,1.38,1.43,1.45], ['RAC8','CD0558',1.32,1.42,1.48,1.53], ['RAC8','ADN103-8',2.89,1.42,1.48,1.53]]
labels = ['Plate', 'Sample', 'LogRatio', 'm1', 'm2', 'm3']
df = pd.DataFrame(data = datas, columns=labels)

df.set_index('Plate', inplace=True)

Then you can loop over the unique values in the "Plate" index and assign the appropriate Log_Value to a new column for that stretch of the dataframe.

for plate in df.index.unique():
    indx = np.where(df.loc[plate, 'Sample'].str.contains('ADN'))[0][0]
    temp_value = df.loc[plate, 'LogRatio'].iat[indx]
    df.loc[plate, 'ADN_LogValues'] = temp_value

Then just add up the last four columns and divide by four.

df['M'] = df.loc[:, 'm1':'ADN_LogValues'].sum(axis=1)/4.0

This produces:

         Sample  LogRatio    m1    m2    m3  ADN_LogValues       M
Plate
RAC1     CD0287      1.52  1.40  1.45  1.51           2.01  1.5925
RAC1     CD0695      2.08  1.40  1.45  1.51           2.01  1.5925
RAC1   ADN103-1      2.01  1.40  1.45  1.51           2.01  1.5925
RAC3     CD0258      1.91  1.38  1.43  1.45           1.66  1.4800
RAC3   ADN103-3      1.66  1.38  1.43  1.45           1.66  1.4800
RAC8     CD0558      1.32  1.42  1.48  1.53           2.89  1.8300
RAC8   ADN103-8      2.89  1.42  1.48  1.53           2.89  1.8300
Alex
  • 2,154
  • 3
  • 26
  • 49
  • Thank you it works. Is it possible to have the original indexes dataframe back once I have my M column added ? – Elysire Dec 16 '16 at 14:00
  • 1
    Sure, just add this at the end: df.reset_index(inplace=True) The accepted answer in this question gives a more thorough explanation of how to do it: http://stackoverflow.com/questions/20461165/how-to-convert-pandas-index-in-a-dataframe-to-a-column – Alex Dec 16 '16 at 16:54