1

I've got an imported template which is a table containing names of probes with assigned True or False (if the probes are to be used for QC purposes later), example:

ProbeName QC
probe 1 True
probe 2 True
probe 3 True
probe 4 False
probe 5 False

Secondly I have an imported list of samples, which includes probe names as a merge point and values for the probes.

Second import looks like this:

SampleName ProbeName Value
sample 1 probe 1  0
sample 1 probe 2  0
sample 1 probe 3  0
sample 1 probe 4  0 
sample 1 probe 5  0
sample 2 probe 1  0
sample 2 probe 2  0
sample 2 probe 3  0
sample 2 probe 4  0
sample 2 probe 5  0

Merged together this currently looks like this:

SampleName ProbeName Value QC
sample 1 probe 1  0 True
sample 1 probe 2  0 True
sample 1 probe 3  0 True
sample 1 probe 4  0 False
sample 1 probe 5  0 False
sample 2 probe 1  0 True
sample 2 probe 2  0 True
sample 2 probe 3  0 True
sample 2 probe 4  0 False
sample 2 probe 5  0 False

etc...

The index is defaulted to the number of lines. I've done this with the following code:

template = pd.read_csv("Template.txt", sep='\t') # importing template
datain = pd.read_csv("Data.txt", sep = '\t') # import sample data
data = pd.merge(datain, template, how='left') # merge template and sample data

I tried to make the sample name the index but for some reason when I called the data.values I still could see the numbered index and the sample name was no longer associated. The reason I have a template and merge data in is I have seperate files exported from a genomic analyser and wanted to use this raw output as my main input for my program. The template is there to add the True/False data to the revelant probes which would allow me to create and import different probe lists with different QC probes etc depending on the test analsyed etc.

Ideally I want to be able to iterate over each sample and then iterate over each probe and its values. For example, for all samples, what is the sum of the probes marked as True.

SampleName ProbeName Value
sample 1 probe 1  0
         probe 2  0
         probe 3  0
         probe 4  0 
         probe 5  0
sample 2 probe 1  1
         probe 2  1
         probe 3  1
         probe 4  1
         probe 5  1

I would then wish to be able to use the individual probe values for each sample in later calculations. What is the most efficient way of doing this?

If anyone can give me a rough idea of what I should do or if I am on the right track that would be very much appreciated.

Thank you for reading.

Pygen
  • 15
  • 3
  • 1
    Welcome to StackOverflow. Please take the time to read this post on [how to provide a great pandas example](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) as well as how to provide a [minimal, complete, and verifiable example](http://stackoverflow.com/help/mcve) and revise your question accordingly. These tips on [how to ask a good question](http://stackoverflow.com/help/how-to-ask) may also be useful. Can you also add sample of second Dataframe and expected output? – jezrael Mar 11 '18 at 11:02
  • Hello, thank you for reply. I've added an example for the 2nd import. Hopefully it makes more sense now? Thanks again – Pygen Mar 11 '18 at 12:21

1 Answers1

0

I believe you need filter by boolean indexing by column QC and aggregate sum, then map for new column:

df = pd.merge(datain, template, how='left')
print (df)
  SampleName ProbeName  Value     QC
0   sample 1   probe 1      5   True
1   sample 1   probe 2      4   True
2   sample 1   probe 3      3   True
3   sample 1   probe 4      2  False
4   sample 1   probe 5      1  False
5   sample 2   probe 1      1   True
6   sample 2   probe 2      4   True
7   sample 2   probe 3      3   True
8   sample 2   probe 4      2  False
9   sample 2   probe 5      1  False

s = df[df['QC']].groupby('SampleName')['Value'].sum()
df['new'] = df['SampleName'].map(s)
print (df)
  SampleName ProbeName  Value     QC  new
0   sample 1   probe 1      5   True   12
1   sample 1   probe 2      4   True   12
2   sample 1   probe 3      3   True   12
3   sample 1   probe 4      2  False   12
4   sample 1   probe 5      1  False   12
5   sample 2   probe 1      1   True    8
6   sample 2   probe 2      4   True    8
7   sample 2   probe 3      3   True    8
8   sample 2   probe 4      2  False    8
9   sample 2   probe 5      1  False    8

Detail:

print (df[df['QC']])
  SampleName ProbeName  Value    QC
0   sample 1   probe 1      5  True
1   sample 1   probe 2      4  True
2   sample 1   probe 3      3  True
5   sample 2   probe 1      1  True
6   sample 2   probe 2      4  True
7   sample 2   probe 3      3  True

print (s)
SampleName
sample 1    12
sample 2     8
Name: Value, dtype: int64
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252