I have an imported template: a table of probe names, each assigned True or False depending on whether the probe is to be used for QC purposes later. For example:
ProbeName  QC
probe 1    True
probe 2    True
probe 3    True
probe 4    False
probe 5    False
Secondly, I have an imported list of samples, which includes the probe names (used as the merge key) and a value for each probe.
The second import looks like this:
SampleName  ProbeName  Value
sample 1    probe 1    0
sample 1    probe 2    0
sample 1    probe 3    0
sample 1    probe 4    0
sample 1    probe 5    0
sample 2    probe 1    0
sample 2    probe 2    0
sample 2    probe 3    0
sample 2    probe 4    0
sample 2    probe 5    0
Merged together, the data currently looks like this:
SampleName  ProbeName  Value  QC
sample 1    probe 1    0      True
sample 1    probe 2    0      True
sample 1    probe 3    0      True
sample 1    probe 4    0      False
sample 1    probe 5    0      False
sample 2    probe 1    0      True
sample 2    probe 2    0      True
sample 2    probe 3    0      True
sample 2    probe 4    0      False
sample 2    probe 5    0      False
etc...
The index defaults to the row number. I did this with the following code:
import pandas as pd
template = pd.read_csv("Template.txt", sep='\t')  # import the probe template (ProbeName, QC)
datain = pd.read_csv("Data.txt", sep='\t')  # import the sample data (SampleName, ProbeName, Value)
data = pd.merge(datain, template, how='left', on='ProbeName')  # add the QC flag to each sample row
I tried to make the sample name the index, but for some reason when I called data.values I could still see the numbered index, and the sample name was no longer associated with the rows.
The reason I have a template and merge the data in is that I have separate files exported from a genomic analyser and want to use this raw output as the main input for my program. The template adds the True/False data to the relevant probes, which lets me create and import different probe lists with different QC probes depending on the test analysed.
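A minimal sketch of one possible cause of the index problem, assuming (the attempt itself is not shown) that set_index was called without keeping its result. set_index returns a new DataFrame rather than changing the original, and .values is a plain NumPy array that never carries index labels. The small table and the names indexed, raw, sample_1 and value are hypothetical stand-ins, not the real data.

```python
import pandas as pd

# Hypothetical stand-in for the merged table (not the real analyser output)
data = pd.DataFrame({
    "SampleName": ["sample 1", "sample 1", "sample 2", "sample 2"],
    "ProbeName": ["probe 1", "probe 4", "probe 1", "probe 4"],
    "Value": [0, 0, 1, 1],
    "QC": [True, False, True, False],
})

# set_index returns a new DataFrame, so the result has to be kept
indexed = data.set_index(["SampleName", "ProbeName"])

# .values is a bare NumPy array: the index labels are not part of it
raw = indexed.values

# Label-based lookup keeps sample, probe and value tied together
sample_1 = indexed.loc["sample 1"]                      # all probes of one sample
value = indexed.loc[("sample 2", "probe 4"), "Value"]  # one probe of one sample
```

With the labels in the index, rows are addressed through .loc instead of through positions in .values.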
Ideally, I want to iterate over each sample and then over each probe and its value. For example: for all samples, what is the sum of the values of the probes marked as True? Something like this:
SampleName  ProbeName  Value
sample 1    probe 1    0
            probe 2    0
            probe 3    0
            probe 4    0
            probe 5    0
sample 2    probe 1    1
            probe 2    1
            probe 3    1
            probe 4    1
            probe 5    1
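The per-sample sum of QC probes could be sketched with a boolean filter followed by groupby, which avoids writing the loops by hand. This assumes the QC column was parsed as real booleans rather than the strings "True"/"False"; the values and the names qc_sums and per_sample are made up for illustration.

```python
import pandas as pd

# Hypothetical merged data with made-up values
data = pd.DataFrame({
    "SampleName": ["sample 1"] * 3 + ["sample 2"] * 3,
    "ProbeName": ["probe 1", "probe 2", "probe 4"] * 2,
    "Value": [2, 3, 5, 1, 1, 1],
    "QC": [True, True, False] * 2,
})

# Keep only the rows flagged True, then sum Value within each sample
qc_sums = data[data["QC"]].groupby("SampleName")["Value"].sum()

# Explicit iteration is still possible when it is really needed:
# the outer loop walks the samples, the inner pairs each probe with its value
per_sample = {}
for sample_name, rows in data.groupby("SampleName"):
    per_sample[sample_name] = dict(zip(rows["ProbeName"], rows["Value"]))
```

Here qc_sums is a Series indexed by sample name, so qc_sums["sample 1"] gives the total for that sample.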
I would then like to use the individual probe values for each sample in later calculations. What is the most efficient way of doing this?
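For later calculations on individual probes, one option might be to pivot the long table into a wide one, with one row per sample and one column per probe. This is only a sketch: it assumes each (SampleName, ProbeName) pair occurs exactly once, since pivot raises an error on duplicates, and the values and the names wide, ratio and single are hypothetical.

```python
import pandas as pd

# Hypothetical long-format data with made-up values
data = pd.DataFrame({
    "SampleName": ["sample 1", "sample 1", "sample 2", "sample 2"],
    "ProbeName": ["probe 1", "probe 2", "probe 1", "probe 2"],
    "Value": [2, 4, 3, 6],
})

# One row per sample, one column per probe
wide = data.pivot(index="SampleName", columns="ProbeName", values="Value")

# Column arithmetic then applies to every sample at once
ratio = wide["probe 1"] / wide["probe 2"]

# A single probe value for a single sample
single = wide.loc["sample 1", "probe 2"]
```

The wide layout trades the QC column for direct access by probe name, so the True/False flags would need to be applied before pivoting or looked up from the template separately.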
If anyone can give me a rough idea of what I should do, or tell me whether I am on the right track, that would be very much appreciated.
Thank you for reading.