R: check sample against a reference column and conditionally add sample data to the reference dataset

Question

I'm a beginner with R (and with coding in general). I hope to begin and finish an R course in January 14, but I would like to learn something beforehand. I understand the basics and have used functions such as read.table, intersect, cbind, paste and write.table. So far, however, I have only partially achieved what I want to do with two input files (shortened samples):

REF.CSV

SNP,Pos,Mut,Hg  
M522 L16 S138 PF3493 rs9786714,7173143,G->A,IJKLT-M522  
P128 PF5504 rs17250121,20837553,C->T,KLT-M9  
M429 P125 rs17306671,14031334,T->A,IJ-M429  
M170 PF3715 rs2032597,14847792,A->C,I-M170  
M304 Page16 PF4609 rs13447352,22749853,A->C,J-M304  
M172 Page28 PF4908 rs2032604,14969634,T->G,J2-M172  
L228,7771358,C->T,J2-M172  
L212,22711465,T->C,J2a-M410

SAMPLE.CSV

SNP,Chr,Allele1,Allele2  
L16,Y,A,A  
P128,Y,C,C  
M170,Y,A,A  
P123,Y,C,C  
M304,Y,C,C  
M172,Y,T,G  
L212,Y,-0,-0

Description of what I would like to do:

A) Check if SAMPLE.SNP is in REF.SNP  
B) if YES, check the SAMPLE.Allele status (first read, second read) against REF.Mut (Ancestral->Derived)  
  B1) if both Alleles are the same and match Derived, create output "+ Allele1-Allele2"  
  B2) if both Alleles are the same and match Ancestral, create output "- Allele1-Allele2"  
  B3) if the Alleles are not the same, check if Allele2 is Derived and create output "+ Allele1-Allele2"  
  B4) if both Alleles are "-0", create output "? NC"  
  B5) else create output "? Allele1-Allele2"  
B6) if NO, create output "? NA"  
C) Write REF.CSV plus the output as a new column (Sample) to an OUTPUT file
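
The rules B1) to B6) can be sketched as a small function. This is only a sketch under stated assumptions: the name `classify_call` is hypothetical, a SNP that is missing from the sample is assumed to be passed in as `NA`, and the "-0" check is placed first because such a call would otherwise fall through to B5).

```r
# Hypothetical helper: classify one sample call against an "Ancestral->Derived" string.
classify_call <- function(allele1, allele2, mut) {
  if (is.na(allele1) || is.na(allele2)) return("? NA")    # B6: SNP not found in sample
  if (allele1 == "-0" && allele2 == "-0") return("? NC")  # B4: no call

  parts     <- strsplit(mut, "->", fixed = TRUE)[[1]]
  ancestral <- parts[1]
  derived   <- parts[2]
  label     <- paste(allele1, allele2, sep = "-")

  if (allele1 == allele2 && allele1 == derived)   return(paste("+", label))  # B1
  if (allele1 == allele2 && allele1 == ancestral) return(paste("-", label))  # B2
  if (allele1 != allele2 && allele2 == derived)   return(paste("+", label))  # B3
  paste("?", label)                                                          # B5
}

classify_call("A", "A", "G->A")    # "+ A-A"
classify_call("C", "C", "C->T")    # "- C-C"
classify_call("T", "G", "T->G")    # "+ T-G"
classify_call("-0", "-0", "T->C")  # "? NC"
classify_call(NA, NA, "T->A")      # "? NA"
```

The function handles one SNP at a time, so it would have to be applied row by row (for example with `mapply`) once the sample alleles have been lined up with the REF rows.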

OUTPUT.CSV (desired result)

SNP,Pos,Mut,Hg,Sample  
M522 L16 S138 PF3493 rs9786714,7173143,G->A,IJKLT-M522,+ A-A  
P128 PF5504 rs17250121,20837553,C->T,KLT-M9,- C-C  
M429 P125 rs17306671,14031334,T->A,IJ-M429,? NA  
M170 PF3715 rs2032597,14847792,A->C,I-M170,- A-A  
M304 Page16 PF4609 rs13447352,22749853,A->C,J-M304,+ C-C  
M172 Page28 PF4908 rs2032604,14969634,T->G,J2-M172,+ T-G  
L228,7771358,C->T,J2-M172,? NA  
L212,22711465,T->C,J2a-M410,? NC

Functions I have found interesting and what I have tried so far:
Variant 1: A) is done, but I guess it is not possible to write C) with this? I have not tried to code B) here.

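GT <- read.table("SAMPLE.CSV",sep=',',skip=1)[,c(1,3,4)]  
REF <- read.table("REF.CSV",sep=',')  
rownames(GT) <- GT[,1]    # GT needs SNP names as row names to be indexed by COMMON  
rownames(REF) <- REF[,1]  
COMMON <- intersect(rownames(GT),rownames(REF))   # SNP names found in both files  
REF <- REF[COMMON,]  
GT <- GT[COMMON,]  
GT <- cbind(REF,paste(GT[,2],'-',GT[,3],sep=''))   # append "Allele1-Allele2"  
write.table(GT,file='OUTPUT.CSV',sep=',',quote=FALSE,row.names=FALSE,col.names=FALSE)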
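
On whether C) is possible with this approach: `intersect` drops every REF row that has no match, whereas `match` keeps all REF rows and returns `NA` where the sample has no entry. The following is a minimal sketch under stated assumptions: it only handles exact name matches (the synonyms in REF.SNP are not split here), the two-row data frames are shortened stand-ins for the real files, and `out_file` is a hypothetical temporary path.

```r
# Shortened stand-ins for REF.CSV and SAMPLE.CSV (illustrative only).
# colClasses = "character" keeps "-0" as text instead of converting it to a number.
REF <- read.csv(text = "SNP,Pos,Mut,Hg
L228,7771358,C->T,J2-M172
L212,22711465,T->C,J2a-M410", colClasses = "character")

GT <- read.csv(text = "SNP,Chr,Allele1,Allele2
L212,Y,-0,-0
P123,Y,C,C", colClasses = "character")

# For every REF row: the matching row number in GT, or NA if the SNP is absent
idx <- match(REF$SNP, GT$SNP)

# Add the sample call as a new column; all REF rows are kept
REF$Sample <- ifelse(is.na(idx), "NA",
                     paste(GT$Allele1[idx], GT$Allele2[idx], sep = "-"))

# Write REF plus the new column (hypothetical temporary location)
out_file <- file.path(tempdir(), "OUTPUT.CSV")
write.csv(REF, out_file, quote = FALSE, row.names = FALSE)
```

Because the new column is added to REF itself, the output keeps the row order of REF.CSV.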

Variant 2: This is probably a complete mess, forgive me. I was just trying to build a solution for A) and C) with nested for/if loops, but I probably have not understood R's syntax and logic yet, and I could not get my own attempt to run. I have not tried to code B) here.

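GT<-read.table("SAMPLE.CSV",sep=',',skip=1,stringsAsFactors=FALSE)[,c(1,3,4)]
rownames(GT)<-GT[,1]
REF <- read.table("REF.CSV",sep=',',skip=1,stringsAsFactors=FALSE)
rownames(REF)<-REF[,1]
# Start with "NA" for every REF row; a match overwrites it below
ROWC<-cbind(REF,Sample="NA",stringsAsFactors=FALSE)
for (i in seq_len(nrow(REF))) {      # seq_len() visits every row; nrow() alone is a single number
   for (j in seq_len(nrow(GT))) {
       # compare the sample SNP name with each space-separated synonym in REF
       if (GT[j,1] %in% strsplit(REF[i,1]," ")[[1]]) {
       ROWC[i,"Sample"]<-paste(GT[j,2],"-",GT[j,3],sep='')
       }
   }
}
write.table(ROWC,file='OUTPUT.CSV',sep=',',quote=FALSE,row.names=FALSE,col.names=FALSE)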

I would be happy if you could just indicate which logic/functions would lead to the result I have described. I will then try to figure out the rest myself. Thanks.

What have you tried so far? This site is dedicated to solving your problems, not doing your work. — Roman Luštrik, Dec 17 '13 at 08:29
I sincerely would prefer to start with some code. But since I don't think I have an idea how to reach what I want, I have not listed some of my failed examples. The best try: `GT<-read.table(SAMPLE.CSV,sep=',',skip=1)[,c(1,3,4)] REF <- read.table(REF.CSV,sep=',') rownames(REF)<-REF[,1] COMMON<-intersect(rownames(GT),rownames(REF)) REF <- REF[COMMON,] GT<-GT[COMMON,] GT<-cbind(REF,paste(GT[,2],'-',X[,3],sep='')) write.table(GT[order(GT[,1]),],file='OUTPUT.CSV',quote=F,row.names=F,col.names=F)` — ChrisR, Dec 17 '13 at 08:56
Please add this to your question. Additionally, it would help if you made your problem reproducible. http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example — Roman Luštrik, Dec 17 '13 at 12:09
Thanks, I will try to reformulate my question. I don't request for a completely coded answer, I would be just happy to receive an indication what logic/functions should be used to reach my goal. — ChrisR, Dec 17 '13 at 13:30
You may be looking for `merge` to match up the data sets. After that I would use `ifelse` or something similar to create the columns that you're looking for. It's unclear to me why REF.csv has more information in the first column (e.g. `M522 L16 S138 PF3493 rs9786714` instead of just `L16` for the first row) than is necessary in the merge. — Blue Magister, Dec 25 '13 at 22:31
Thanks BM for the hints. REF.csv has more information, because those names are synonyms and SAMPLE.CSV just uses one of them (not always the same). I think I will try to finish this example during the R course. — ChrisR, Jan 09 '14 at 02:32
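
The `merge`/`ifelse` hint, combined with the fact that REF.SNP holds space-separated synonyms, could look roughly like the sketch below. It is one possible reading of the task rather than a definitive solution: the names `long`, `ref_row` and `hit` are hypothetical, the data is the shortened sample from the question, and if several synonyms of one REF row were present in the sample, only the first match would be used.

```r
# Shortened sample data from the question; colClasses keeps "-0" as text.
REF <- read.csv(text = "SNP,Pos,Mut,Hg
M522 L16 S138 PF3493 rs9786714,7173143,G->A,IJKLT-M522
P128 PF5504 rs17250121,20837553,C->T,KLT-M9
M429 P125 rs17306671,14031334,T->A,IJ-M429
M170 PF3715 rs2032597,14847792,A->C,I-M170
M304 Page16 PF4609 rs13447352,22749853,A->C,J-M304
M172 Page28 PF4908 rs2032604,14969634,T->G,J2-M172
L228,7771358,C->T,J2-M172
L212,22711465,T->C,J2a-M410", colClasses = "character")

GT <- read.csv(text = "SNP,Chr,Allele1,Allele2
L16,Y,A,A
P128,Y,C,C
M170,Y,A,A
P123,Y,C,C
M304,Y,C,C
M172,Y,T,G
L212,Y,-0,-0", colClasses = "character")

# One row per synonym, remembering which REF row it came from
syn  <- strsplit(REF$SNP, " ", fixed = TRUE)
long <- data.frame(ref_row = rep(seq_along(syn), lengths(syn)),
                   SNP = unlist(syn), stringsAsFactors = FALSE)

# Inner join on the single SNP name; merge() sorts by key, so ref_row restores the link
m   <- merge(long, GT, by = "SNP")
hit <- match(seq_len(nrow(REF)), m$ref_row)   # NA where no synonym was found

a1        <- m$Allele1[hit]
a2        <- m$Allele2[hit]
ancestral <- sub("->.*", "", REF$Mut)
derived   <- sub(".*->", "", REF$Mut)
label     <- paste(a1, a2, sep = "-")

# Vectorised version of rules B1) to B6); "+" whenever Allele2 is derived covers B1 and B3
REF$Sample <- ifelse(is.na(a1), "? NA",
              ifelse(a1 == "-0" & a2 == "-0", "? NC",
              ifelse(a2 == derived, paste("+", label),
              ifelse(a1 == a2 & a1 == ancestral, paste("-", label),
                     paste("?", label)))))

REF[, c("SNP", "Sample")]
```

With the sample data this gives the Sample column of the desired OUTPUT.CSV; writing the file would then only need a call such as `write.csv(REF, "OUTPUT.CSV", quote = FALSE, row.names = FALSE)`.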
