I'm a beginner with R (and with coding in general). In January 14 I hope to start and finish an R course, but I would like to learn something beforehand. I understand the basics and have used functions like read.table, intersect, cbind, paste and write.table. But I was only able to achieve part of what I want to do with two input files (shortened samples):
REF.CSV
SNP,Pos,Mut,Hg
M522 L16 S138 PF3493 rs9786714,7173143,G->A,IJKLT-M522
P128 PF5504 rs17250121,20837553,C->T,KLT-M9
M429 P125 rs17306671,14031334,T->A,IJ-M429
M170 PF3715 rs2032597,14847792,A->C,I-M170
M304 Page16 PF4609 rs13447352,22749853,A->C,J-M304
M172 Page28 PF4908 rs2032604,14969634,T->G,J2-M172
L228,7771358,C->T,J2-M172
L212,22711465,T->C,J2a-M410
SAMPLE.CSV
SNP,Chr,Allele1,Allele2
L16,Y,A,A
P128,Y,C,C
M170,Y,A,A
P123,Y,C,C
M304,Y,C,C
M172,Y,T,G
L212,Y,-0,-0
Description of what I would like to do:
A) Check if SAMPLE.SNP is in REF.SNP
B) If YES, check the SAMPLE.Allele status (first read, second read) against REF.Mut (Ancestral->Derived)
B1) if both Alleles are the same and match Derived, create output "+ Allele1-Allele2"
B2) if both Alleles are the same and match Ancestral, create output "- Allele1-Allele2"
B3) if the Alleles are not the same, check if Allele2 is Derived and create output "+ Allele1-Allele2"
B4) if both Alleles are "-0", create output "? NC"
B5) else create output "? Allele1-Allele2"
B6) if NO at A), create output "? NA"
C) Write REF.CSV plus the output as a new column (Sample) and create the OUTPUT file
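Rules B1) to B6) can be sketched as a single R function. This is only a minimal sketch under assumptions: call_status is a hypothetical name, and a SNP that is missing from the sample is signalled by passing NA for both alleles.

```r
# Hypothetical helper: classify one sample call against a REF.Mut string
# such as "G->A" (Ancestral->Derived), following rules B1) to B6).
call_status <- function(allele1, allele2, mut) {
  # B6: SNP not found in the sample (assumed to arrive as NA alleles)
  if (is.na(allele1) || is.na(allele2)) return("? NA")
  # B4: no-call on both reads
  if (allele1 == "-0" && allele2 == "-0") return("? NC")

  # Split "G->A" into its ancestral and derived parts
  parts <- strsplit(mut, "->", fixed = TRUE)[[1]]
  ancestral <- parts[1]
  derived <- parts[2]
  calls <- paste(allele1, allele2, sep = "-")

  if (allele1 == allele2 && allele1 == derived) return(paste("+", calls))    # B1
  if (allele1 == allele2 && allele1 == ancestral) return(paste("-", calls))  # B2
  if (allele1 != allele2 && allele2 == derived) return(paste("+", calls))    # B3
  paste("?", calls)                                                           # B5
}

call_status("A", "A", "G->A")    # "+ A-A"
call_status("C", "C", "C->T")    # "- C-C"
call_status("T", "G", "T->G")    # "+ T-G"
call_status("-0", "-0", "T->C")  # "? NC"
call_status(NA, NA, "T->A")      # "? NA"
```

The function handles one row at a time, so it would need to be applied row by row, for example with mapply() over the allele and Mut columns.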
OUTPUT.CSV (desired result)
SNP,Pos,Mut,Hg,Sample
M522 L16 S138 PF3493 rs9786714,7173143,G->A,IJKLT-M522,+ A-A
P128 PF5504 rs17250121,20837553,C->T,KLT-M9,- C-C
M429 P125 rs17306671,14031334,T->A,IJ-M429,? NA
M170 PF3715 rs2032597,14847792,A->C,I-M170,- A-A
M304 Page16 PF4609 rs13447352,22749853,A->C,J-M304,+ C-C
M172 Page28 PF4908 rs2032604,14969634,T->G,J2-M172,+ T-G
L228,7771358,C->T,J2-M172,? NA
L212,22711465,T->C,J2a-M410,? NC
Functions I have found interesting and tried so far:
Variant1: A) is done, but I guess it is not possible to do C) with this?
I have not tried to code B) here.
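# Read both files with their header rows; keep SNP, Allele1, Allele2 from the sample
GT <- read.table("SAMPLE.CSV",sep=',',header=TRUE,stringsAsFactors=FALSE)[,c(1,3,4)]
REF <- read.table("REF.CSV",sep=',',header=TRUE,stringsAsFactors=FALSE)
rownames(GT) <- GT[,1]
rownames(REF) <- REF[,1]
# intersect() only finds a REF row whose whole SNP field equals the sample SNP name
COMMON <- intersect(rownames(GT),rownames(REF))
REF <- REF[COMMON,]
GT <- GT[COMMON,]
GT <- cbind(REF,paste(GT[,2],'-',GT[,3],sep=''))
write.table(GT,file='OUTPUT.CSV',sep=',',quote=F,row.names=F,col.names=F)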
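One possible way to keep every REF row (needed for C) is to look up a matching sample row per REF row instead of subsetting with intersect(). The sketch below assumes that the names inside a REF.SNP field are separated by single spaces; ref, gt, aliases, hit and out are hypothetical names, and the data is a shortened inline copy of the two files. The +/-/? logic of B) is left out here.

```r
# Inline copies of a few rows so the sketch is self-contained.
# colClasses = "character" keeps alleles such as "T" from being read as TRUE.
ref <- read.csv(text = "SNP,Pos,Mut,Hg
M522 L16 S138 PF3493 rs9786714,7173143,G->A,IJKLT-M522
M429 P125 rs17306671,14031334,T->A,IJ-M429
L212,22711465,T->C,J2a-M410", colClasses = "character")

gt <- read.csv(text = "SNP,Chr,Allele1,Allele2
L16,Y,A,A
P123,Y,C,C
L212,Y,-0,-0", colClasses = "character")

# Split each REF.SNP field into its individual names
aliases <- strsplit(ref$SNP, " ", fixed = TRUE)

# For each REF row: index of the first sample row whose SNP is one of the
# names, or NA when the sample has none of them
hit <- vapply(aliases, function(a) match(TRUE, gt$SNP %in% a), integer(1))

# All REF rows are kept; unmatched rows get "? NA"
out <- ref
out$Sample <- ifelse(is.na(hit), "? NA",
                     paste(gt$Allele1[hit], gt$Allele2[hit], sep = "-"))

# Prints to the console; add file = "OUTPUT.CSV" as the second argument to write a file
write.csv(out, quote = FALSE, row.names = FALSE)
```

Because hit has one entry per REF row, the new column lines up with REF without any reordering.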
Variant2: This is probably a complete mess, forgive me. I was just trying to build a solution with for/if looping, but I probably haven't understood R's syntax and logic. I was not able to get this to run for A) and C). I have not tried to code B) here.
GT<-read.table("SAMPLE.CSV",sep=',',skip=1)[,c(1,3,4)]
rownames(GT)<-GT[,1]
REF <- read.table("REF.CSV",sep=',')
rownames(REF)<-REF[,1]
for (i in (nrow(REF))) {
  for (j in (nrow(GT))) {
    if (GT[j,] %in% REF[i,]) {
      ROWC[i,]<-cbind(REF[i,],paste(GT[j,2],"-",GT[j,3],sep=','))
    } else {
      ROWC[i,]<-cbind(REF[i,],"NA",sep=',')
    }
  }
}
write.table(ROWC,file='OUTPUT.CSV',quote=F,row.names=F,col.names=F)
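A small sketch of two loop details that seem relevant to Variant2: for (i in (nrow(REF))) loops over the single number nrow(REF), not over 1 to nrow(REF), and a result object such as ROWC has to exist before elements are assigned into it. The toy REF data frame and the names visited_once and visited_all are hypothetical, used only for illustration. Note also that an if condition should be a single TRUE or FALSE, which GT[j,] %in% REF[i,] is not.

```r
# Toy stand-in for REF with three rows
REF <- data.frame(SNP = c("L16", "P128", "M429"), stringsAsFactors = FALSE)

# (nrow(REF)) is just the number 3, so this loop body runs once with i = 3
visited_once <- c()
for (i in (nrow(REF))) visited_once <- c(visited_once, i)

# seq_len(nrow(REF)) is 1, 2, 3, so every row is visited
visited_all <- c()
for (i in seq_len(nrow(REF))) visited_all <- c(visited_all, i)

# Create the result before the loop, then fill one element per row
ROWC <- character(nrow(REF))
for (i in seq_len(nrow(REF))) {
  ROWC[i] <- paste(REF$SNP[i], "row", i)
}

print(visited_once)
print(visited_all)
print(ROWC)
```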
I would be happy if you could just indicate what logic/functions would lead to solving the task I have described. I will then try to figure it out. Thanks.