
I am trying to loop over a data.frame in R that contains proteins and all of their subregions. Rows are grouped by geneID: the first occurrence of a geneID is always the whole protein, and the following occurrences are its subregions. I want to align each subregion against the whole protein to determine its start and stop positions, and then add those positions back to the data.frame. The data looks like this:

https://i.stack.imgur.com/tGPok.jpg

The code I am working on looks like this. The problem is that it gets stuck on the first iteration, and I am not sure what I am doing wrong:

  for (i in 1:length(keyplayers$geneid)) {
    id <- keyplayers$geneid[[i]]
    a <- i + 1
    while (keyplayers$geneid[[a]] == keyplayers$geneid[[i]]) {
      pat <- matchPattern(keyplayers$sequence[[a]], keyplayers$sequence[[i]])
      keyplayers$start[a] <- start(pat)
      keyplayers$end[a] <- end(pat)
    }
  }
  • You should try to post a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Provide sample input (not just a picture of the data) and give the desired output for that input. List all packages you are using (`matchPattern` isn't a base R function). – MrFlick May 12 '17 at 13:43
  • Sorry about that! I'll update ASAP! – Hakim Elakhrass May 12 '17 at 13:56

1 Answer


This is my suggestion using `ddply` (from the plyr package). It assumes that the longest peptide for each gene_id is the full-length protein:

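library(plyr)  # ddply() lives in the plyr package
df$Len <- nchar(df$sequence)
# For each gene, locate every sequence within the longest one.
# fixed = TRUE matches the sequence literally instead of as a regex.
# The result must be assigned back, otherwise df$Start does not exist.
df <- ddply(df, .(gene_id), transform,
            Start = sapply(sequence, function(x) regexpr(x, sequence[which.max(Len)], fixed = TRUE), USE.NAMES = FALSE))
df$End <- df$Start + df$Len - 1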

DATA

df <- read.table(text="
gene_id, sequence
p53,MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAK
p53,SQETFSDLWKLLPENNVLSPLPSQAMDDLMLS
p53,APRMPEAAPPVAPAPAAPTPAAPAPAPSWP
mdm2,MCNTNMSVPTDGAVTTSQIPASEQETLVRPKPLLLKLLKSVGAQKDTYTMKEVLFYLGQYIMTKRLYDEKQQHIVYCSNDLLGDLFGVPSFSVKEHRKIYTMIYRNLVVVNQQESSDSGT
mdm2,MCNTNMSVPTDGAVTTSQIPASEQE
mdm2,QKDTYTMKEVLFYLGQYIMTKRLYDEKQQHIVYCSNDLLGDLFG", header=T, sep=',', stringsAsFactors=F)
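
For comparison, here is a minimal base R sketch that follows the question's own assumption instead (the first row of each geneid is the whole protein). The loop in the question hangs because `a` is never incremented inside the `while`, so its condition never changes; the version below avoids explicit index bookkeeping altogether. The toy sequences and the helper name `first_row` are made up for illustration, and `regexpr(..., fixed = TRUE)` stands in for `matchPattern`, so treat this as one possible approach rather than a drop-in replacement.

```r
# Toy data (hypothetical): the first row of each gene is the whole protein.
keyplayers <- data.frame(
  geneid   = c("g1", "g1", "g1", "g2", "g2"),
  sequence = c("MEEPQSDPSV", "EPQS", "DPSV", "MCNTNMSVPT", "NTNM"),
  stringsAsFactors = FALSE
)

# For every row, the index of the first row that has the same geneid.
first_row <- match(keyplayers$geneid, keyplayers$geneid)

# Literal (non-regex) position of each sequence inside its gene's first sequence.
keyplayers$start <- mapply(
  function(sub, whole) as.integer(regexpr(sub, whole, fixed = TRUE)),
  keyplayers$sequence,
  keyplayers$sequence[first_row],
  USE.NAMES = FALSE
)

# regexpr() returns -1 when there is no match; record that as NA instead.
keyplayers$start[keyplayers$start == -1L] <- NA_integer_
keyplayers$end <- keyplayers$start + nchar(keyplayers$sequence) - 1L
```

Only the first match is reported, so a subregion that occurs more than once in the whole protein would need extra handling (for example with `gregexpr`).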
Osdorp