
I have a single column data frame - example data:

1                          >PROKKA_00002 Alpha-ketoglutarate permease
2        MTESSITERGAPELADTRRRIWAIVGASSGNLVEWFDFYVYSFCSLYFAHIFFPSGNTTT
3        QLLQTAGVFAAGFLMRPIGGWLFGRIADRRGRKTSMLISVCMMCFGSLVIACLPGYAVIG
4                                          >PROKKA_00003 lipoprotein
5       MRTIIVIASLLLTGCSHMANDAWSGQDKAQHFLASAMLSAAGNEYAQHQGYSRDRSAAIG

Each sequence of letters is associated with the ">" line above it. I need a two-column data frame with the lines starting with ">" in the first column and the corresponding lines of letters concatenated into one sequence in the second column. This is what I've tried so far:

 y <- matrix(0,5836,2) #empty matrix with 5836 rows and two columns
 z <- 0
 for(i in 1:nrow(df)){
   if((grepl(pattern = "^>", x = df)) == TRUE){ #tried to set the conditional "if a line starts with ">", execute code"
     z <- z + 1
     y[z,1] <- paste(df[i])
     } else{
     y[z,2] <- paste(df[i], collapse = "")
     }
 }

I would eventually convert the matrix y back to a data.frame using as.data.frame, but my loop keeps getting Error: unexpected '}' in "}". I'm also not sure if my conditional is right. Can anyone help? It would be greatly appreciated!

Jaap
PTrinh
  • This looks like FASTA format. You may check `Biostrings::readDNAStringSet`. See e.g. [here](http://stackoverflow.com/questions/21263636/how-to-read-fasta-into-dataframe-and-extract-subsequences-of-fasta-file-in-r). – Henrik Feb 04 '16 at 21:39
  • Looks like a FASTA file to me. You can use dedicated packages like Biostrings to read FASTA files, or, if you want to write your own reader, you may look into how it is done in other packages – Ananta Feb 04 '16 at 21:39
  • Thank you very much to the both of you! – PTrinh Feb 06 '16 at 01:57

3 Answers


Although I would stick with dedicated packages, here is a base R solution.

initialize data

mydf <- data.frame(x = c(">PROKKA_00002 Alpha-ketoglutarate", "MTESSITERGAPEL", "MTESSITERGAPEL",
                         ">PROKKA_00003 lipoprotein", "MTESSITERGAPEL", "MRTIIVIASLLLT"),
                   stringsAsFactors = FALSE)

process

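ind <- grep("^>", mydf$x)  # row numbers of the header lines
# For each header: the first and last row of its sequence lines
temp <- data.frame(ind = ind, from = ind + 1, to = c((ind - 1)[-1], nrow(mydf)))

seqs <- rep(NA, length(ind))
for (i in seq_along(ind)) {
  # Concatenate all sequence lines that belong to header i
  seqs[i] <- paste(mydf$x[temp$from[i]:temp$to[i]], collapse = "")
}

# Drop the leading ">" from the names and combine them with the sequences
fastatable <- data.frame(name = sub("^>", "", mydf[ind, 1]), sequence = seqs)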


> fastatable
                              name                     sequence
1 PROKKA_00002 Alpha-ketoglutarate MTESSITERGAPELMTESSITERGAPEL
2         PROKKA_00003 lipoprotein  MTESSITERGAPELMRTIIVIASLLLT
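
The same index bookkeeping could be wrapped in a small function so it can be reused on any character vector of lines. This is only a sketch: the name `fasta_lines_to_df` is hypothetical and not from the answer above, and it assumes the first line is a header. It uses seq_len() so that a header with no sequence lines yields an empty string, whereas from:to would count backwards in that case.

```r
# Hypothetical helper (not part of the original answer): turn a character
# vector of FASTA-style lines into a two-column data frame
fasta_lines_to_df <- function(lines) {
  ind  <- grep("^>", lines)               # positions of the header lines
  from <- ind + 1                         # first sequence line of each record
  to   <- c(ind[-1] - 1, length(lines))   # last sequence line of each record
  seqs <- vapply(seq_along(ind), function(i) {
    n <- to[i] - from[i] + 1              # number of sequence lines (may be 0)
    # seq_len(0) is empty, so a header without sequence lines gives ""
    paste(lines[from[i] + seq_len(n) - 1], collapse = "")
  }, character(1))
  data.frame(name = sub("^>", "", lines[ind]), sequence = seqs,
             stringsAsFactors = FALSE)
}

res <- fasta_lines_to_df(c(">a desc", "AB", "CD", ">b", ">c", "EF"))
```

With a one-column data frame such as `mydf`, the call would be `fasta_lines_to_df(as.character(mydf$x))`.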
Ananta

Try creating a logical index of the rows that contain the target symbol, i.e. the header rows. Then split the data on that index. cumsum(ind1) coerces the logical vector to numeric and gives every row the running number of its header, so mydf$x[ind1][cumsum(ind1)] repeats each header for the rows below it; the final [!ind1, ] then drops the header rows themselves.

ind1 <- grepl(">", mydf$x)

#split data on the index created
newdf <- data.frame(mydf$x[ind1][cumsum(ind1)], mydf$x)[!ind1,]

#Add names
names(newdf) <- c("Name", "Value")
newdf
#            Name               Value
# 2 >PROKKA_00002 Alpha-ketoglutarate
# 3 >PROKKA_00002      MTESSITERGAPEL
# 5 >PROKKA_00003         lipoprotein
# 6 >PROKKA_00003       MRTIIVIASLLLT

Data

mydf <- data.frame(x=c(">PROKKA_00002","Alpha-ketoglutarate","MTESSITERGAPEL", ">PROKKA_00003", "lipoprotein"   ,"MRTIIVIASLLLT"))
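
To make the cumsum() index more concrete, here is a minimal sketch of how it could be used to collapse the sequence lines into one row per header, which is the layout the question asks for. The names `lines`, `is_header`, `group`, and `result` are illustrative and not from the answer above, and the sketch assumes that the first line is a header and that every header is followed by at least one sequence line.

```r
# Illustrative input: header lines start with ">", other lines are sequence chunks
lines <- c(">PROKKA_00002 Alpha-ketoglutarate permease",
           "MTESSITERGAPEL",
           "QLLQTAGVFAAGFL",
           ">PROKKA_00003 lipoprotein",
           "MRTIIVIASLLLT")

is_header <- grepl("^>", lines)  # TRUE for header lines only
group <- cumsum(is_header)       # running header count: every line gets its header's number

# Paste the non-header lines of each group into a single sequence
sequences <- tapply(lines[!is_header], group[!is_header], paste, collapse = "")

# as.vector() drops the array attributes that tapply() attaches
result <- data.frame(name = lines[is_header],
                     sequence = as.vector(sequences),
                     stringsAsFactors = FALSE)
```

If a header had no sequence lines, its group would be missing from the tapply() result and the two columns would no longer line up, so that case would need separate handling.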
Pierre L
  • I believe, the expected data frame is different, OP probably expects your columns to go in 1st column of different rows – Ananta Feb 04 '16 at 21:54
  • True. I'm on mobile now, so I can't update for 15 min until I'm at a computer. Feel free to edit in the meantime – Pierre L Feb 04 '16 at 21:56

You can use plyr to accomplish this if you can assign a section number to each row:

library(plyr)
df <- data.frame(v1=c(">PROKKA_00002 Alpha-ketoglutarate permease",
                   "MTESSITERGAPELADTRRRIWAIVGASSGNLVEWFDFYVYSFCSLYFAHIFFPSGNTTT",
                   "QLLQTAGVFAAGFLMRPIGGWLFGRIADRRGRKTSMLISVCMMCFGSLVIACLPGYAVIG",
                   ">PROKKA_00003 lipoprotein",
                   "MRTIIVIASLLLTGCSHMANDAWSGQDKAQHFLASAMLSAAGNEYAQHQGYSRDRSAAIG"))
df$hasMark <- ifelse(grepl(">",df$v1,fixed=TRUE),1, 0)
df$section <- cumsum(df$hasMark)

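t <- ddply(df, "section", function(x){
  # first row of each section is the header; x$v1[-1] are its sequence lines
  # (unlike 2:nrow(x), this also works when a header has no sequence lines)
  data.frame(v2=head(x,1),v3=paste(x$v1[-1], collapse=''))
})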

t <- subset(t, select=-c(section,v2.hasMark,v2.section)) #drop the extra columns

If you then view 't', I believe it is what you were looking for in your original post.

JHowIX