
I have a dataframe (df) with two columns. The first (df$Pos) holds a gene coordinate and the second (df$Seq) holds the gene sequence. There are about 20,000 rows, which look like:

Pos                          Seq
chr1:1124-12324324           ggctgggtgcagtggctcatgcctgtaattc
                             ggtcagaagttcgagaccagcctggccaacattgt
                             gaaaccctgtctctactaaaaatac
chr2:767:78932               ggctgggtgcagtggctcatgcctgtaattc
                             ggtcagaagttcgagaccagcctggccaacattgt
                             gaaaccctgtctctactaaaaatac

etc

(The sequence is continuous in one row for each Pos; I just couldn't format it as such here.)

I'm looking for a particular stretch of sequence among all of this, such as:

      ggccaaggcgta

I would like the result to simply be a dataframe with the Pos and Seq of the rows that match the pattern.

I tried

dfMatch <- as.data.frame(df[grep("ggtcaggagttcgagaccag",df$V2), ])

but R stalls for ages and then doesn't return any matches. I know the pattern matches because I've tried it in a text editor and I get about 6000 rows. I suppose R isn't ideally set up for this kind of heavy pattern matching, so I was hoping to call Perl from R, but I don't know how to return the result as a dataframe with the Pos column and the sequences that match.

Sebastian Zeki
    When I create a 20000 x 2 data set using data like what you show, and then run the `grep` example you show, I get a result basically instantly. Perhaps there is something else going on with your data that you haven't shared...? – joran Apr 03 '15 at 21:31
    It should be very very fast. My usual time of heavy `grep`/`gsub` is roughly ~20 seconds for 100k lines and 7000 patterns to check (i.e. some 3 million rows x pattern per second rule). For time vs pattern length (for `gsub` not `grep` though), please also see: http://stackoverflow.com/questions/27534296/gsub-speed-vs-pattern-length – Alexey Ferapontov Apr 03 '15 at 21:42

1 Answer


I used stringsAsFactors = FALSE when loading that data and assigned it to a variable named 'dat' (since "df" is a function name in R):

dat <-
structure(list(Pos = c("chr1:1124-12324324", "chr2:767:78932"
), Seq = c("ggctgggtgcagtggctcatgcctgtaattcggtcagaagttcgagaccagcctggccaacattgtgaaaccctgtctctactaaaaatac", 
"ggctgggtgcagtggctcatgcctgtaattcggtcagaagttcgagaccagcctggccaacattgtgaaaccctgtctctactaaaaatac"
)), .Names = c("Pos", "Seq"), class = "data.frame", row.names = c(NA, 
-2L))

I also used grepl instead of grep. The column is named "Seq", so I changed your reference to "V2", which does not exist in this data, to "Seq".

dfMatch <- as.data.frame(dat[grepl("ggtcaggagttcgagaccag",dat$Seq), ])

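I didn't get any hits on that test set (the sample sequences contain "ggtcagaagttcgagaccag", which differs from your pattern by one base), but I did when I searched for a small section extracted from the middle of the first sequence: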

> str(dfMatch)
'data.frame':   0 obs. of  2 variables:
 $ Pos: chr 
 $ Seq: chr 

> dfMatch <- as.data.frame(dat[grepl("gcctgtaatt",dat$Seq), ])
> str(dfMatch)
'data.frame':   2 obs. of  2 variables:
 $ Pos: chr  "chr1:1124-12324324" "chr2:767:78932"
 $ Seq: chr  "ggctgggtgcagtggctcatgcctgtaattcggtcagaagttcgagaccagcctggccaacattgtgaaaccctgtctctactaaaaatac" "ggctgggtgcagtggctcatgcctgtaattcggtcagaagttcgagaccagcctggccaacattgtgaaaccctgtctctactaaaaatac"
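
One possible reason for getting zero rows back is worth spelling out: asking a data frame for a column that does not exist returns NULL, and grep/grepl on NULL match nothing. The following is only a minimal sketch of that behaviour and of a guarded literal search. The data frame `toy`, its sequences, and the helper `find_motif` are made up for illustration and are not part of the original data; `fixed = TRUE` is an assumption that the motif is a plain string rather than a regular expression.

```r
# Toy data (hypothetical): same two-column layout as 'dat', made-up sequences.
toy <- data.frame(
  Pos = c("chr1:1-100", "chr2:1-100", "chr3:1-100"),
  Seq = c("aaggtcagaagttcgagaccagtt", "ccccgggg", "ggtcagaagttcgagaccag"),
  stringsAsFactors = FALSE
)

# A column that does not exist comes back as NULL, so grepl() has nothing
# to test and the row subset is empty.
is.null(toy$V2)
nrow(toy[grepl("ggtcag", toy$V2), ])

# Hypothetical helper: check the column name first, then match the motif
# as a literal string (fixed = TRUE skips regular-expression interpretation).
find_motif <- function(data, motif, column = "Seq") {
  stopifnot(column %in% names(data))
  data[grepl(motif, data[[column]], fixed = TRUE), , drop = FALSE]
}

hits <- find_motif(toy, "ggtcagaagttcgagaccag")
hits$Pos
```

With this toy data the helper keeps the first and third rows, and the result is still a data frame with both the Pos and Seq columns.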
IRTFM