I have a data frame (df) with two columns. The first (df$Pos) holds a gene coordinate; the second (df$Seq) holds the gene sequence. There are about 20,000 rows, which look like:
Pos Seq
chr1:1124-12324324 ggctgggtgcagtggctcatgcctgtaattc
ggtcagaagttcgagaccagcctggccaacattgt
gaaaccctgtctctactaaaaatac
chr2:767:78932 ggctgggtgcagtggctcatgcctgtaattc
ggtcagaagttcgagaccagcctggccaacattgt
gaaaccctgtctctactaaaaatac
etc
(The sequence is continuous in one row for each Pos; I just couldn't format it as such here.)
I'm looking for a particular stretch of sequence within all of this, such as:
ggccaaggcgta
I would like the result to simply be a data frame with the Pos and Seq of the rows that match the pattern.
I tried:
dfMatch <- as.data.frame(df[grep("ggtcaggagttcgagaccag",df$V2), ])
but R stalls for ages and then doesn't return any matches. I know the pattern matches because I've tried it in a text editor and I get about 6,000 rows. I suppose R isn't ideally set up for this kind of heavy pattern matching, so I was hoping to call Perl from R, but I don't know how to return the result as a data frame with the Pos column and the sequences that match.
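To make the desired output concrete, here is a minimal sketch of the kind of subsetting I have in mind. It assumes the columns really are named Pos and Seq as described above (my attempt refers to df$V2, which may not be the same column), and the toy rows and the query string are made up for illustration. Passing fixed = TRUE asks grepl to treat the pattern as a literal string rather than a regular expression.

```r
# Toy data standing in for the real ~20,000-row data frame (values are made up).
df <- data.frame(
  Pos = c("chr1:100-200", "chr2:300-400", "chr3:500-600"),
  Seq = c("aaggtcagaagttcgagaccagtt", "ccccggggaaaatttt", "ggtcagaagttcgagaccag"),
  stringsAsFactors = FALSE
)

# Hypothetical query string, chosen so that it occurs in the toy data.
pattern <- "ggtcagaagttcgagaccag"

# fixed = TRUE: literal substring search, no regex interpretation.
# as.character() guards against Seq having been read in as a factor.
hits <- grepl(pattern, as.character(df$Seq), fixed = TRUE)

# drop = FALSE keeps the result a data frame; only matching rows are kept.
dfMatch <- df[hits, c("Pos", "Seq"), drop = FALSE]
print(dfMatch)
```

On the toy data this keeps the first and third rows, which is the shape of result I would like from the real data.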