
I have a string, s="CCCGTGCC", and a substring, ss="CC". I want to get all the indexes in s at which ss starts. In my example I would want to get back the array c(1,2,6).

Is there any string function that achieves this? Notice that my string is in the form "CCCGTGCC", and not c("C","C","C","G","T","G","C","C").

grep only returns whether there is a match anywhere in the string, and not the indexes of the matches within the string, unless I'm missing something.

oguz ismail
dan12345
  • Did you mean array [1, 2, 7] (actually a vector in R)? – Roman Luštrik Oct 24 '11 at 16:53
  • `gregexpr` is the function you are looking for, but the regexp engine "swallows" tokens up, so "CCC" is counted as one "CC" and one "C", though some clever use of regexps may counter this. – James Oct 24 '11 at 17:04
  • A note on your note: `substring("abcde",1:5,1:5)` breaks the string "abcde" into a vector of characters, and `paste(substring("abcde",1:5,1:5),collapse="")` does the opposite. – Qbik Apr 21 '12 at 08:45
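
The `substring`/`paste` round trip from the last comment can be sketched as below. The same vectorised `substring` call can also list every window of width `nchar(ss)`, which gives a regex-free way to find overlapping start positions. This second part is an illustration only, not something proposed in the thread, and it assumes `ss` is no longer than `s`.

```r
s  <- "CCCGTGCC"
ss <- "CC"

# Split the string into single characters, then join them back together.
chars  <- substring(s, 1:nchar(s), 1:nchar(s))
joined <- paste(chars, collapse = "")

# Every candidate start position (assumes nchar(ss) <= nchar(s)).
starts  <- seq_len(nchar(s) - nchar(ss) + 1)
# The window of width nchar(ss) beginning at each start position.
windows <- substring(s, starts, starts + nchar(ss) - 1)
# Positions whose window equals the substring, overlaps included.
hits <- which(windows == ss)
print(hits)
```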

1 Answer

Try gregexpr with perl=TRUE and a Perl-style look-ahead assertion (see ?regex). A look-ahead matches a position without consuming any characters, so overlapping occurrences are all reported:

gregexpr("(?=CC)","CCCGTGCC",perl=TRUE)
[[1]]
[1] 1 2 7
attr(,"match.length")
[1] 0 0 0
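
To get a plain integer vector rather than the list-with-attributes that `gregexpr` returns, the call can be wrapped in a small helper. This is a minimal sketch: the name `find_overlapping` is hypothetical, and it assumes `pattern` is a literal string containing no regex metacharacters.

```r
# Hypothetical helper: start positions of all (possibly overlapping)
# occurrences of `pattern` in the single string `text`.
# Assumes `pattern` contains no regex metacharacters.
find_overlapping <- function(pattern, text) {
  m <- gregexpr(paste0("(?=", pattern, ")"), text, perl = TRUE)[[1]]
  pos <- as.integer(m)  # as.integer() drops the match attributes
  pos[pos > 0]          # gregexpr() reports -1 when nothing matches
}

print(find_overlapping("CC", "CCCGTGCC"))
```

When nothing matches, the helper returns `integer(0)` instead of `-1`.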
Joshua Ulrich
  • Meh, I was stuck, but didn't think of "looking ahead". I wonder why pattern = "CC" doesn't work... – Roman Luštrik Oct 24 '11 at 17:39
  • @RomanLuštrik: see James' comment to the OP. If a match is found, it is removed from the remainder of the string being searched. Notice that the `"match.length"` is zero (it would be 2 if `pattern="CC"`). – Joshua Ulrich Oct 24 '11 at 17:48
  • +1 for the clarifying comment about the `"match.length"` of look-ahead assertions. I'd never considered using that in this way. – Josh O'Brien Oct 24 '11 at 18:34
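
The difference discussed in these comments can be checked side by side. The sketch below runs both patterns on the question's string; the variable names `plain` and `ahead` are illustrative only.

```r
s <- "CCCGTGCC"

# Plain pattern: each match consumes two characters, so the search
# resumes at position 3 and the overlapping match at 2 is skipped.
plain <- gregexpr("CC", s)[[1]]
print(as.integer(plain))             # 1 7
print(attr(plain, "match.length"))   # 2 2

# Look-ahead: each match has zero width, so nothing is consumed
# and every start position is tested.
ahead <- gregexpr("(?=CC)", s, perl = TRUE)[[1]]
print(as.integer(ahead))             # 1 2 7
print(attr(ahead, "match.length"))   # 0 0 0
```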