
I would like to find the start indices of every match of a pattern in a string. For example, given the string x <- "1110001101" and the pattern "11", the result should be c(1, 2, 7). However, I can't get the 2 with either method below.

  • Method 1: Use gregexpr

    x
    [1] "1110001101"
    
    gregexpr(pattern = "11", x)
    [[1]]
    [1] 1 7 # Why isn't there a 2?
    attr(,"match.length")
    [1] 2 2
    attr(,"useBytes")
    [1] TRUE
    
  • Method 2: Use str_locate_all from package stringr

    library(stringr)
    str_locate_all(pattern = "11", x)
    [[1]]
         start end
    [1,]     1   2
    [2,]     7   8 # Why still isn't there a 2?
    

Am I missing some subtle argument for these functions? Thanks for your suggestions!

oguz ismail
Oliver
  • Nice question! Just FYI, the documentation for `?gregexpr` says "`gregexpr` returns a list of the same length as text each element of which is of the same form as the return value for `regexpr`, except that the starting positions of every **(disjoint)** match are given." (my emphasis) So the behaviour is really just the implementation of `gregexpr`, not you misunderstanding regular expressions. – Hugh Jan 13 '18 at 13:02

1 Answer


We can use a regex lookaround, specifically a positive lookahead. The pattern "(?=11)" matches the zero-width position immediately before each "11" without consuming any characters, so gregexpr reports the start of every occurrence, including overlapping ones:

as.integer(gregexpr("(?=11)", x, perl = TRUE)[[1]])
#[1] 1 2 7

Or with str_locate_all, using either a regex lookbehind (it matches just after each "11", so take the end column and subtract 1 to recover the start)

stringr::str_locate_all(x, "(?<=11)")[[1]][,2]-1
#[1] 1 2 7

Or a regex lookahead

stringr::str_locate_all(x, "(?=11)")[[1]][,1]
#[1] 1 2 7

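The difference between this approach and the OP's is that, with the OP's approach, once a match is made the matched characters are consumed and the search resumes after them, so only disjoint matches are reported. A lookahead consumes nothing, so every starting position is tested. This is easier to see with another string: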

x1 <- "11110001101"
str_locate_all(pattern = "11", x1)
#[[1]]
#      start end
#[1,]     1   2
#[2,]     3   4
#[3,]     8   9
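
To make the skipping concrete, here is a rough sketch of a scanner whose step after each hit is a parameter. It is only an illustration, not how `gregexpr` or `str_locate_all` are implemented internally; `scan_starts` and its `step` argument are hypothetical names introduced here, and the pattern is treated as a fixed string.

```r
# Hypothetical helper: find the starts of a fixed string, resuming the
# search `step` characters after the start of each match.
scan_starts <- function(x, pattern, step) {
  starts <- integer(0)
  pos <- 1L
  while (pos <= nchar(x)) {
    # Search only the part of the string from `pos` onwards
    m <- regexpr(pattern, substring(x, pos), fixed = TRUE)
    if (m == -1L) break                 # no further match
    start <- pos + as.integer(m) - 1L   # convert back to an index into x
    starts <- c(starts, start)
    pos <- start + step
  }
  starts
}

x1 <- "11110001101"
scan_starts(x1, "11", step = 2L)  # resume after the whole match: 1 3 8
scan_starts(x1, "11", step = 1L)  # resume one character later:   1 2 3 8
```

Stepping by the match length reproduces the disjoint matches above; stepping by one character gives the overlapping ones.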

With a regex lookahead, there are 4 matches:

as.integer(gregexpr("(?=11)", x1, perl = TRUE)[[1]])
#[1] 1 2 3 8
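
If this is needed in more than one place, the lookahead call could be wrapped in a small function. The following is a minimal sketch, assuming the pattern is a valid PCRE fragment that can be pasted into the lookahead unescaped; `overlapping_starts` is a hypothetical name, not part of base R or stringr. The last lines cross-check the result for a fixed two-character string with a regex-free `substring()` comparison.

```r
# Hypothetical wrapper: start positions of all (possibly overlapping) matches.
# `pattern` is treated as a PCRE regex and inserted unescaped.
overlapping_starts <- function(x, pattern) {
  m <- gregexpr(paste0("(?=", pattern, ")"), x, perl = TRUE)[[1]]
  as.integer(m[m > 0])  # gregexpr returns -1 when nothing matches
}

overlapping_starts("1110001101", "11")   # 1 2 7
overlapping_starts("11110001101", "11")  # 1 2 3 8
overlapping_starts("000", "11")          # integer(0)

# Regex-free cross-check for the fixed string "11":
# compare every length-2 substring against the target.
x <- "1110001101"
i <- seq_len(nchar(x) - 1L)
which(substring(x, i, i + 1L) == "11")   # 1 2 7
```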
akrun
  • Really cool, but I still don't know why my code can't show index `2`? – Oliver Jan 13 '18 at 12:27
  • @WenhuCao The 1st and 2nd positions got matched and you got the 'start', 'end' for that. Then, it looks for matches from the 3rd position skipping the second. You can check that with another vector `x1 <- "11110001101" ; str_locate_all(pattern = "11", x1)` – akrun Jan 13 '18 at 12:29
    Ahh, I see, I need to read more on regexp, thanks again! But it seems that it would be reasonable setting pattern match element-wise for these functions ~~ – Oliver Jan 13 '18 at 12:30