
I can't seem to figure out how to match identical characters in regex in R. Suppose I have this data:

dt <- c("12345", "asdf", "#*§", "AAAA", ";;;;", "9999", "%:=+")

I'm able to extract every run of 4 consecutive non-whitespace characters, for example like this:

pattern <- "\\S{4}"
extract <- function(x) unlist(regmatches(x, gregexpr(pattern, x, perl = TRUE)))
extract(dt)
[1] "1234" "asdf" "AAAA" ";;;;" "9999" "%:=+"

But what I really want to match are those strings in which the same character is repeated 4 times, giving this output:

[1] "AAAA" ";;;;" "9999" 

Any ideas?

Chris Ruehlemann
  • Change the quantifier in [this](https://stackoverflow.com/q/38263441/5325862) to `{3}` – camille Feb 07 '20 at 16:38
  • Try a capture group and reference it... It gets a little tricky because you need to capture it then use a back-reference to look for 3 more like this: `(\\S)\\1{3}` so you have 4 characters total. – dvo Feb 07 '20 at 16:39
  • @dvo Thanks a lot: `grep("(\\S)\\1{3}", dt, value = T)` outputs `[1] "AAAA" ";;;;" "9999"`, i.e., the desired result – Chris Ruehlemann Feb 07 '20 at 16:50
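
For reference, the capture-group idea from the comments can be sketched as below. This is a minimal example, not a definitive answer: `extract_runs` is a hypothetical helper name modelled on the `extract()` function in the question, and the anchored variant at the end assumes you want the whole string to be exactly 4 identical characters, which the question does not state outright.

```r
dt <- c("12345", "asdf", "#*§", "AAAA", ";;;;", "9999", "%:=+")

# (\\S) captures one non-whitespace character; \\1{3} requires that same
# character three more times, so a match is a run of 4 identical characters.
pattern <- "(\\S)\\1{3}"

# Keep the elements that contain such a run anywhere in the string.
grep(pattern, dt, value = TRUE, perl = TRUE)

# Extract only the matched runs (hypothetical helper, same approach as extract()).
extract_runs <- function(x) unlist(regmatches(x, gregexpr(pattern, x, perl = TRUE)))
extract_runs(dt)

# Anchored variant (assumption about intent): the entire string must consist
# of exactly 4 identical characters, so "AAAAA" or "xAAAA" would not match.
grep("^(\\S)\\1{3}$", dt, value = TRUE, perl = TRUE)
```

With the sample data all three calls should return `"AAAA" ";;;;" "9999"`, matching the output reported in the last comment. The unanchored and anchored patterns only differ on inputs where the run is embedded in a longer string.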
