Identifying strings based on where substrings appear in the string

Question

Imagine I have a set of strings, say:

#1: "A-B-B-C-C"
#2: "A-A-A-A-A-A-A"
#3: "B-B-B-C-A-A"

Now I want to check whether certain patterns occur in the first, middle, or last third in the sequence. Hence, I want to be able to formulate a rule of the kind:

Match the string if, and only if, 
marker X occurs in the first/middle/last third of the string

For example, I may want to match strings that have an A in the first third. Considering the sequences above, this would match #1 and #2. I could also want to match strings that have an A in the last third. This would match #2 and #3.

How can I write generic code or a regex pattern that takes various rules of this kind as input and then matches the appropriate subsequences?

Doesn't sound like something to be solved with regex. Defining the rules as functions that operate on the input string is more flexible. — nhahtdh, Jul 16 '15 at 09:52
@nhahtdh: it probably needs both functions and regexes (since whatever it is that I want to match has to be defined with a regex, even if it is a simple one). — histelheim, Jul 16 '15 at 09:57
I don't think there is a way for a regular expression to divide a string into thirds dynamically because regex can't _count_. You could, however, dynamically construct a regex quantifier from the string length known at runtime (divided by 3). Then finding what you want is trivial. — , Jul 21 '15 at 23:44
Why not split by `-`, [slice the array](http://stackoverflow.com/q/2123968/7586), and look for the element? Also, what do you mean by "certain patterns occur" - what patterns, besides just `A`, `B`, or `C`? — Kobi, Jul 22 '15 at 05:17
like @Kobi said, why not just split and look for whatever you want? `which(grepl('A', sp <- strsplit('A-B-B-C-C', '-')[[1]])) / length(sp)` returns .2 meaning that A occurs in the first third of the string. and `grep` takes regular expressions so you can use something other than just "A" but it is not clear from your question why you would need anything more complicated than this. also you should show what you have tried — rawr, Jul 23 '15 at 23:42
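
The split-and-slice idea from the comments above can be sketched in base R as follows. This is only an illustration, not code from any of the commenters: the helper name `token_in_third` and the rule for assigning tokens to parts (`ceiling(position * n_parts / n)`) are assumptions, and other ways of cutting uneven lengths into thirds are equally defensible.

```r
# Hypothetical helper: does `marker` occur among the tokens of the requested third?
token_in_third <- function(x, marker, third, n_parts = 3L) {
  vapply(strsplit(x, "-", fixed = TRUE), function(tokens) {
    n <- length(tokens)
    # Assumed rule: token i belongs to part ceiling(i * n_parts / n)
    part <- ceiling(seq_along(tokens) * n_parts / n)
    any(tokens == marker & part == third)
  }, logical(1))
}

x <- c("A-B-B-C-C", "A-A-A-A-A-A-A", "B-B-B-C-A-A")
x[token_in_third(x, "A", 1L)]  # strings with an A in the first third
x[token_in_third(x, "A", 3L)]  # strings with an A in the last third
```

With this rule the first call selects #1 and #2 and the second selects #2 and #3, as the question expects. Because it compares whole tokens with `==`, it only handles literal markers; a regex marker would need `grepl()` on the tokens instead.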

David Arenburg · Accepted Answer · 2015-07-22T07:53:52.513

Here's a fully vectorized attempt (you can play around with the settings and tell me if you want to add/change something)

StriDetect <- function(x, seg = 1L, pat = "A", frac = 3L, fixed = TRUE, values = FALSE){
  xsub <- gsub("-", "", x, fixed = TRUE)
  sizes <- nchar(xsub) / frac
  subs <- substr(xsub, sizes * (seg - 1L) + 1L, sizes * seg)
  if(isTRUE(values)) x[grep(pat, subs, fixed = fixed)] else grep(pat, subs, fixed = fixed)
}

Testing on your vector

x <- c("A-B-B-C-C", "A-A-A-A-A-A-A", "B-B-B-C-A-A")
StriDetect(x, 1L, "A")
## [1] 1 2
StriDetect(x, 3L, "A")
## [1] 2 3

Or if you want the actual matched strings

StriDetect(x, 1L, "A", values = TRUE)
## [1] "A-B-B-C-C"     "A-A-A-A-A-A-A"
StriDetect(x, 3L, "A", values = TRUE)
## [1] "A-A-A-A-A-A-A" "B-B-B-C-A-A"

Please note that when the string length (counted after the hyphens are removed) doesn't divide exactly by 3, for example 10 characters, the last third is the largest group by default (size 4 in that example), because substr truncates the fractional start and stop positions.
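
To see where the segment boundaries fall for an uneven length, here is a small sketch that reproduces the index arithmetic of `StriDetect`. The helper `segment_bounds` is hypothetical and exists only for illustration; it relies on `substr()` coercing its `start` and `stop` arguments with `as.integer()`, which truncates toward zero.

```r
# Hypothetical helper mirroring the start/stop arithmetic used in StriDetect
segment_bounds <- function(n_chars, frac = 3L) {
  size <- n_chars / frac          # fractional segment size, e.g. 10 / 3
  seg <- seq_len(frac)
  data.frame(
    seg   = seg,
    start = as.integer(size * (seg - 1L) + 1L),  # truncated like substr() does
    stop  = as.integer(size * seg)
  )
}

segment_bounds(10)
##   seg start stop
## 1   1     1    3
## 2   2     4    6
## 3   3     7   10
```

So for 10 characters the segments have sizes 3, 3 and 4.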

score 2 · Answer 2 · answered Jul 22 '15 at 17:11

Here's a solution which generates regexes to meet the desired requirements. Note regexes can count, but they can't count relative to the total string. So this generates a custom regex for each input string based on its length. I've used the stringi::stri_detect_regex rather than grep since the latter isn't vectorised on the pattern term. I've also assumed that the pattern argument is itself a valid regular expression and that any characters that need escaping (e.g. [, .) are escaped.

library("stringi")
strings <- c("A-B-B-C-C", "A-A-A-A-A-A-A", "B-B-B-C-A-A")
get_regex_fn_fractions <- function(strings, pattern, which_fraction, n_groups = 3) {
  before <- round(nchar(strings) / n_groups * (which_fraction - 1))
  after <- round(nchar(strings) / n_groups * (n_groups - which_fraction))
  sprintf("^.{%d}.*%s.*.{%d}$", before, pattern, after)
}
(patterns <- get_regex_fn_fractions(strings, "A", 1))
#[1] "^.{0}.*A.*.{6}$" "^.{0}.*A.*.{9}$" "^.{0}.*A.*.{7}$"

# Test the regexes:
stri_detect_regex(strings, patterns)
#[1]  TRUE  TRUE FALSE
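
For comparison, the same kind of generated patterns can be checked for the last third without stringi. This is a sketch under assumptions: `build_fraction_regex` is a hypothetical copy of the generator above (repeated so the block runs on its own), and `mapply()` over base `grepl()` stands in for the vectorised `stri_detect_regex`. Note that the hyphens count as characters here, so the thirds are taken over the raw string.

```r
# Hypothetical copy of the regex generator, so this block is self-contained
build_fraction_regex <- function(strings, pattern, which_fraction, n_groups = 3) {
  before <- round(nchar(strings) / n_groups * (which_fraction - 1))
  after  <- round(nchar(strings) / n_groups * (n_groups - which_fraction))
  sprintf("^.{%d}.*%s.*.{%d}$", before, pattern, after)
}

strings <- c("A-B-B-C-C", "A-A-A-A-A-A-A", "B-B-B-C-A-A")
last_third <- build_fraction_regex(strings, "A", 3)

# grepl() is not vectorised over patterns, so pair each pattern with its string
mapply(grepl, last_third, strings, MoreArgs = list(perl = TRUE), USE.NAMES = FALSE)
#[1] FALSE  TRUE  TRUE
```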

lukeA · Answer 3 · 2015-07-22T07:08:43.723

1

Here's one option:

f <- function(txts, needle, operator, threshold) {
  require(stringi)
  txts <- gsub("-", "", txts, fixed = TRUE)             # delete '-'s
  matches <- stri_locate_all_fixed(txts, needle)        # find matches 
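  ends <- lapply(matches, function(x) x[, "end"])       # extract end positions of matches (same as start for a one-character needle)
  ends <- mapply("/", ends, nchar(txts) + 1, SIMPLIFY = FALSE)  # divide by string length + 1; SIMPLIFY = FALSE keeps a list even when all strings have the same number of matches
  which(sapply(mapply(operator, ends, threshold, SIMPLIFY = FALSE), any)) # return indices of strings with at least one match that satisfies operator and threshold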
}
txts <- c("A-A-B-B-C-C", "A-A-A-A-A-A", "B-B-B-C-A-A")
idx <- f(txts, needle = "A", operator = "<=", threshold = .333)
txts[idx]
# [1] "A-A-B-B-C-C" "A-A-A-A-A-A"

edited Jul 22 '15 at 07:08

answered Jul 16 '15 at 09:57


Could you explain the function of the `operator` argument? Can I also use `=>` and `=` here? – histelheim Jul 21 '15 at 17:31
it doesn't seem to work when I try it - can you clarify how to use the function? Some examples would be helpful. – histelheim Jul 21 '15 at 17:55
I meant `>=` and `==`. See `?Compare`. E.g. `txts <- c("A-A-B-D-B-C-C", "A-D-A", "B-B-D-B-C-A-A"); f(txts, needle = "D", operator = "==", threshold = .50); f(txts, needle = "C", operator = ">=", threshold = 6/7); f(txts, needle = "B", operator = "<=", threshold = 1/7)`. – lukeA Jul 22 '15 at 07:08
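
To make the role of the `operator` argument more concrete, here is a base-R sketch of the same idea. It is not lukeA's function: the name `relative_hits` is hypothetical, it uses `gregexpr()` start positions instead of stringi end positions (identical for a one-character needle), and it resolves the operator string explicitly with `match.fun()`.

```r
# Hypothetical base-R variant: compare relative match positions with a threshold
relative_hits <- function(txts, needle, operator, threshold) {
  compare <- match.fun(operator)               # e.g. "<=" becomes the function `<=`
  txts <- gsub("-", "", txts, fixed = TRUE)    # delete '-'s
  hits <- gregexpr(needle, txts, fixed = TRUE) # start positions, -1 if no match
  keep <- mapply(function(pos, n) {
    # relative position = position / (length + 1), as in the answer above
    any(pos > 0 & compare(pos / (n + 1), threshold))
  }, hits, nchar(txts))
  which(keep)
}

txts <- c("A-A-B-D-B-C-C", "A-D-A", "B-B-D-B-C-A-A")
relative_hits(txts, "D", "==", 0.5)   # D exactly at relative position 0.5
# [1] 1 2
relative_hits(txts, "C", ">=", 6/7)   # C at or beyond 6/7
# [1] 1
relative_hits(txts, "B", "<=", 1/7)   # B at or before 1/7
# [1] 3
```

Any binary comparison from `?Compare` can be passed as a string this way. Exact equality (`==`) on fractions is fragile in floating point, so it is only reliable for values such as 0.5 that are represented exactly.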
