How to set a regex in relation to overall length of string?

Question

I have a list of character strings:

> head(g_patterns_clean_strings)
[[1]]
[1] "1FAFA"

[[2]]
[1] "FA,TRFA"

[[3]]
[1] "FAEX"

I am trying to identify specific patterns in these character strings, like so:

library(devtools)
g_patterns_clean <- source_gist("164f798524fd6904236a")[[1]]
g_patterns_clean_strings <- source_gist("af70a76691aacf05c1bb")[[1]]

FA_logic_vector <- grepl(g_patterns_clean_strings, pattern = "(FA)+")
FA_cluster <- subset(g_patterns_clean, FA_logic_vector)

Now suppose I want to match only strings in which "FA" accounts for X% of the string's total length (e.g. "FA" makes up 25% of the characters in each matched string). How can I accomplish this?

Extract the thing you're after and do `nchar(the_thing)/nchar(the_whole_thing)` — Frank, Jun 29 '15 at 18:47
Isn't this a duplicate of http://stackoverflow.com/questions/31123048/how-to-allow-for-arbitrary-number-of-wildcards-in-regexes ? — vks, Jun 29 '15 at 18:49
@vks: read the questions at the end - they are totally different. — histelheim, Jun 29 '15 at 18:53
Yes, it's clear that they are different. The problem (for me) in both questions is that your example is grossly far from being minimal and self-contained. You want us to download your gist, really? You can't contain your example within the code block of the question itself? Here's the reference for asking good R questions: http://stackoverflow.com/a/28481250/1191259 — Frank, Jun 29 '15 at 18:53
@Frank: Why is it problematic to download the gist? There is no manual labor whatsoever required on your part - just run the code and it will get all the data for you. — histelheim, Jun 29 '15 at 18:54
It's (mildly) problematic because I don't want to do it, is all. I'm not saying you *can't* do it this way, but it's kind of nice to understand the question from reading it instead of running it, especially for regexes, where the actual values of variables are key. — Frank, Jun 29 '15 at 18:56
@Frank: Would it help if I expanded the `head(g_patterns_clean_strings)` provided in the question? — histelheim, Jun 29 '15 at 18:58
I would say `x <- c("1FAFA","FA,TRFA","FAEX")` is my input; `y <- c(TRUE,FALSE,TRUE)` is my output (or whatever) and this is the rule behind it, in words; and `some_code` is what I've tried and here's why I think it doesn't work yet. Make `x` & `y` only as large as necessary to match your use-case and maybe reference the gists as your full problem, but don't expect us to load and work with them just to grasp the problem. — Frank, Jun 29 '15 at 19:01
@Frank: There is merit to your idea. However, my experience with doing this is that it often misses important edge cases which are only revealed when using a larger sample of the data... — histelheim, Jun 29 '15 at 19:02
Okay, just my two cents. Again, there's nothing really problematic with how you're asking. — Frank, Jun 29 '15 at 19:03
@Frank: Thanks, anyhow - I understand the principle you are advocating in your answer (i.e. `nchar(the_thing)/nchar(the_whole_thing)`), but don't see right away how to implement it. Could you provide some further pointers? — histelheim, Jun 29 '15 at 19:04
The author of devtools, hadley, also has a package for string manipulations, `stringr`. One function in that package is `str_extract`, if that's the part you're missing. I'm not really a regex wizard and don't know all the intermediate steps required, so I'll leave a proper answer to someone else. :) — Frank, Jun 29 '15 at 19:06
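The `nchar(the_thing)/nchar(the_whole_thing)` idea from the comments can be sketched in base R roughly as follows. This is only an illustration: `pattern_share` is a hypothetical helper name, and the input is assumed to be a plain character vector (a list of strings such as the one in the question would need `unlist()` first).

```r
# Hypothetical helper: fraction of each string's characters that are
# covered by (non-overlapping) matches of `pattern`.
pattern_share <- function(strings, pattern) {
  # regmatches() + gregexpr() return, per string, all matched substrings
  hits <- regmatches(strings, gregexpr(pattern, strings))
  # Total number of matched characters per string (0 when nothing matched)
  matched_chars <- vapply(hits, function(m) sum(nchar(m)), numeric(1))
  matched_chars / nchar(strings)
}

x <- c("1FAFA", "FA,TRFA", "FAEX")
pattern_share(x, "FA")          # 4/5, 4/7 and 2/4
pattern_share(x, "FA") >= 0.25  # TRUE TRUE TRUE
```

Note that an empty string would give `0 / 0`, i.e. `NaN`, so such inputs may need separate handling.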

Nick Kennedy · Answer 1 · 2015-06-29T19:40:15.380

Here's a way of doing it with stringr and the pipe operator %>% from magrittr. The function below takes a vector of strings, a pattern, and a minimum proportion, and returns a logical vector of the same length as the input indicating whether matches of the pattern make up at least that proportion of each string's characters.

library("magrittr")
library("stringr")

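checkPatternProportion <- function(strings, pattern, proportion) {
  strings %>%
    # List with one character vector of matches per input string
    str_extract_all(pattern) %>%
    # Join each string's matches into a single string ("" if no match)
    vapply(paste, character(1), collapse = "") %>%
    # Share of matched characters, compared against the threshold
    {nchar(.) / nchar(strings) >= proportion}
}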

Usage:

set.seed(123)
myStrings <- replicate(100, c("AB", "FA", "GE", "DE") %>%
  sample(sample(1:8), replace = TRUE) %>%
  paste(collapse = ""))

head(myStrings, 10)
#  [1] "GEFADE"           "DEDEGEGE"         "DEGEDEABFADEABFA" "FAFA"             "DEDEFAGEABFAFA"  
#  [6] "GEABFAABFAGEFA"   "DE"               "DEABFAGEGEFAFADE" "ABDEGEAB"         "FADEABABAB"      

matches <- checkPatternProportion(myStrings, "FA", 0.25)
head(matches, 10)
# [1]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE
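As a sanity check on that output, the arithmetic for a few of the printed strings can be redone by hand. The sketch below uses a different, base-R-only technique (delete the matches and compare lengths) rather than the function from the answer; the variable names are illustrative only.

```r
# Strings 1, 2 and 10 from the head(myStrings, 10) output above
checked <- c("GEFADE", "DEDEGEGE", "FADEABABAB")

# Remove every literal "FA"; the drop in length is the number of matched characters
without_fa <- gsub("FA", "", checked, fixed = TRUE)
fa_share <- (nchar(checked) - nchar(without_fa)) / nchar(checked)

round(fa_share, 3)  # 0.333 0.000 0.200
fa_share >= 0.25    # TRUE FALSE FALSE
```

These agree with the first, second and tenth entries of `matches`.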

This works great. Could you explain the line `{nchar(.) / nchar(strings) >= proportion}`? a) Why are the `{}` brackets there; b) Is the `.` a regex? — histelheim, Jun 30 '15 at 08:53
The braces let you use an arbitrary expression, rather than a single function call, as a step in the chain. The `.` is not a regex: when using `%>%`, the `.` is effectively substituted by the output of the left-hand side. So `x %>% sin()` is equivalent to `x %>% sin(.)`. It also allows for use of functions and expressions where the desired position of the left-hand side output is not the first argument. E.g. `x %>% as.character %>% gsub(" ", "", .)`. Here, `gsub` expects the input character vector to be the third parameter. — Nick Kennedy, Jun 30 '15 at 10:26
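A small sketch of the three forms discussed in this comment, assuming magrittr is installed; the input vector and variable names are made up for illustration.

```r
library("magrittr")

x <- c("FA FA", "TR FA")

# Plain call: the left-hand side becomes the first argument, i.e. nchar(x)
n_chars <- x %>% nchar()

# Dot as a direct argument: the left-hand side goes where the dot is,
# i.e. gsub(" ", "", x)
no_spaces <- x %>% gsub(" ", "", .)

# Braces: the whole expression is evaluated with `.` bound to the left-hand side
halved <- x %>% {nchar(.) / 2}

n_chars    # 5 5
no_spaces  # "FAFA" "TRFA"
halved     # 2.5 2.5
```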
