
I'm noticing some odd behavior with R regex quantifiers written as either {min, max} (as recommended in the stringr cheatsheet) or {min - max} when using the pointblank package. I expect the regexes to work with {min, max} and fail with {min - max}. However, in the two examples below, one works with {min, max} and the other works with {min - max}.

Example 1 behaves as expected: pattern_comma works and pattern_dash does not. But example 2 behaves unexpectedly: doi_pattern_comma does not work and doi_pattern_dash does.

Any suggestions about this regex? Or might this be a bug in pointblank (in which case I can open an issue there)?

Thank you, SO community!

library(dplyr)
library(stringr)
library(pointblank)

# EXAMPLE 1
df1 <- tibble(x = c("123", "68"))
pattern_comma <- "^\\d{1,3}$"
pattern_dash <- "^\\d{1-3}$"

stringr::str_detect(df1$x, pattern_comma) #pass

#> [1] TRUE TRUE

stringr::str_detect(df1$x, pattern_dash)  #fail

#> Error in stri_detect_regex(string, pattern, negate = negate, opts_regex = opts(pattern)): Error in {min,max} interval. (U_REGEX_BAD_INTERVAL, context=`^\d{1-3}$`)


#pass
df1 %>% 
  pointblank::col_vals_regex(
    vars(x), 
    pattern_comma
  )

#> # A tibble: 2 x 1
#>   x    
#>   <chr>
#> 1 123  
#> 2 68


#fail
df1 %>% 
  pointblank::col_vals_regex(
    vars(x), 
    pattern_dash
  )

#> Error: Exceedance of failed test units where values in `x` should have matched the regular expression: `^\d{1-3}$`.
#> The `col_vals_regex()` validation failed beyond the absolute threshold level (1).
#> * failure level (2) >= failure threshold (1)



# EXAMPLE 2
df2 <- tibble(doi = c("10.1186/s12872-020-01551-9", "10.1002/cpp.1968"))
doi_pattern_comma <- "^10\\.\\d{4,9}/[-.;()/:\\w\\d]+$"
doi_pattern_dash <- "^10\\.\\d{4-9}/[-.;()/:\\w\\d]+$"

stringr::str_detect(df2$doi, doi_pattern_comma) #pass

#> [1] TRUE TRUE

stringr::str_detect(df2$doi, doi_pattern_dash)  #fail

#> Error in stri_detect_regex(string, pattern, negate = negate, opts_regex = opts(pattern)): Error in {min,max} interval. (U_REGEX_BAD_INTERVAL, context=`^10\.\d{4-9}/[-.;()/:\w\d]+$`)


#fail
df2 %>% 
  col_vals_regex(
    vars(doi), 
    doi_pattern_comma
  )

#> Error: Exceedance of failed test units where values in `doi` should have matched the regular expression: `^10\.\d{4,9}/[-.;()/:\w\d]+$`.
#> The `col_vals_regex()` validation failed beyond the absolute threshold level (1).
#> * failure level (2) >= failure threshold (1)


#pass
df2 %>% 
  col_vals_regex(
    vars(doi), 
    doi_pattern_dash
  )

#> # A tibble: 2 x 1
#>   doi                       
#>   <chr>                     
#> 1 10.1186/s12872-020-01551-9
#> 2 10.1002/cpp.1968

Created on 2021-05-09 by the reprex package (v0.3.0)

maia-sh
  • Where did you learn the `{n-m}` syntax is supported by `col_vals_regex`? `df1 %>% pointblank::col_vals_regex(vars(x),pattern_comma)` works. – Wiktor Stribiżew May 09 '21 at 14:01
  • Hi @WiktorStribiżew, `pattern_comma` works in example 1 (as expected) but `doi_pattern_comma` does not work in example 2 (and `doi_pattern_dash` does). That's the mystery to me! – maia-sh May 09 '21 at 18:09

1 Answer

There is no room for doubt here: a {min-max} quantifier does not exist, and you need to use {min,max}. \d{4-9} throws an error (try it with sub and you will get invalid regular expression '\d{4-9}', reason 'Invalid contents of {}').

The second issue is that the regex is parsed with the default TRE regex engine, where shorthand character classes like \w or \d are not recognized inside bracket expressions. You need to use [:alnum:]_ instead of \w inside square brackets ([:alnum:] already covers digits, so the \d can be dropped).

Now that you know the right regex:
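The engine difference can be checked in base R without pointblank. This is a minimal sketch that assumes base `grepl()` (TRE by default, PCRE with `perl = TRUE`) behaves the same way as the engine used by `col_vals_regex`; the variable names are only for illustration.

```r
# The two DOIs from the question
dois <- c("10.1186/s12872-020-01551-9", "10.1002/cpp.1968")

# Default TRE engine: \w inside [...] is not treated as a word-character class,
# so letters and digits in the DOI suffix are not matched
tre_shorthand <- grepl("^10\\.\\d{4,9}/[-.;()/:\\w]+$", dois)

# PCRE engine: \w inside [...] is a word-character class
pcre_shorthand <- grepl("^10\\.\\d{4,9}/[-.;()/:\\w]+$", dois, perl = TRUE)

# Default TRE engine with a POSIX class in place of the shorthand
tre_posix <- grepl("^10\\.\\d{4,9}/[-.;()/:[:alnum:]_]+$", dois)

print(tre_shorthand)
print(pcre_shorthand)
print(tre_posix)
```

Only the first call should fail to match the DOIs, which is consistent with the FALSE result reported for the `\\w` pattern in the `test_col_vals_regex` output.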

"^10\\.\\d{4,9}/[-.;()/:[:alnum:]_]+$"

you can dive deeper.

You can see what results you get if you use test_col_vals_regex:

> df2 %>% test_col_vals_regex(vars(doi), "^10\\.\\d{4,9}/[-.;()/:[:alnum:]_]+$")
[1] TRUE
> df2 %>% test_col_vals_regex(vars(doi), "^10\\.\\d{4-9}/[-.;()/:[:alnum:]_]+$")
[1] NA
> df2 %>% test_col_vals_regex(vars(doi), "^10\\.\\d{4,9}/[-.;()/:\\w]+$")
[1] FALSE
> df2 %>% test_col_vals_regex(vars(doi), "^10\\.\\d{4-9}/[-.;()/:\\w]+$")
[1] NA

So, in every case where the regex is malformed, the test returns NA; the validation for those items is skipped, and they end up passing.

CONCLUSION: Always test your regex patterns for validity before using them in col_vals_regex.
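One way to run such a validity check up front is a small helper that tries to compile the pattern and reports whether that succeeded. The sketch below uses base R only; `is_valid_regex` is a hypothetical helper name, not part of pointblank or stringr, and it checks against the default TRE engine unless `perl = TRUE` is passed.

```r
# Hypothetical helper: TRUE if `pattern` compiles, FALSE if the engine rejects it.
# Matching against an empty string is enough to force compilation.
is_valid_regex <- function(pattern, perl = FALSE) {
  tryCatch(
    {
      # suppressWarnings() hides the compilation warning that accompanies the error
      suppressWarnings(grepl(pattern, "", perl = perl))
      TRUE
    },
    error = function(e) FALSE
  )
}

comma_ok <- is_valid_regex("^\\d{1,3}$")  # valid {min,max} quantifier
dash_ok  <- is_valid_regex("^\\d{1-3}$")  # invalid {min-max} quantifier

print(comma_ok)
print(dash_ok)
```

A check like this only catches patterns that fail to compile. It would not flag the `[\\w]` case, which compiles under TRE but matches something other than what was intended, so testing the pattern against known-good values is still worthwhile.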

Wiktor Stribiżew