Change regular expression from PCRE to ICU to comply with stringr usage

Question

Recently I asked a question about building a complex regular expression to split a string (here). I was working in base R, so everything worked fine. However, I now want to use the same expression in another part of my code that follows the tidyverse "environment" (I want to use tidyr::separate_rows), and it doesn't work because my pattern is PCRE and stringr uses only the ICU library.

Reproducible example:

vec <- c("'01'", "'01' '02'", 
         "#bateau", "#bateau #batiment",
         "#'autres 32'", "#'autres 32' #'batiment 30'", "#'autres 32' #'batiment 30' #'contenu 31'",
         "#'34'", "#'34' #'33' #'35'")

I have the strings above, which I need to split at every space, except when the space is between two ' characters. @Wiktor Stribiżew kindly answered my question and gave me this pattern, '[^']*'(*SKIP)(*F)|\\s+, which worked perfectly in a strsplit call:

strsplit(vec, "'[^']*'(*SKIP)(*F)|\\s+", perl=TRUE)
[[1]]
[1] "'01'"

[[2]]
[1] "'01'" "'02'"

[[3]]
[1] "#bateau"

[[4]]
[1] "#bateau"   "#batiment"

[[5]]
[1] "#'autres 32'"

[[6]]
[1] "#'autres 32'"   "#'batiment 30'"

[[7]]
[1] "#'autres 32'"   "#'batiment 30'" "#'contenu 31'" 

[[8]]
[1] "#'34'"

[[9]]
[1] "#'34'" "#'33'" "#'35'"

However, when I tried the same pattern in a tidyverse function, I got this error:

stringr::str_split(vec, "'[^']*'(*SKIP)(*F)|\\s+")
Error in stri_split_regex(string, pattern, n = n, simplify = simplify,  : 
  Syntax error in regexp pattern. (U_REGEX_RULE_SYNTAX)

Here, @Wiktor Stribiżew was again kind enough to explain the problem: this is a PCRE expression, while the tidyverse uses ICU.

Is there a way to make my expression work in the tidyverse? If not, what expression would work? Please note that my example uses strsplit because it makes the problem simpler to explain. In the end, however, I want to use the tidyr::separate_rows function, which is why I need a tidyverse-compatible solution.

score 1 · Accepted Answer · answered Apr 03 '20 at 21:03

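Given your example, you could simply target and split spaces followed by a # with this regex: \\s(?=#). Note that this only covers the entries that use #; an entry such as "'01' '02'" would stay in one piece.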
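As a quick sketch of how that lookahead behaves, here it is with stringr::str_split, the call that raised the error in the question. The vector name `tagged` is only illustrative, and the last entry is included to show the one case from the question that this pattern leaves alone.

```r
library(stringr)

# Illustrative subset of the question's data
tagged <- c("#bateau #batiment",
            "#'autres 32' #'batiment 30'",
            "#'34' #'33' #'35'",
            "'01' '02'")

# Split on a single whitespace character only when the next character is '#'.
# Lookaheads are valid ICU syntax, so stringr accepts this pattern.
parts <- str_split(tagged, "\\s(?=#)")

# The first three entries are split between tags; the spaces inside quotes
# are untouched because they are not followed by '#'.
# The fourth entry has no '#', so it comes back as one piece.
print(parts)
```

If all of your real values carry a leading #, this is probably the simplest option; otherwise the more flexible approach is the safer choice.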

If you need something more flexible, one solution is to first match the spaces you want to split on using your previous regex '[^']*'(*SKIP)(*F)|\\s+ with gsub, which accepts Perl regular expressions. Replace the matched spaces with an anchor (a unique character or sequence of characters that does not occur in your data), then separate your rows on this anchor.

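library(dplyr)  # tibble(), mutate(), %>%
library(tidyr)  # separate_rows()

vec <- c("'01'", "'01' '02'", 
         "#bateau", "#bateau #batiment",
         "#'autres 32'", "#'autres 32' #'batiment 30'", "#'autres 32' #'batiment 30' #'contenu 31'",
         "#'34'", "#'34' #'33' #'35'")

vec %>% 
  tibble(my_col = .) %>% 
  # Mark every space outside quotes with the anchor "_-_" (PCRE, via base R)
  mutate(my_col = gsub("'[^']*'(*SKIP)(*F)|\\s+", "_-_", my_col, perl = TRUE)) %>% 
  # Split on the anchor, which needs no PCRE-only syntax
  separate_rows(my_col, sep = "_-_")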

# A tibble: 16 x 1
   my_col        
   <chr>         
 1 '01'          
 2 '01'          
 3 '02'          
 4 #bateau       
 5 #bateau       
 6 #batiment     
 7 #'autres 32'  
 8 #'autres 32'  
 9 #'batiment 30'
10 #'autres 32'  
11 #'batiment 30'
12 #'contenu 31' 
13 #'34'         
14 #'34'         
15 #'33'         
16 #'35'
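If you would rather avoid the intermediate anchor, a possible alternative is a pattern that stays within ICU syntax. The sketch below is not the method described above: it swaps the (*SKIP)(*F) trick for a lookahead that only accepts whitespace followed by an even number of quotes up to the end of the string. It assumes the quotes in each string are balanced, and `icu_pattern` is just an illustrative name. It is shown with stringr::str_split and compared against the base R result from the question.

```r
library(stringr)

vec <- c("'01'", "'01' '02'",
         "#bateau", "#bateau #batiment",
         "#'autres 32'", "#'autres 32' #'batiment 30'",
         "#'autres 32' #'batiment 30' #'contenu 31'",
         "#'34'", "#'34' #'33' #'35'")

# Assumption: quotes are balanced. Whitespace inside a quoted part is followed
# by an odd number of quotes, so the lookahead fails there and no split happens.
icu_pattern <- "\\s+(?=(?:[^']*'[^']*')*[^']*$)"

icu_split  <- str_split(vec, icu_pattern)

# Reference result from the PCRE pattern used in the question
pcre_split <- strsplit(vec, "'[^']*'(*SKIP)(*F)|\\s+", perl = TRUE)

# Compare the number of pieces per element and one of the longer splits
print(lengths(icu_split))
print(lengths(pcre_split))
print(icu_split[[7]])
```

The trade-off is readability: the lookahead rescans the rest of the string for every whitespace match, and it silently misbehaves on unbalanced quotes, so the gsub approach may still be the more robust choice.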

That is clever, I really like the `gsub` approach. Thanks! – Bastien, Apr 06 '20 at 16:55
