I recently asked a question about building a complex regular expression to split a string. I was working in base R, so everything worked fine. However, I now want to use the same expression in another part of my code that follows the tidyverse conventions (I want to use tidyr::separate_rows), and it doesn't work because my pattern is PCRE and stringr uses only the ICU library.
Reproducible example:
vec <- c("'01'", "'01' '02'",
"#bateau", "#bateau #batiment",
"#'autres 32'", "#'autres 32' #'batiment 30'", "#'autres 32' #'batiment 30' #'contenu 31'",
"#'34'", "#'34' #'33' #'35'")
I need to split each of the strings above at every space, except when the space sits between single quotes ('). @Wiktor Stribiżew kindly answered my question and gave me the pattern '[^']*'(*SKIP)(*F)|\\s+ which worked perfectly in a strsplit call:
strsplit(vec, "'[^']*'(*SKIP)(*F)|\\s+", perl=TRUE)
[[1]]
[1] "'01'"
[[2]]
[1] "'01'" "'02'"
[[3]]
[1] "#bateau"
[[4]]
[1] "#bateau" "#batiment"
[[5]]
[1] "#'autres 32'"
[[6]]
[1] "#'autres 32'" "#'batiment 30'"
[[7]]
[1] "#'autres 32'" "#'batiment 30'" "#'contenu 31'"
[[8]]
[1] "#'34'"
[[9]]
[1] "#'34'" "#'33'" "#'35'"
However, when I tried the same pattern in a tidyverse function, I got this error:
stringr::str_split(vec, "'[^']*'(*SKIP)(*F)|\\s+")
Error in stri_split_regex(string, pattern, n = n, simplify = simplify, :
Syntax error in regexp pattern. (U_REGEX_RULE_SYNTAX)
Here again, @Wiktor Stribiżew was kind enough to explain the problem: this is a PCRE expression, whereas the tidyverse uses ICU.
Is there a way to make my expression work in the tidyverse? If not, what expression would work? Please note that my example uses strsplit because it makes the problem simpler to explain. In the end, however, I want to use tidyr::separate_rows, which is why I need a tidyverse-compatible solution.
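One direction that might work, offered only as a sketch: the (*SKIP)(*F) verbs are PCRE-specific, but a lookahead is understood by both PCRE and ICU. The pattern below splits on whitespace only when an even number of single quotes remains up to the end of the string, which should mean the whitespace is outside any quoted chunk. It assumes the quotes in each string are always balanced. The names icu_pattern, df and tags are made up for illustration, and the separate_rows call is guarded because tidyr may not be installed.

```r
library(stringr)

vec <- c("'01'", "'01' '02'",
         "#bateau", "#bateau #batiment",
         "#'autres 32'", "#'autres 32' #'batiment 30'",
         "#'34' #'33' #'35'")

# Split on whitespace that is followed by an even number of single quotes
# up to the end of the string (assumption: quotes are always balanced).
icu_pattern <- "\\s+(?=(?:[^']*'[^']*')*[^']*$)"

# stringr hands the pattern to the ICU engine, so no PCRE verbs are involved.
res <- str_split(vec, icu_pattern)
print(res)

# The intended end use: the same pattern passed as `sep` to separate_rows().
# `df` and `tags` are hypothetical names for this example.
if (requireNamespace("tidyr", quietly = TRUE)) {
  df <- data.frame(id = seq_along(vec), tags = vec)
  print(tidyr::separate_rows(df, tags, sep = icu_pattern))
}
```

Because the lookahead rescans the rest of the string at every candidate space, this is likely slower than the PCRE version on long strings, and it will misbehave if a string contains an unmatched quote.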