I am trying to parse a file with regex in R. Most of the solutions for using regex in R rely on the stringr package, and I have not found another approach or package that would work. If you have another way of going about this, that would also be acceptable.

What I am trying to accomplish is to grab a couple of values that are separated by spaces, with the last value being a comma-separated list of variable length. The result should go into a matrix or data frame, in the same table-like layout it has now.

foo     foo_123bar      foo,bar,bazz
foo2    foo_456bar      foo2,bar2

I have the working example of my regex here.

There are a couple of issues I could be running into. The first is that the regex I am writing may not be supported by R's regex engine, although I have the feeling from this that it would be supported. I have seen that R uses a POSIX-like format, which could make things interesting. The second could simply be exactly what the error message below is showing: this is a feature that has not been implemented yet. That would be the most troubling, because I don't know another way to solve my problem without this package.

Below is the R code that I am using to reproduce this error:

library("stringr")

string = " foo  foo_123bar      foo,bar,bazz\n  foo2    foo_456bar      foo2,bar2,bazz2"

pattern = "
  (?(DEFINE)
    (?<blanks>[[:blank:]]+)
    (?<var>\"?[[:alnum:]_]+\"?)
    (?<csvar>(\"?[[:alnum:]_]+\"?,?)+)
  )
  ^
    (?&blanks)((?&var))
    (?&blanks)((?&var))
    (?&blanks)((?&csvar))"

# Both of these are throwing the error
str_extract_all(string, pattern)
str_extract_all(string, regex(pattern, multiline=TRUE, comments=TRUE))

> Error in stri_extract_all_regex(string, pattern, simplify = simplify,  : 
> Use of regexp feature that is not yet implemented. (U_REGEX_UNIMPLEMENTED)


# Using the example from ?str_extract_all runs without error
shopping_list <- c("apples x4", "bag of flour", "bag of sugar", "milk x2")
str_extract_all(shopping_list, "\\b[a-z]+\\b", simplify = TRUE)

I am looking for a solution, not necessarily a stringr solution, but this is the only approach I found that fits my needs. The other, simpler R regex functions accept only the pattern and not the extra parameters for the multiline and comment functionality that I am using.

user2716722
  • You are trying to parse a PCRE-specific regex with an ICU regex library. That is impossible. Either use it with the base R `regmatches` or re-vamp to follow the ICU syntax. ICU does not support recursion, so you can't reuse the patterns the way you did in the PCRE pattern. – Wiktor Stribiżew Aug 24 '17 at 19:24
  • Does this work as expected - https://ideone.com/lT6RxR? – Wiktor Stribiżew Aug 24 '17 at 19:29
  • There are too many engines, all with their own rules. I was using the DEFINE-style recursion to make my regex easier to read and understand. In actuality I have 7+ groups of these values and this can only get messier. Is there a way in ICU to make more modular regex? @Wiktor You replied as I was commenting. This does work. It seems to undermine the whole stringr library but I am okay with that. – user2716722 Aug 24 '17 at 19:31
  • You may define the regex parts as variables, then just `paste0` them (i.e. build the pattern dynamically). – Wiktor Stribiżew Aug 24 '17 at 19:33

1 Answer

You have a PCRE regex, which can only be used in functions that parse the pattern with the PCRE library (or with another engine that supports the same constructs, such as Boost.Regex). stringr's str_extract parses the regex with the ICU regex library, and ICU does not support recursion or the DEFINE block. You just can't use the in-pattern approach to define subpatterns and then re-use them in an ICU pattern.
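If keeping the DEFINE block matters more than using stringr, one option is base R with `perl = TRUE`, which hands the pattern to PCRE. The following is only a sketch under a few assumptions: the input is split into lines first, the inline `(?x)` flag stands in for `comments = TRUE`, the inner group of `csvar` is made non-capturing so the group numbering stays simple, and `regexec()` needs a version of R that accepts `perl = TRUE`. The names `text`, `pcre_pattern`, and `fields` are made up for illustration.

```r
# Sketch: keep the PCRE-only DEFINE block by matching line by line in base R.
text <- " foo  foo_123bar      foo,bar,bazz\n  foo2    foo_456bar      foo2,bar2,bazz2"

# (?x) turns on free-spacing mode so the pattern can span several lines.
pcre_pattern <- "(?x)
  (?(DEFINE)
    (?<blanks>[[:blank:]]+)
    (?<var>\"?[[:alnum:]_]+\"?)
    (?<csvar>(?:\"?[[:alnum:]_]+\"?,?)+)
  )
  ^
    (?&blanks)((?&var))
    (?&blanks)((?&var))
    (?&blanks)((?&csvar))"

# Split into lines so that ^ anchors at the start of each record.
lines <- strsplit(text, "\n", fixed = TRUE)[[1]]
matches <- regmatches(lines, regexec(pcre_pattern, lines, perl = TRUE))

# Element 1 is the whole match, elements 2-4 belong to the DEFINE'd names,
# and elements 5-7 hold the three captured fields.
fields <- t(vapply(matches, function(m) m[5:7], character(3)))
```

Here `fields` is a character matrix with one row per input line. A line that does not match yields a row of `NA` values rather than an error.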

With stringr, declare the regex parts you need to re-use as variables and build the pattern dynamically:

library("stringr")
string = " foo  foo_123bar      foo,bar,bazz\n  foo2    foo_456bar      foo2,bar2,bazz2"
blanks <- "[[:blank:]]+"
vars <- "\"?[[:alnum:]_]+\"?"
csvar <- "(?:\"?[[:alnum:]_]+\"?,?)+"
pattern <- paste0("^",blanks,"(", vars, ")",blanks,"(", vars,")",blanks,"(",csvar, ")")
str_match_all(string, pattern)
# [[1]]
#     [,1]                                 [,2]  [,3]         [,4]          
#[1,] " foo  foo_123bar      foo,bar,bazz" "foo" "foo_123bar" "foo,bar,bazz"

Note: you need to use str_match (or str_match_all) to extract the capturing group values, as str_extract and str_extract_all only give access to the whole match values.
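The output above contains only the first line, because without the multiline flag `^` matches only at the very start of the string. A possible way to pick up every line and get the table-like result the question asks for is sketched below. It assumes the same sample string, wraps the pattern in stringr's `regex(..., multiline = TRUE)`, and uses made-up column names (`name`, `id`, `values`).

```r
library(stringr)

text <- " foo  foo_123bar      foo,bar,bazz\n  foo2    foo_456bar      foo2,bar2,bazz2"

# Reusable pieces, pasted together into one ICU-compatible pattern.
blanks <- "[[:blank:]]+"
var    <- "\"?[[:alnum:]_]+\"?"
csvar  <- "(?:\"?[[:alnum:]_]+\"?,?)+"
pattern <- paste0("^", blanks, "(", var, ")", blanks, "(", var, ")", blanks, "(", csvar, ")")

# multiline = TRUE lets ^ match at the start of every line, not only the first.
m <- str_match_all(text, regex(pattern, multiline = TRUE))[[1]]

# Drop the whole-match column; the column names are hypothetical.
rows <- data.frame(name = m[, 2], id = m[, 3], values = m[, 4],
                   stringsAsFactors = FALSE)

# Split the comma-separated field into a list column of character vectors.
rows$values <- strsplit(rows$values, ",", fixed = TRUE)
```

Keeping `values` as a list column is one design choice among several; if a flat table is preferred, the unsplit `m[, 4]` strings can be kept as they are.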

Wiktor Stribiżew
  • Nice call on the str_match_all. I was using it before when things were simpler but haven't seen an output to know I was using the wrong function. I don't particularly like the paste0 method as it is a little less clear what is going on with the regex. However it is better than rewriting that code out. That would be a much larger mess – user2716722 Aug 24 '17 at 19:51
  • Yeah, that was what I am doing now. But to answer my own question a bit I did find a regex building library that is an R wrapper for building regular expressions. This is probably the most readable way of going about it. It is called ['rex'](https://cran.r-project.org/web/packages/rex/rex.pdf) I may end up using this but it is all a matter of if I want to learn it or not ;) – user2716722 Aug 25 '17 at 15:25
  • @user2716722 Interesting. There are other libraries that might help I think. As usual, there are always ways to do the same thing in different ways. – Wiktor Stribiżew Aug 25 '17 at 15:58