R regex lookbehind with a long expression

Question

I have a long character string that comes from a PDF extraction. Below is an MWE:

MWE <- "4 BLABLA\r\n Table 1. Real GDP\r\n Percentage changes\r\n 2016 2017 \r\nArgentina -2.5 2.7\r\nAustralia 2.6 2.5\r\n BLABLA \r\n Table 2. Nominal GDP\r\n Percentage changes\r\n 2011 2012\r\nArgentina 31.1 21.1\r\nAustralia 7.7 3.3\r\n"

I want to split this into a list in which each element is a table. I can do that with:

MWE_1 <- as.list(strsplit(MWE, "(?<=[Table\\s+\\d+\\.\\s+(([A-z]|[ \t]))+\\r\\n])"))

> MWE_1
[[1]]
[1] "4 BLABLA\r\n "                                                                                 
[2] " Percentage changes\r\n 2016 2017 \r\nArgentina -2.5 2.7\r\nAustralia 2.6 2.5\r\n BLABLA 5\r\n "
[3] " Percentage changes\r\n 2011 2012\r\nArgentina 31.1 21.1\r\nAustralia 7.7 3.3\r\n"

But I would like to keep the delimiter, which here is a relatively long regular expression. I have looked around a bit, and lookbehinds seem to be a good way to go. However, I do not know how to wrap my long regular expression in one. For instance,

MWE_2 <- as.list(strsplit(MWE, "(?<=[Table\\s+\\d+\\.\\s+(([A-z]|[ \t]))+\\r\\n])"))

yields an error:

invalid regular expression '(?<=[Table\s+\d+\.\s+(([A-z]|[  ]))+\r\n])', reason 'Invalid regexp'

How can I do this in a compact way?

Also, is there a direct way to drop the first element?

You write _I can do that with_ `MWE_1 <- …`, but that is exactly the same expression as in `MWE_2 <- …` and it _yields an error_ just the same, so you cannot have done anything with this expression, let alone obtained the result you show for `MWE_1`. – Armali, Nov 05 '19 at 09:12
Thank you for your input, but have you tried it? This is actually what I get (I added a screenshot of what I get). Judging from the answer given, I guess that in R lookbehind and lookahead are not handled in their standard form by default, which is why you need to specify the `perl=TRUE` option, which then works with `?=`. I am not clear why it does not work with `?<=`, but I can still do what I want without it. – Anthony Martin, Nov 05 '19 at 13:05
Yes, I have tried it. What you show in the screenshot is a different expression (not beginning with `(?<=[`) from the one you give in the question text. – Armali, Nov 06 '19 at 07:09

MonkeyZeus · Answer 1 · 2019-11-05T13:14:35.233

1

Try a lookahead and simplify what you are looking for.

The pattern below already uses R-specific string escaping:

(?=Table \\d+\\.)

Make sure to enable perl=TRUE.

https://regex101.com/r/Cpyu6k/1
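
To make the `perl=TRUE` point concrete, here is a minimal sketch. The toy string `x` and the helper name `splits_ok` are made up for illustration and are not from the question. It only checks whether each regex engine accepts the lookahead pattern at all; it says nothing about how the string ends up being split.

```r
# Hypothetical toy input, not the asker's data.
x <- "intro Table 1. foo Table 2. bar"

# Illustrative helper: TRUE if strsplit() accepts the pattern, FALSE if it errors.
splits_ok <- function(...) {
  tryCatch({
    suppressWarnings(strsplit(...))
    TRUE
  }, error = function(e) FALSE)
}

# Default engine: the lookahead syntax is rejected as an invalid regexp.
default_ok <- splits_ok(x, "(?=Table \\d+\\.)")

# PCRE engine (perl = TRUE): the same pattern is accepted.
perl_ok <- splits_ok(x, "(?=Table \\d+\\.)", perl = TRUE)

print(c(default = default_ok, perl = perl_ok))
```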

edited Nov 05 '19 at 13:14

answered Oct 28 '19 at 18:09

MonkeyZeus


Either I try `MWE_1 <- as.list(strsplit(MWE, "(?=Table \d+\.)"))` and get this error: `Error: '\d' is an unrecognized escape in character string starting ""(?=Table \d"`, or I use a double backslash as in `MWE_1 <- as.list(strsplit(MWE, "(?=Table \\d+\\.)"))` and get another error: `invalid regular expression '(?=Table \d+\.)', reason 'Invalid regexp'` – Anthony Martin Oct 28 '19 at 19:27
@AnthonyMartin Sorry, I'm not familiar with the R language. You should always escape strings per your environment rules. Based on the code samples in your question, `(?=Table \\d+\\.)` should work. – MonkeyZeus Oct 28 '19 at 19:29
@AnthonyMartin Here is an online regex tester for R and it seems to work properly when the backslash is escaped per my last comment. https://spannbaueradam.shinyapps.io/r_regex_tester/ – MonkeyZeus Oct 28 '19 at 19:36
Unfortunately it still does not work; I get the same error as in my question, the second one I mention in my first comment. It also does not seem to work with the site you provided. Or am I missing something obvious? – Anthony Martin Oct 28 '19 at 19:42
@AnthonyMartin I really don't know what to tell you. Here is a [failure screenshot](https://i.stack.imgur.com/nBhD3.png) and a [success screenshot](https://i.stack.imgur.com/iKG0H.png) – MonkeyZeus Oct 28 '19 at 19:46
@AnthonyMartin Try enabling `perl=TRUE` per https://stackoverflow.com/a/21493089/2191572 – MonkeyZeus Oct 28 '19 at 19:53
Well, I had not understood how the website worked. It does indeed work with the `perl=TRUE` option, although it is still not clear to me why. But now I have a five-item list with `MWE_1 <- as.list(strsplit(MWE, "(?=Table \\d+\\.)", perl=TRUE))`. It separates the "T" as an independent item in the list. If I add a space, `MWE_1 <- as.list(strsplit(MWE, "(?= Table \\d+\\.)", perl=TRUE))`, I get the item I want, but still an empty item at list[2] and list[4]. – Anthony Martin Oct 28 '19 at 22:13
@AnthonyMartin `perl=TRUE` probably enables PCRE. It sounds like R supports a limited subset of regular expression by default to boost performance but the lookarounds are probably not supported. I'm not sure why it is splitting it into 5 elements. Based on those regex sites you should only be matching at two positions so your resulting array should have 3 members. – MonkeyZeus Oct 28 '19 at 23:22
@AnthonyMartin Any luck with this? – MonkeyZeus Oct 29 '19 at 14:48
Well I still have my issue with the 5 elements, and not a beginning of a clue why, `> MWE_2 [[1]] [1] "4 BLABLA\r\n" [2] " " [3] "Table 1. Real GDP\r\n Percentage changes\r\n 2016 2017\r\nArgentina -2.5 2.7\r\nAustralia 2.6 2.5\r\n BLABLA \r\n" [4] " " [5] "Table 2. Nominal GDP\r\n Percentage changes\r\n 2011 2012\r\nArgentina 31.1 21.1\r\nAustralia 7.7 3.3\r\n"` but I can delete the empty elements afterwards, so okay-ish although bad practice. Thank you for your help – Anthony Martin Oct 29 '19 at 15:29

score 0 · Answer 2 · answered Nov 06 '19 at 09:54

I am not clear why it does not work with ?<= …

The R documentation page "Regular Expressions as used in R" explains it (your pattern contains the repetition quantifier +):

Patterns (?<=...) and (?<!...) are the lookbehind equivalents: they do not allow repetition quantifiers nor \C in ....
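
A small sketch of that restriction, using a made-up string `x` and an illustrative helper `compiles` (neither is from the question). I would expect the fixed-length lookbehind to be accepted and the one containing `+` to be rejected, but the exact diagnostic depends on the PCRE version R is built against.

```r
# Hypothetical toy input, not the asker's data.
x <- "Table 1. Real GDP"

# Illustrative helper: TRUE if the pattern compiles with perl = TRUE.
compiles <- function(pattern) {
  tryCatch({
    suppressWarnings(grepl(pattern, x, perl = TRUE))
    TRUE
  }, error = function(e) FALSE)
}

# Lookbehind of fixed length ("Table " is always six characters).
fixed_ok <- compiles("(?<=Table )\\d+")

# Lookbehind containing a repetition quantifier (\s+).
variable_ok <- compiles("(?<=Table\\s+)\\d+")

print(c(fixed = fixed_ok, variable = variable_ok))
```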

I still have my issue with the 5 elements, and not a beginning of a clue why,
> MWE_2
[[1]]
[1] "4 BLABLA\r\n"
[2] " "
[3] "Table 1. Real GDP\r\n Percentage changes\r\n 2016 2017\r\nArgentina -2.5 2.7\r\nAustralia 2.6 2.5\r\n BLABLA \r\n"
[4] " "
[5] "Table 2. Nominal GDP\r\n Percentage changes\r\n 2011 2012\r\nArgentina 31.1 21.1\r\nAustralia 7.7 3.3\r\n"
but I can delete the empty elements afterwards…

The elements at index [2] and [4] are not empty: each contains one space. That's because the pattern in strsplit(MWE, "(?= Table \\d+\\.)", perl=TRUE) matches a delimiter of length zero, since it consists solely of a zero-width positive lookahead assertion and no actual delimiter character. strsplit would go into an infinite loop if it strictly followed its documented algorithm:

    repeat {
        if the string is empty
            break.
        if there is a match
            add the string to the left of the match to the output.
            remove the match and all to the left of it.
        else
            add the string to the output.
            break.
    }

However, there is this special handling in its code:

            /* Match was empty. */
            pt[0] = *bufp;
            pt[1] = '\0';
            bufp++;

This causes one character at the position of an empty match to be returned (the space in your case) and the search to be continued after it.
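
The effect is easy to reproduce on a toy string (`"ab cd"` is made up for illustration, not taken from the question): the pattern consists only of a lookahead for a space, so the space itself comes back as a one-character element.

```r
# Pattern is only a zero-width lookahead, so the match has length zero.
pieces <- strsplit("ab cd", "(?= )", perl = TRUE)[[1]]

print(pieces)         # the middle element is the single space
print(nchar(pieces))  # lengths of the three elements
```

This mirrors the `" "` elements at [2] and [4] in the output quoted above.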

The solution is simple: Don't use only a zero-width assertion as the pattern; instead, change it slightly by moving the delimiting space out of the assertion:

strsplit(MWE, " (?=Table \\d+\\.)", perl=TRUE)
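
Put together with the MWE from the question, a sketch of the whole thing could look like this. The variable names `parts` and `tables` are mine, and dropping the first element with `[-1]` assumes that the text before the first "Table" heading is always unwanted preamble.

```r
MWE <- "4 BLABLA\r\n Table 1. Real GDP\r\n Percentage changes\r\n 2016 2017 \r\nArgentina -2.5 2.7\r\nAustralia 2.6 2.5\r\n BLABLA \r\n Table 2. Nominal GDP\r\n Percentage changes\r\n 2011 2012\r\nArgentina 31.1 21.1\r\nAustralia 7.7 3.3\r\n"

# The space is consumed as the delimiter; the lookahead keeps "Table n." in the piece.
parts <- strsplit(MWE, " (?=Table \\d+\\.)", perl = TRUE)[[1]]

# Assumption: the first piece is preamble ("4 BLABLA\r\n"), so drop it.
tables <- as.list(parts[-1])

print(length(tables))
print(substr(unlist(tables), 1, 20))
```

The negative index `[-1]` is the direct way to discard the first element that the question asks about.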
