I have a long character string that comes from a PDF extraction. Below is an MWE:
MWE <- "4 BLABLA\r\n Table 1. Real GDP\r\n Percentage changes\r\n 2016 2017 \r\nArgentina -2.5 2.7\r\nAustralia 2.6 2.5\r\n BLABLA \r\n Table 2. Nominal GDP\r\n Percentage changes\r\n 2011 2012\r\nArgentina 31.1 21.1\r\nAustralia 7.7 3.3\r\n"
I want to separate this into a list, with each element being a table. I can do that with:
MWE_1 <- as.list(strsplit(MWE, "Table\\s+\\d+\\.\\s+(([A-z]|[ \t]))+\\r\\n"))
> MWE_1
[[1]]
[1] "4 BLABLA\r\n "
[2] " Percentage changes\r\n 2016 2017 \r\nArgentina -2.5 2.7\r\nAustralia 2.6 2.5\r\n BLABLA \r\n "
[3] " Percentage changes\r\n 2011 2012\r\nArgentina 31.1 21.1\r\nAustralia 7.7 3.3\r\n"
But I would like to keep the delimiter, which here is a relatively long regular expression.
I have looked around a bit, and lookbehinds seem to be a good way to go. However, I do not know how to combine a lookbehind with my long regular expression. For instance,
MWE_2 <- as.list(strsplit(MWE, "(?<=[Table\\s+\\d+\\.\\s+(([A-z]|[ \t]))+\\r\\n])"))
yields an error:
invalid regular expression '(?<=[Table\s+\d+\.\s+(([A-z]|[ ]))+\r\n])', reason 'Invalid regexp'
How can I do this in a compact way?
Also, is there a direct way to drop the first element?
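For reference, here is a minimal sketch of one possible approach that avoids lookbehinds altogether. As far as I understand, `strsplit` uses the TRE engine by default, which does not support lookarounds unless `perl = TRUE` is set, and PCRE lookbehinds generally have to be of bounded length, so a pattern containing `+` would not fit inside one anyway. The sketch instead locates the start of each header with `gregexpr` and cuts the string with `substring`. The names `starts`, `ends` and `tables` are only illustrative, and the code assumes that at least one "Table n." header is present (otherwise `gregexpr` returns -1).

```r
MWE <- "4 BLABLA\r\n Table 1. Real GDP\r\n Percentage changes\r\n 2016 2017 \r\nArgentina -2.5 2.7\r\nAustralia 2.6 2.5\r\n BLABLA \r\n Table 2. Nominal GDP\r\n Percentage changes\r\n 2011 2012\r\nArgentina 31.1 21.1\r\nAustralia 7.7 3.3\r\n"

# Character positions where each "Table <n>." header begins
# (as.integer() drops the attributes that gregexpr() attaches).
starts <- as.integer(gregexpr("Table\\s+\\d+\\.", MWE, perl = TRUE)[[1]])

# Each table runs up to the character just before the next header;
# the last table runs to the end of the string.
ends <- c(starts[-1] - 1L, nchar(MWE))

# One list element per table, header included.
# Anything before the first header is never selected, so it is dropped.
tables <- as.list(substring(MWE, starts, ends))
```

With this layout each element of `tables` starts with its own "Table n." header, and the leading text ("4 BLABLA...") is left out without an extra step. Trailing text such as the " BLABLA " line still ends up at the end of the preceding table and would need separate trimming.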