I have XML files of parliamentary protocols, from which I want to extract all of the interruptions mentioned. The interruptions are marked by parentheses, like this:
Text I don't care about.
(applause from the right)
Text I don't care about.
I was given this code, which seemed to work just fine:
library(XML)     # xmlParse(), xmlToList()
library(tibble)  # enframe()
files <- as.list(dir(pattern = ".xml"))
my_list <- lapply(files, function(x) xmlToList(xmlParse(x)))
my_list2 <- lapply(my_list, function(x) enframe(regmatches(x[["TEXT"]],
  gregexpr("(?=\\().*?(?<=\\))", x[["TEXT"]], perl = TRUE))[[1]]))
This way I got only the (applause from the right), as intended. But I have now realised that this code apparently only matches within a single line, and some of my interruptions span multiple lines (1 to 3), like this:
Text I don't care about.
(applause from the right and
from the left)
Text I don't care about.
If the interruption is in this format, I get no results. How do I have to change the gregexpr pattern so that it matches an interruption on one line, but also one spanning multiple lines, up to the corresponding ")"? I've been trying \n, but so far no luck.
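To make the problem reproducible without my XML files, here is a minimal sketch. It assumes that x[["TEXT"]] is a single string with embedded "\n" line breaks (I have not verified that this is how every file is parsed), and the two toy strings and the helper extract_interruptions() are made up for illustration. It also includes one candidate pattern using a negated character class, which seems to cross line breaks in this toy case, but I am not sure it is the right approach for the real protocols (nested parentheses, for example, would not be handled).

```r
# Toy strings standing in for x[["TEXT"]] (assumption: one string with "\n" inside)
one_line   <- "Text I don't care about.\n(applause from the right)\nText I don't care about."
multi_line <- "Text I don't care about.\n(applause from the right and\nfrom the left)\nText I don't care about."

# Pattern from the code I was given: with perl = TRUE, "." does not match "\n"
old_pattern <- "(?=\\().*?(?<=\\))"

# Candidate pattern: "(", then any characters except ")" (line breaks included), then ")"
new_pattern <- "\\([^)]*\\)"

# Hypothetical helper: return all matches of `pattern` in a single string
extract_interruptions <- function(text, pattern) {
  regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
}

old_one   <- extract_interruptions(one_line, old_pattern)    # finds the interruption
old_multi <- extract_interruptions(multi_line, old_pattern)  # finds nothing
new_one   <- extract_interruptions(one_line, new_pattern)
new_multi <- extract_interruptions(multi_line, new_pattern)

print(old_one)
print(old_multi)
print(new_one)
print(new_multi)
```

In this toy case the candidate pattern returns the multi-line interruption as one string that still contains the line break, so it might need cleaning afterwards, for instance with gsub("\n", " ", ...).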
Thanks in advance
Edit
To explain further: I am looking at several hundred protocols (each one has its own XML file), each with several hundred of these interruptions. So I am specifically looking for a solution that extracts them all with the same code. A solution close to the code I used before would be especially helpful, since I am still fairly new to R.