
I have XML files of parliamentary protocols, from which I want to extract all of the interruptions mentioned. The interruptions are marked by parentheses, like this:

Text I don't care about. 
(applause from the right)
Text I don't care about. 

I was given this code, which seemed to work just fine:

library(XML)     # xmlParse(), xmlToList()
library(tibble)  # enframe()

files <- as.list(dir(pattern = "\\.xml$"))
my_list <- lapply(files, function(x) xmlToList(xmlParse(x)))
my_list2 <- lapply(my_list, function(x) enframe(regmatches(x[["TEXT"]],
                               gregexpr("(?=\\().*?(?<=\\))", x[["TEXT"]], perl=TRUE))[[1]]))

With this I got (applause from the right) as expected, but I have now realised that the code apparently only matches within a single line, and some interruptions span multiple lines (1 to 3), like this:

Text I don't care about.
(applause from the right and
 from the left)
Text I don't care about.

If the interruption is in this format, I get no results. How do I have to change the gregexpr call so that it matches within one line but also across multiple lines, until the corresponding ")" is found? I've been trying \n, but so far no luck.

Thanks in advance

Edit

To further explain myself: I am looking at several hundred protocols (each one has its own XML file), each with several hundred of these interruptions. So I am specifically looking for a solution that extracts them all with the same code. A solution close to the code I used before would be extra helpful, since I am still fairly new to R.

1 Answer

Here is one way.

Sample2 = "Text I don't care about.
(applause from the right and
 from the left)
Text I don't care about."

sub(".*\\((.*?)\\).*", "\\1", Sample2)
[1] "applause from the right and\n from the left"
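Note that sub() returns a single result per string. If a text contains several interruptions, a sketch along the following lines may be closer to what the question needs. It is only an illustration: extract_interruptions and sample_text are hypothetical names, and the pattern assumes that interruptions never contain nested parentheses. The negated class [^)] matches any character except ")", including newlines, so a match can span several lines.

```r
# Hypothetical helper: return every "(...)" span in a single string.
# [^)]* matches anything except a closing parenthesis, newlines included.
extract_interruptions <- function(text) {
  regmatches(text, gregexpr("\\([^)]*\\)", text, perl = TRUE))[[1]]
}

# Made-up sample with one multi-line and one single-line interruption.
sample_text <- "Text I don't care about.
(applause from the right and
 from the left)
Text I don't care about.
(laughter)
Text I don't care about."

interruptions <- extract_interruptions(sample_text)
print(interruptions)
```

The result keeps the surrounding parentheses and the embedded line break; both can be stripped afterwards with gsub() if they are not wanted.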
G5W
  • Thank you! Two questions about this: 1. What does the "\\1" do? The wiki says it's for replacement, but what is it replacing? 2. Is there a way to integrate your code into the code I used before, i.e. in combination with lapply? – Philippe_R Feb 29 '20 at 20:02
  • "\\1" is replacing the entire string with only the part that is inside the parentheses. I don't know what your data looks like and I don't really understand what you are trying to do with your code, so it is hard to fit this solution into your code. If you just want to extract the parts inside parentheses, my code will do that. Just replace Sample2 with the list of all the strings that you want to process. – G5W Feb 29 '20 at 20:10
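As a rough sketch of how the lapply question in the comments might be approached, the original gregexpr call can be kept almost unchanged by adding the (?s) flag, which lets "." match newlines when perl = TRUE. The my_list below is a made-up stand-in for the output of xmlToList(); the element name TEXT is taken from the question's code, and everything else is an assumption.

```r
# Stand-in for the parsed XML files: each element has a TEXT entry (assumed).
my_list <- list(
  list(TEXT = "Intro.\n(applause from the right and\n from the left)\nMore.\n(laughter)"),
  list(TEXT = "No interruptions here.")
)

my_list2 <- lapply(my_list, function(x) {
  txt <- x[["TEXT"]]
  # (?s) makes "." match newlines, so a lazy match can cross line breaks.
  hits <- regmatches(txt, gregexpr("(?s)\\(.*?\\)", txt, perl = TRUE))[[1]]
  # Collapse line breaks and indentation inside each match to one space.
  gsub("\\s+", " ", hits, perl = TRUE)
})

print(my_list2)
```

Each element of my_list2 is a character vector of the interruptions found in one protocol, or an empty vector if there were none; it could be wrapped in enframe() as in the question if a tibble is preferred.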