I am parsing FDA drug labels that are accessed from DailyMed through the API at https://dailymed.nlm.nih.gov/dailymed/services/v2/spls/.
A sample label in XML is here: https://dailymed.nlm.nih.gov/dailymed/services/v2/spls/2d04c305-70d6-4a99-8c82-720c5398a46c.xml
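For reference, here's how I'm pulling a single label into R. I believe xml2's read_xml() accepts a URL directly, so this is a minimal sketch:

library(xml2)     # packages used throughout below
library(stringr)

doc <- read_xml("https://dailymed.nlm.nih.gov/dailymed/services/v2/spls/2d04c305-70d6-4a99-8c82-720c5398a46c.xml")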
My task in R is to locate certain keywords in thousands of drug labels like this and extract the text around the keywords for further processing. Ideally, I'd get the individual sentences containing the keywords.
My method so far is to use XPath expressions for the higher-level nodes of interest, grab the text with xml_text(), and then use str_split() to create a list of sentences:
chunk_of_text <- xml_text(node)  # node located earlier via xml_find_first() and an XPath
list_of_sentences <- str_split(chunk_of_text, pattern = boundary("sentence"))
This works pretty well, except when there is zero whitespace between sentences (e.g., "Sentence oneSentence two" or "Sentence one.Sentence two."). The lack of whitespace happens because xml_text() concatenates everything, and whitespace is often missing after section headings, paragraphs, etc. I've tried replacing every "." with ". " before the str_split(), but that breaks numbers with decimal points.
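A narrower workaround I've sketched only inserts a space when the period is immediately followed by a letter, which leaves decimals alone; it still does nothing for the no-punctuation case like "Sentence oneSentence two":

# Insert a space only where a period is directly followed by a letter,
# so decimals like "2.5 mg" are untouched.
fixed_text <- str_replace_all(chunk_of_text, "\\.(?=[A-Za-z])", ". ")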
The other option I see is to extract the text WITHOUT calling xml_text() on a big node, since that call is what slams all the text together.
These XML labels do follow an expected schema, but the structure is not identical from label to label. I have to look for specific tags to get close, but the keywords could be anywhere within the large chunk of XML below that.
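For concreteness, here's roughly how I get close to a section of interest (a sketch: the title text is a placeholder, and I strip the HL7 namespace that SPL documents use so the XPath stays simple):

xml_ns_strip(doc)   # SPL labels carry an HL7 namespace; stripping it simplifies XPaths

# Placeholder XPath: sections whose <title> contains a term of interest.
sections <- xml_find_all(doc, "//section[contains(title, 'WARNINGS')]")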
Is there a way to start with a path and extract all the text under that path into a list, a structure, or a single whitespace-separated string? I DON'T need to extract the structure or even understand it - I just need all the text as individual sentences (or phrases for headers, etc.), and then I can keyword-search each sentence.
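Something like this sketch is what I have in mind, assuming a per-text-node approach is sound (section here is one node from the search above, and "warfarin" is a placeholder keyword):

section <- sections[[1]]

# Pull every text() node under the section individually, so text from
# adjacent elements is never glued together, then rejoin with spaces.
pieces <- xml_text(xml_find_all(section, ".//text()"))
pieces <- str_squish(pieces)      # normalize internal whitespace
pieces <- pieces[pieces != ""]    # drop whitespace-only nodes

full_text <- str_c(pieces, collapse = " ")
sentences <- str_split(full_text, boundary("sentence"))[[1]]

# Keyword search on the individual sentences.
hits <- sentences[str_detect(sentences, regex("warfarin", ignore_case = TRUE))]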
Thanks!