I am parsing FDA drug labels that are accessed from DailyMed through the API at https://dailymed.nlm.nih.gov/dailymed/services/v2/spls/.
A sample label in XML is here: https://dailymed.nlm.nih.gov/dailymed/services/v2/spls/2d04c305-70d6-4a99-8c82-720c5398a46c.xml
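For reference, here's how I'm pulling a single label into R. I believe xml2's read_xml() accepts a URL directly, so this is a minimal sketch:

library(xml2)     # packages used throughout below
library(stringr)

doc <- read_xml("https://dailymed.nlm.nih.gov/dailymed/services/v2/spls/2d04c305-70d6-4a99-8c82-720c5398a46c.xml")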
My task in R is to locate certain keywords in thousands of drug labels like this and extract the text around the keywords for further processing. Ideally, I'd get the individual sentences containing the keywords.
My method so far is to use XPath expressions for the higher-level nodes of interest, grab the text with xml_text(), and then use str_split() to create a list of sentences:
chunk_of_text <- xml_text(node)  # node located earlier via xml_find_first() and an XPath
list_of_sentences <- str_split(chunk_of_text, pattern = boundary("sentence"))
This works pretty well, except when there is zero whitespace between sentences (e.g., "Sentence oneSentence two" or "Sentence one.Sentence two."). The lack of whitespace happens because xml_text() concatenates everything, and whitespace is often missing after section headings, paragraphs, etc. I've tried replacing every "." with ". " before the str_split(), but that breaks numbers with decimal points.
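A narrower workaround I've sketched only inserts a space when the period is immediately followed by a letter, which leaves decimals alone; it still does nothing for the no-punctuation case like "Sentence oneSentence two":

# Insert a space only where a period is directly followed by a letter,
# so decimals like "2.5 mg" are untouched.
fixed_text <- str_replace_all(chunk_of_text, "\\.(?=[A-Za-z])", ". ")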
The other option I see is to extract the text WITHOUT calling xml_text() on a big node, since that call is what slams all the text together.
These XML labels do follow an expected schema, but the structure is not identical from label to label. I have to look for specific tags to get close, but the keywords could be anywhere within the large chunk of XML below that.
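For concreteness, here's roughly how I get close to a section of interest (a sketch: the title text is a placeholder, and I strip the HL7 namespace that SPL documents use so the XPath stays simple):

xml_ns_strip(doc)   # SPL labels carry an HL7 namespace; stripping it simplifies XPaths

# Placeholder XPath: sections whose <title> contains a term of interest.
sections <- xml_find_all(doc, "//section[contains(title, 'WARNINGS')]")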
Is there a way to start with a path and extract all the text under that path into a list, a structure, or a single whitespace-separated string? I DON'T need to extract the structure or even understand it - I just need all the text as individual sentences (or phrases for headers, etc.), and then I can keyword-search each sentence.
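Something like this sketch is what I have in mind, assuming a per-text-node approach is sound (section here is one node from the search above, and "warfarin" is a placeholder keyword):

section <- sections[[1]]

# Pull every text() node under the section individually, so text from
# adjacent elements is never glued together, then rejoin with spaces.
pieces <- xml_text(xml_find_all(section, ".//text()"))
pieces <- str_squish(pieces)      # normalize internal whitespace
pieces <- pieces[pieces != ""]    # drop whitespace-only nodes

full_text <- str_c(pieces, collapse = " ")
sentences <- str_split(full_text, boundary("sentence"))[[1]]

# Keyword search on the individual sentences.
hits <- sentences[str_detect(sentences, regex("warfarin", ignore_case = TRUE))]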
Thanks!