Using readtext to extract text from XML

Question

I am not used to working with XML files but need to extract text from various fields in XML files. Specifically, I've downloaded and saved XML files like the following: https://www.federalregister.gov/documents/full_text/xml/2007/09/18/07-4595.xml. I'm interested in the text within the tag "regtext" in this and other similar XML files.

I've downloaded the XML files and stored them on my computer, but when I set the directory and attempt to use the readtext package to read from the XML files, I get the following error:

regtext <- readtext("/regdata/RegDataValidation", text_field = "regtext")
Error in doc_parse_file(con, encoding = encoding, as_html = as_html, options = options) : 
  Start tag expected, '<' not found [4]

I've tried searching for the error, but nothing I've come across has helped me figure out what might be going on. This basic command works like a charm on any number of other document types, including .csv or .docx, but for some reason it just doesn't seem to recognize the files I'm trying to work with here. Any pointers would be much appreciated. I'm too much of a novice, and none of the readtext documentation gives examples of how to work with XML.

Pursuant to comments below, I've also tried to specify a single saved XML file, as follows:

> regtext <- readtext("/regdata/RegDataValidation/0579- AC01.xml", text_field = "regtext")
Error in xml2_to_dataframe(xml) : 
  The xml format does not fit for the extraction without xPath
  Use xPath method instead
In addition: There were 50 or more warnings (use warnings() to see the first 50)

I tried to specify an XPath expression on a single file. This did not return any errors, but it also didn't extract any text (even though there should be plenty of text within the "regtext" node):

> regtext <- readtext("/regdata/RegDataValidation/0579- AC01.xml", text_field = "/regtext/*")

I end up with a dataframe with the correct doc_id, but no text.

Do you have files of different types in the same folder? It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input that can be used to test and verify possible solutions. Maybe create a new folder and put files in one at a time till you trigger the same error. Sounds like there's an invalid XML file in there. — MrFlick, May 26 '21 at 20:40
In your original question you have `text_field = "regtext"` and here you have `text_field = ""`. The documentation says you should be supplying an Xpath expression and neither of those are an XPath expression. It's not clear what your files look like or what text you are trying to extract exactly. Sorry, it's not really possible for me to help further without a reproducible example. — MrFlick, May 26 '21 at 21:09
I'm sorry, I'm very novice at this stuff and that was a mistake adding the brackets. I'm not even sure what an XPath expression is, but perhaps it would be easier to show what I'm looking for on the URL path to an XML I would like to analyze: https://www.federalregister.gov/documents/full_text/xml/2007/09/18/07-4595.xml. What I'm trying to do is tell readtext to read the text within the "" tag. — Dan Walters, May 26 '21 at 21:22

score 0 · Accepted Answer · answered May 27 '21 at 00:53

From the error messages, readtext appears to be converting the XML file into a plain-text document, which the XML parser then rejects as an invalid document.

XML element names are case-sensitive, so the parser most likely treats "regtext" and "REGTEXT" as different names; the file uses the uppercase form.

Here is a solution using the xml2 package. (I find this package provides a simpler interface and is easier to use.)

library(xml2)

url <- "https://www.federalregister.gov/documents/full_text/xml/2007/09/18/07-4595.xml"
page <- read_xml(url)

# Find all nodes named "REGTEXT" anywhere in the document
regtext <- xml_find_all(page, ".//REGTEXT")

# Convert the REGTEXT nodes into a character vector, one string per node
xml_text(regtext)
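
To illustrate the case-sensitivity point, here is a minimal sketch that runs the same XPath query in lowercase and uppercase against a tiny inline document. The document content is made up for the example and only stands in for a real Federal Register file; it assumes `read_xml()` is given a literal XML string, which xml2 accepts.

```r
library(xml2)

# Made-up inline document standing in for a downloaded file
doc <- read_xml(
  "<RULE><REGTEXT><P>First paragraph.</P></REGTEXT><REGTEXT><P>Second paragraph.</P></REGTEXT></RULE>"
)

# The lowercase name matches nothing, because XPath compares element names case-sensitively
lower_hits <- xml_find_all(doc, ".//regtext")
length(lower_hits)

# The uppercase name matches both REGTEXT elements
upper_hits <- xml_find_all(doc, ".//REGTEXT")
xml_text(upper_hits)
```

A query that returns an empty node set produces no error, which may be why a lowercase or mismatched expression appears to run cleanly but yields no text.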

answered May 27 '21 at 00:53

Dave2e


Thanks, this definitely does recognize the text. Is there a way to use this package to automatically process an entire folder of saved XML's and save them as separate R objects? I gravitated toward readtext because this was possible, although it didn't work for the reasons you pointed out. – Dan Walters May 27 '21 at 11:37
@DanWalters, The best way to handle multiple documents is to create a loop. Read a document, process the document, save it and proceed to the next one. – Dave2e May 28 '21 at 01:26
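
The loop suggested in the comment above could look roughly like the following. This is only a sketch: `read_regtext_dir` is a hypothetical helper name, not part of xml2 or readtext, and it assumes every file of interest ends in `.xml` and keeps its regulatory text in `REGTEXT` elements. Files that fail to parse are recorded as `NA` rather than stopping the loop, since one invalid file in the folder may be what triggered the original error.

```r
library(xml2)

# Hypothetical helper: returns a named list with one character vector per XML file
read_regtext_dir <- function(dir) {
  files <- list.files(dir, pattern = "\\.xml$", full.names = TRUE, ignore.case = TRUE)
  out <- vector("list", length(files))
  names(out) <- basename(files)
  for (i in seq_along(files)) {
    # Read one document; return NULL instead of failing if it is not valid XML
    doc <- tryCatch(read_xml(files[i]), error = function(e) NULL)
    # Process it: keep the text of every REGTEXT node, or NA if parsing failed
    out[[i]] <- if (is.null(doc)) NA_character_ else xml_text(xml_find_all(doc, ".//REGTEXT"))
  }
  out
}
```

Calling `regtexts <- read_regtext_dir("/regdata/RegDataValidation")` would then give one list element per file, accessible by file name, which avoids creating a separate R object for each document.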
