I am not used to working with XML files but need to extract text from various fields in XML files. Specifically, I've downloaded and saved XML files like the following: https://www.federalregister.gov/documents/full_text/xml/2007/09/18/07-4595.xml. I'm interested in the text within the tag "regtext" in this and other similar XML files.
I've downloaded the XML files and stored them on my computer, but when I set the directory and attempt to use the readtext package to read from the XML files, I get the following error:
regtext <- readtext("/regdata/RegDataValidation", text_field = "regtext")
Error in doc_parse_file(con, encoding = encoding, as_html = as_html, options = options) :
Start tag expected, '<' not found [4]
I've tried to search the error, but nothing I've come across has helped me figure out what might be going on. This basic command works like a charm on any number of other document types, including .csv or .docx, but for some reason it just doesn't seem to recognize the files I'm trying to work with here. Any pointers would be much appreciated--I'm too much of a novice and all of the documentation on readtext does not give examples of how to work with XML.
Pursuant to comments below, I've also tried to specify a single saved XML file, as follows:
> regtext <- readtext("/regdata/RegDataValidation/0579- AC01.xml", text_field = "regtext")
Error in xml2_to_dataframe(xml) :
The xml format does not fit for the extraction without xPath
Use xPath method instead
In addition: There were 50 or more warnings (use warnings() to see the first 50)
I tried to specify an xPath expression on a single file, and this did not return any errors, but didn't actually extract any text (even though there should be plenty of text within the "regtext" node:
> regtext <- readtext("/regdata/RegDataValidation/0579- AC01.xml", text_field = "/regtext/*")
I end up with a dataframe with the correct doc_id, but no text.