Is there any way to parse XML with "&" or "<" or ">" in the data in R

Question

I have some data in XML files and I'm getting the error "Error: 1: xmlParseEntityRef: no name". I've narrowed it down to some XML files having "&" or "<" or ">" in the data. For example, there is one where the XML is:

...<instruc>count the number of words & letters</instruc>...
...<instruc>if the number of letters per word > 6</instruc>...

I've been using the XML package and xmlParse. Is there any way I can read in this file and treat the 'bad' characters as just text?

Thanks!

One way to handle this is to `gsub` them, replacing with their html equivalent. This may be hard to write a proper regex to handle them. — stanekam, Jun 18 '14 at 21:16
Good idea. How can I gsub a "<" without changing all the tags? — ThatGuy, Jun 18 '14 at 21:17
Well that depends on your data haha. I'm not a regex expert. — stanekam, Jun 18 '14 at 21:17
I suggest asking that as a new question. Some proper regex people can help you out. — stanekam, Jun 18 '14 at 21:18
This really isn't valid XML. Do you have no control over how it's generated? It's always best to try to work with clean data if possible. — MrFlick, Jun 18 '14 at 21:23
I had no control over it, sadly. You can choose your friends but not what you inherit. — ThatGuy, Jun 18 '14 at 21:24
@iShouldUseAName -- see http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — mnel, Jun 18 '14 at 23:47
@mnel I'm not suggesting parsing an XML document with regex. — stanekam, Jun 19 '14 at 00:32
Please don't call it XML when it isn't. You'll only confuse people. You don't have some XML files, you have some non-XML files that you want to turn into XML. — Michael Kay, Jun 19 '14 at 06:35

ThatGuy · Accepted Answer · 2014-06-18T22:22:22.997

Thanks to Duncan Temple Lang, the author of the XML package:

1) Use xmlParseDoc(). This will parse the XML and remove the 'bad' characters.

2) Use htmlParse(). The resulting document will contain the &amp; corresponding to the offending &, but the document will also have <html> and <body> nodes, and the real document will be a child of the <body> node.
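
A minimal sketch of both options, assuming the XML package is installed. The input string is a made-up example modelled on the snippets in the question, wrapped in a hypothetical <doc> root. What the recovered text looks like may depend on the underlying libxml2 version, so inspect the result rather than relying on a particular output:

```r
library(XML)

# Hypothetical malformed input: a bare "&" inside the text
txt <- "<doc><instruc>count the number of words & letters</instruc></doc>"

# Option 1: xmlParseDoc() reads the string and tries to recover from the error
doc1 <- xmlParseDoc(txt, asText = TRUE)
xpathSApply(doc1, "//instruc", xmlValue)

# Option 2: htmlParse() is more forgiving, but wraps the content in
# <html><body>, so the original nodes sit underneath <body>
doc2 <- htmlParse(txt, asText = TRUE)
xpathSApply(doc2, "//body//instruc", xmlValue)
```

Comparing the two xmlValue results shows whether a given parser kept or dropped the offending character for your data.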

Neither approach requires changing any symbols in the file, though that doesn't preclude cleaning up the XML with a gsub() on the file read in as plain text.
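
For the gsub() route, here is a rough sketch in base R that escapes only bare ampersands, meaning those that do not already start an entity such as &amp; or a character reference such as &#38;. The helper name escape_bare_amp is made up for illustration. It deliberately leaves "<" alone, since telling a stray "<" apart from a real tag is the hard part raised in the comments:

```r
# Hypothetical helper: replace "&" with "&amp;" unless it already begins
# a named entity (&name;) or a numeric reference (&#38; or &#x26;)
escape_bare_amp <- function(x) {
  gsub("&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)",
       "&amp;", x, perl = TRUE)
}

raw <- "<instruc>count the number of words & letters</instruc>"
fixed <- escape_bare_amp(raw)
fixed
# "<instruc>count the number of words &amp; letters</instruc>"
```

To apply it to a file, the same function could be run over the lines returned by readLines() before handing the text to the parser with asText = TRUE.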
