
I have some data in XML files, and parsing them fails with "Error: 1: xmlParseEntityRef: no name". I've narrowed it down to files that contain a literal "&", "<" or ">" in the data. For example, one file contains:

...<instruc>count the number of words & letters</instruc>...
...<instruc>if the number of letters per word > 6</instruc>...

I've been using the XML package and xmlParse. Is there any way I can read in this file and treat the 'bad' characters as just text?

Thanks!

ThatGuy
  • One way to handle this is to `gsub` them, replacing with their html equivalent. This may be hard to write a proper regex to handle them. – stanekam Jun 18 '14 at 21:16
  • Good idea. How can I gsub a "<" without changing all the tags? – ThatGuy Jun 18 '14 at 21:17
  • Well that depends on your data haha. I'm not a regex expert. – stanekam Jun 18 '14 at 21:17
  • I suggest asking that as a new question. Some proper regex people can help you out. – stanekam Jun 18 '14 at 21:18
  • This really isn't valid XML. Do you have no control over how it's generated? It's always best to try to work with clean data if possible. – MrFlick Jun 18 '14 at 21:23
  • I had no control over it, sadly. You can choose your friends but not what you inherit. – ThatGuy Jun 18 '14 at 21:24
  • @iShouldUseAName -- see http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – mnel Jun 18 '14 at 23:47
  • @mnel I'm not suggesting parsing an XML document with regex. – stanekam Jun 19 '14 at 00:32
  • Please don't call it XML when it isn't. You'll only confuse people. You don't have some XML files, you have some non-XML files that you want to turn into XML. – Michael Kay Jun 19 '14 at 06:35

1 Answer

Thanks to Duncan Temple Lang, the author of the XML package, there are two options:

1) Use xmlParseDoc(). This will parse the XML and remove the 'bad' characters.

2) Use htmlParse(). The resulting document keeps the offending & (serialized as &amp;), but the parser also wraps everything in <html> and <body> nodes, so the real document ends up as a child of the <body> node.
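
For illustration, here is a minimal sketch of both options. The sample text, the temporary file, and the `<doc>` wrapper element are made up for the example. As far as I can tell, xmlParseDoc() runs libxml2 in recovery mode by default, which would explain why it gets past the error; treat that as an assumption and check what each parser actually returns for your own files.

```r
library(XML)

# Hypothetical sample: one of the problem lines, wrapped in a made-up <doc> root
txt <- "<doc><instruc>count the number of words & letters</instruc></doc>"
tmp <- tempfile(fileext = ".xml")
writeLines(txt, tmp)

# Option 1: lenient XML parse (parser messages may still be printed)
doc_xml <- xmlParseDoc(tmp)
xml_vals <- xpathSApply(doc_xml, "//instruc", xmlValue)

# Option 2: HTML parser; the original tree is nested under <html><body>
doc_html <- htmlParse(tmp)
html_vals <- xpathSApply(doc_html, "//body//instruc", xmlValue)

print(xml_vals)
print(html_vals)
```

Comparing the two printed values shows whether a given parser dropped or kept the stray characters.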

Neither approach requires changing the symbols in the file. You can still fix the file itself, though, by reading it as plain text and running gsub() on it.
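
Here is a rough sketch of the gsub() route, which also touches on the question from the comments about escaping "<" without changing the tags. The helper names `escape_bare_amp` and `escape_stray_lt` are hypothetical, and both regexes are heuristics that assume a stray "<" is never immediately followed by a letter, "/", "!" or "?". Note that a bare ">" in text is already legal XML, so only "&" and "<" need escaping.

```r
# Hypothetical helper: escape "&" unless it already starts an entity or
# character reference such as &amp; or &#38; or &#x26;
escape_bare_amp <- function(x) {
  gsub("&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)",
       "&amp;", x, perl = TRUE)
}

# Hypothetical helper: escape "<" unless it looks like the start of a tag,
# comment, CDATA section, or processing instruction (heuristic only)
escape_stray_lt <- function(x) {
  gsub("<(?![A-Za-z_/!?])", "&lt;", x, perl = TRUE)
}

# Made-up sample lines in the style of the question
raw <- c("<doc>",
         "<instruc>count the number of words & letters</instruc>",
         "<instruc>if the number of letters per word < 6</instruc>",
         "</doc>")

fixed <- escape_stray_lt(escape_bare_amp(raw))
print(fixed)

# Optional check that the result now parses, if the XML package is installed
if (requireNamespace("XML", quietly = TRUE)) {
  doc <- XML::xmlParse(paste(fixed, collapse = "\n"), asText = TRUE)
}
```

For a real file, `raw` would come from readLines() on the file, and `fixed` could be written back with writeLines() before calling xmlParse() as usual.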

ThatGuy