Hello R's XML package users,

I am encountering a weird bug while parsing XML files. It occurs whenever the parser encounters HTML entities such as &mdash; and &ndash;.

This is the code I use:

library(XML)
# Read the whole file, then parse it and extract the text of the root node
InText = readLines(xmlFileName, n = -1)
Text = xmlValue(xmlRoot(xmlTreeParse(InText, trim = FALSE)))

I am currently eliminating entities such as mdash and ndash with the following:

# Match the full entity, including the trailing semicolon
InText = gsub("&mdash;", " ", InText, fixed = TRUE)
InText = gsub("&ndash;", " ", InText, fixed = TRUE)

But this can get really tedious, given how long the list of HTML 4.0 entities is.

Any ideas on how I can eliminate these while parsing the XML files?

Thanks a lot for your help and ideas,
Shivani

2 Answers

If you simply want to remove every entity reference (anything of the form &...;, which also covers valid XML entities such as &amp;), use a regex:

library("XML")

InText <- "<html>\
<head>\
    <title>Test &amp; Test again</title>\
</head>\
    <body>Hello &nbsp; world</body>\
</html>"

InText <- gsub("\\&[^;]+;", "", InText)

Text <- xmlValue(xmlRoot(xmlTreeParse(InText, trim = FALSE)))
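
If the five entities that XML itself predefines (&amp;, &lt;, &gt;, &quot;, &apos;) should survive, a narrower pattern may be preferable. The following is only a sketch using base R: the helper name strip_html_entities and its replacement argument are made up for illustration, and the pattern assumes entity names consist of ASCII letters and digits.

```r
# Hypothetical helper (not part of the XML package): remove named entities
# that XML does not predefine, e.g. &mdash;, &ndash;, &nbsp;.
strip_html_entities <- function(x, replacement = "") {
  # The negative lookahead leaves &amp; &lt; &gt; &quot; &apos; untouched.
  # Numeric references such as &#8212; do not match and are kept as well.
  pattern <- "&(?!(?:amp|lt|gt|quot|apos);)[A-Za-z][A-Za-z0-9]*;"
  gsub(pattern, replacement, x, perl = TRUE)
}

InText <- "<body>Tom &amp; Jerry &mdash; 1940&ndash;1958</body>"
cleaned <- strip_html_entities(InText, " ")
print(cleaned)
```

The cleaned string can then be passed to xmlTreeParse exactly as in the code above.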
daedalus

Try the HTML parser in the XML package (htmlParse or htmlTreeParse, and readHTMLTable for tables); it is more forgiving than the XML parser and can handle quite a few of these cases. See also "Scraping html tables into R data frames using the XML package".
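
As a rough illustration of that suggestion, here is a minimal sketch that assumes the input can reasonably be treated as HTML. Note that the HTML parser converts entities to the characters they stand for rather than deleting them, so the dashes would still need a gsub afterwards if they must go. The sample string is invented for the example.

```r
library(XML)

# Invented sample input containing HTML-only entities
InText <- "<html><body><p>Hello &mdash; world &ndash; again</p></body></html>"

# htmlParse goes through libxml2's HTML parser, which recognizes
# the HTML 4.0 named entities instead of rejecting them.
doc <- htmlParse(InText, asText = TRUE)

# Text content of the whole document; entities arrive as real characters
Text <- xmlValue(xmlRoot(doc))
print(Text)
```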

Dieter Menne