R xml encountering and dealing with html entities in an xml file

Question

Hello R's XML package users,

I am encountering a weird bug while parsing XML files that contain HTML entities such as &mdash; and &ndash;.

This is the code I use:

InText = readLines(xmlFileName,n=-1)
Text = xmlValue(xmlRoot(xmlTreeParse(InText,trim=FALSE)))

I currently eliminate entities such as mdash and ndash with the following:

InText = gsub("\\&mdash;"," ",InText);
InText = gsub("\\&ndash;"," ",InText);

But this can get really tedious, given how long the list of HTML 4.0 entities is.

Any ideas on how I can eliminate these while parsing the XML files?

Thanks a lot for your help and ideas,
Shivani

score 1 · Answer 1 · answered May 28 '12 at 06:09

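If you simply want to remove all entities, use a regex before parsing. Note that this pattern matches anything of the form &...;, so it also strips the predefined XML entities such as &amp;: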

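library("XML")

# Sample document with a predefined XML entity (&amp;) and an HTML-only one (&nbsp;)
InText <- "<html>
<head>
    <title>Test &amp; Test again</title>
</head>
    <body>Hello &nbsp; world</body>
</html>"

# Strip every entity reference of the form &...; before parsing
InText <- gsub("\\&[^;]+;","",InText)

# asText = TRUE states explicitly that InText holds the document, not a file name
Text <- xmlValue(xmlRoot(xmlTreeParse(InText, asText=TRUE, trim=FALSE)))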
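
If you would rather keep the characters than delete them, one option is to rewrite the named entities as numeric character references, which an XML parser accepts without a DTD. The following is only a sketch in base R: html_entities, entities_to_numeric and strip_unknown_entities are hypothetical names made up for this example, and the lookup table holds just three entries that you would need to extend yourself. Anything not in the table is replaced by a space, while the five predefined XML entities are left alone.

```r
# Hypothetical lookup table (not part of the XML package): entity name -> code point.
html_entities <- c(nbsp = 160L, ndash = 8211L, mdash = 8212L)  # extend as needed

# Rewrite known named entities as numeric character references, e.g. &mdash; -> &#8212;
entities_to_numeric <- function(x, table = html_entities) {
  for (name in names(table)) {
    x <- gsub(paste0("&", name, ";"), paste0("&#", table[[name]], ";"), x, fixed = TRUE)
  }
  x
}

# Replace any remaining named entity with a space, but keep the five
# entities that XML itself predefines (amp, lt, gt, quot, apos).
strip_unknown_entities <- function(x) {
  gsub("&(?!(amp|lt|gt|quot|apos);)[A-Za-z][A-Za-z0-9]*;", " ", x, perl = TRUE)
}

InText <- "<doc>Test &amp; Test &mdash; again &nbsp; &foo; done</doc>"
Clean <- strip_unknown_entities(entities_to_numeric(InText))
print(Clean)
```

The cleaned string can then be handed to xmlTreeParse as in the code above.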

score 1 · Answer 2 · edited May 23 '17 at 12:11

Try the HTML readers in the XML package (for example htmlTreeParse, or readHTMLTable for tables); they have robust methods that can handle quite a few of these cases. See also the question "Scraping html tables into R data frames using the XML package".
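
As a rough illustration of that suggestion, the sketch below parses the text with htmlParse from the XML package, whose HTML parser knows the HTML 4 entity names and substitutes the corresponding characters. extract_text_html is a hypothetical helper name, and the sample string is made up for the example. The usage part is guarded so that it only runs when the XML package is installed.

```r
# Sketch: let the HTML parser in the XML package resolve the named entities.
# extract_text_html is a hypothetical helper, not a function of the package.
extract_text_html <- function(txt) {
  # Collapse the lines from readLines() into one string and parse it as HTML
  doc <- XML::htmlParse(paste(txt, collapse = "\n"), asText = TRUE)
  XML::xmlValue(XML::xmlRoot(doc))
}

if (requireNamespace("XML", quietly = TRUE)) {
  InText <- "<html><body><p>Hello &mdash; world &ndash; again</p></body></html>"
  Text <- extract_text_html(InText)
  print(Text)
}
```

Keep in mind that the HTML parser is lenient rather than strict: it lowercases element names and may add html and body wrappers, so this is best suited to cases where only the text content matters.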

answered May 28 '12 at 07:44

Dieter Menne
