Hello users of R's XML package,
I am running into a weird problem while parsing XML files that contain HTML entities such as mdash and ndash.
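This is the code I use:
library(XML)
InText = readLines(xmlFileName, n = -1)
# InText holds the file's contents (not a file name), hence asText = TRUE
Text = xmlValue(xmlRoot(xmlTreeParse(InText, asText = TRUE, trim = FALSE)))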
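I am currently eliminating entities such as mdash and ndash with the following (the trailing ";" is included in the pattern so that it is not left behind in the text):
InText = gsub("&mdash;", " ", InText, fixed = TRUE)
InText = gsub("&ndash;", " ", InText, fixed = TRUE)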
But this gets really tedious, given how long the list of HTML 4.0 entities is.
Any ideas on how I can eliminate these while parsing the XML files?
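To make the question concrete, here is a rough sketch of the kind of generalisation I have in mind, not a tested solution: a single regular expression that blanks out every named entity except the five that XML itself predefines (amp, lt, gt, quot, apos), leaving numeric references such as &#8212; alone. The helper name strip_html_entities and the sample input are made up for illustration, and I am assuming that replacing each entity with a space is acceptable.

```r
# Hypothetical helper: replace every named entity with a space, except the
# five entities predefined by XML. Numeric references (&#...;) are untouched
# because the pattern requires a letter right after the "&".
strip_html_entities <- function(lines) {
  gsub("&(?!(?:amp|lt|gt|quot|apos);)[A-Za-z][A-Za-z0-9]*;", " ",
       lines, perl = TRUE)
}

# Made-up sample input standing in for the result of readLines()
sample_lines <- c("<doc>",
                  "<p>A&mdash;B &amp; C&ndash;D &#8212; E</p>",
                  "</doc>")

cleaned <- strip_html_entities(sample_lines)
print(cleaned[2])
# "<p>A B &amp; C D &#8212; E</p>"
```

In my case this would be applied to InText right after readLines() and before the call to xmlTreeParse().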
Thanks a lot for your help and ideas,
Shivani