I have been cribbing off of the very helpful responses on Scraping html tables into R data frames using the XML package to scrape some html off the web and work with it in R.
The XML package seems to be pretty thorough about escaping non-alphabetic characters in text strings. Is there a simple way in XML or some other package that would reverse some/all of the character escaping that passing my data through XML did? I started to do it myself, but after encountering cases like 'Representative JoaquÃÂn Castro' thought 'there must be a better solution...'
Just for clarity, using the XML package to parse this HTML
library(XML)
apos_str <- c("<b>Tim O'Reilly</b>")
apos_str.parsed <- htmlTreeParse(apos_str, error=function(...){})
apos_str.parsed$children$html[[1]][[1]]
would produce
<b>Tim O'Reilly</b>
And I'd ideally like a function or package that would search for that
'
and turn it back into
'<b>Tim O'Reilly</b>'
Edit To clarify, from the comments below, I get how to do this for the particular case of apostrophes, or any other character I see in my data. What I'm looking for is a package where someone has worked this out more generally.
Research I've done so far:
-Read everything I could find in the XML documentation on escaping.
-Looked for a promising package on the CRAN NLP page.
-did a search for 'unescape [R]' and 'reverse escape [R]' here on SO. Wasn't able to make any headway so thought I would bring the question here.