I have been cribbing from the very helpful responses to "Scraping html tables into R data frames using the XML package" to scrape some HTML off the web and work with it in R.

The XML package seems to be pretty thorough about escaping non-alphabetic characters in text strings. Is there a simple way in XML or some other package that would reverse some/all of the character escaping that passing my data through XML did? I started to do it myself, but after encountering cases like 'Representative Joaquín Castro' thought 'there must be a better solution...'

Just for clarity, using the XML package to parse this HTML

 library(XML)
 apos_str <- c("<b>Tim O'Reilly</b>")
 apos_str.parsed <- htmlTreeParse(apos_str, error=function(...){})
 apos_str.parsed$children$html[[1]][[1]]

would produce

 <b>Tim O&apos;Reilly</b>

And I'd ideally like a function or package that would search for that

&apos; 

and turn it back into

'<b>Tim O'Reilly</b>'

Edit: To clarify, following the comments below: I get how to do this for the particular case of apostrophes, or any other character I see in my data. What I'm looking for is a package where someone has worked this out more generally.

Research I've done so far:

-Read everything I could find in the XML documentation on escaping.

-Looked for a promising package on the CRAN NLP page.

-Did a search for 'unescape [R]' and 'reverse escape [R]' here on SO. I wasn't able to make any headway, so I thought I would bring the question here.

Andrew

1 Answer

I'm not sure I understand the difficulty. String replacement is done with the base regex functions sub and gsub; regexpr and gregexpr locate the matches.

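?sub # the same help page also documents 'gsub'
txt <- '<b>Tim O&apos;Reilly</b>'
# fixed = TRUE matches "&apos;" literally; gsub replaces every occurrence, sub only the first
gsub("&apos;", "'", txt, fixed = TRUE)
# [1] "<b>Tim O'Reilly</b>"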

If you had a list of values that occur between "&" and ";" you could split on those and then recombine. I suppose it is possible that you were hoping someone had already done that. You should clarify what level of abstraction you were hoping to achieve.
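As a minimal sketch of that idea, a named character vector can serve as the lookup table. It assumes a hand-picked subset of entities rather than the full W3C list, and the names entity_table and unescape_named are hypothetical, introduced only for illustration.

```r
# Illustrative subset of named entities; NOT the complete W3C list.
entity_table <- c(
  "&apos;" = "'",
  "&quot;" = "\"",
  "&lt;"   = "<",
  "&gt;"   = ">",
  "&amp;"  = "&"   # kept last so "&amp;lt;" becomes "&lt;", not "<"
)

# Hypothetical helper: replace each entity in the table with its character.
unescape_named <- function(x, table = entity_table) {
  for (entity in names(table)) {
    # fixed = TRUE treats the entity as a literal string, not a regex
    x <- gsub(entity, table[[entity]], x, fixed = TRUE)
  }
  x
}

unescape_named("<b>Tim O&apos;Reilly</b> &amp; friends")
# [1] "<b>Tim O'Reilly</b> & friends"
```

"&amp;" is handled last on purpose: doubly escaped text is then unescaped by one level only, which is usually what you want after a single pass through a parser.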

EDIT: A blogger discusses the specific case of "&apos;": http://fishbowl.pastiche.org/2003/07/01/the_curse_of_apos/

I've done some further research of my own. Those are not properly called "escapes" but rather "named entities". I cannot find any references to them in the Rhelp archives. I have downloaded the XML listing from the w3.org website that defines these "entities" and am trying to convert it to a tabular form that would support search and replace. But your comment about 'Representative Joaquín Castro' has me puzzled. The odd characters are not in the form "&#xxx;", so what exactly are you asking for? Please post a suitable test case with the expected output.
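For text that does use numeric character references, such as "&#39;" (decimal) or "&#x27;" (hexadecimal), base R is enough. This is a minimal sketch, not a complete decoder: unescape_numeric is a hypothetical name, and the function assumes well-formed references and does not handle NA input or out-of-range code points.

```r
# Hypothetical helper: decode decimal ("&#39;") and hexadecimal ("&#x27;")
# numeric character references in each element of a character vector.
unescape_numeric <- function(x) {
  pattern <- "&#([0-9]+|[xX][0-9A-Fa-f]+);"
  vapply(x, function(s) {
    m <- gregexpr(pattern, s)
    refs <- regmatches(s, m)[[1]]
    if (length(refs) == 0) return(s)   # nothing to decode
    decoded <- vapply(refs, function(ref) {
      # strip the leading "&#" and the trailing ";"
      body <- sub("^&#", "", sub(";$", "", ref))
      code <- if (grepl("^[xX]", body)) {
        strtoi(substring(body, 2), base = 16L)
      } else {
        strtoi(body, base = 10L)
      }
      intToUtf8(code)                  # code point -> UTF-8 character
    }, character(1), USE.NAMES = FALSE)
    regmatches(s, m) <- list(decoded)  # put the decoded characters back
    s
  }, character(1), USE.NAMES = FALSE)
}

unescape_numeric("Tim O&#39;Reilly")
# [1] "Tim O'Reilly"
unescape_numeric("Joaqu&#xED;n Castro")  # decodes code point U+00ED
```

Named entities are a separate case and still need a lookup table, since their names cannot be converted to code points arithmetically.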

EDIT 2: There was a basically identical question from Michael Friendly that just got answered by David Carlson on Rhelp. Here's the link to the posting on the Rhelp archives:

https://stat.ethz.ch/pipermail/r-help/2012-August/321478.html

He's already done a better job than I had on creating a translation table and has included code to march through html text. (and a bonus... he included &apos). And a next-day followup from Michael Friendly has wrapped the process up in a function. You can follow the link on the Archives page.

IRTFM
  • sorry if I wasn't clear. The apostrophe one is pretty simple and I can do it through sub or stringr; I was basically wondering if there was a better solution that would process many different escaped characters (including, say ' Joaquín Castro' -> 'Joaquín Castro') – Andrew Aug 12 '12 at 19:11
  • so, exactly, I'm asking if someone has already done this, including weird edge cases that I wouldn't immediately think of. I obviously did a bad job of making this clear in the question. – Andrew Aug 12 '12 at 19:12