R 3.1.1 (32-bit): htmlParse() messes up Hebrew text, OS: Win 7

Question

I am trying to parse a Hebrew HTML webpage and am having problems using the RCurl tools for that purpose.

I have used the following R code:

library(XML)
library(RCurl)
url_get <- "http://www.agora.co.il/toGet.asp?searchType=searchAll&dealType=1&dealStatus=1"
download.file(url_get, "codes/tmp.html")
txt <- readLines("codes/tmp.html", encoding="UTF-8")
pagetree <- htmlParse(txt, useInternalNodes = TRUE, encoding="UTF-8")

While readLines() produces proper Hebrew (בעלי מקצוע);

 txt[345]
[1] "<a id=\"professionals\" href=\"/texts/midrag.asp?parameter=\" target=\"_blank\" title=\"בעלי מקצוע\">"

htmlParse() messes it up (׳•׳— ׳—׳₪׳¦׳™ ׳™׳“ ׳©׳ ׳™׳” ׳׳׳¡׳™׳¨׳” ׳‘׳—׳™׳ ׳ ׳‘׳׳‘׳“):

    <a href="http://shlah.agora.co.il/financial/financial1.html">׳׳¦׳׳× ׳׳”׳׳™׳ ׳•׳¡</a><br><br><span class="linkWords">׳׳•׳— ׳—׳₪׳¦׳™ ׳™׳“ ׳©׳ ׳™׳” ׳׳׳¡׳™׳¨׳” ׳‘׳—׳™׳ ׳ ׳‘׳׳‘׳“ -

Any ideas?

sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: i386-w64-mingw32/i386 (32-bit)

locale:
[1] LC_COLLATE=Hebrew_Israel.1255  LC_CTYPE=Hebrew_Israel.1255    LC_MONETARY=Hebrew_Israel.1255
[4] LC_NUMERIC=C                   LC_TIME=Hebrew_Israel.1255    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] RCurl_1.95-4.3 bitops_1.0-6   XML_3.98-1.1  

loaded via a namespace (and not attached):
[1] tools_3.1.1

What is the url you downloaded the file from? Without that it's impossible to reproduce your problem. — hadley, Aug 23 '14 at 20:16
@hadley you are right, thanks. I have edited the URL into the code. — dof1985, Aug 24 '14 at 22:02

score 3 · Accepted Answer · answered Aug 25 '14 at 14:46

I can't reproduce your problem. Here are the steps I took:

First try a very simple HTML 5 document:

library(XML)

# This is the simplest valid HTML-5
# http://www.brucelawson.co.uk/2010/a-minimal-html5-document/
hebrew1 <- "
  <!doctype html>
  <title>בעלי מקצו</title>
"

htmlParse(hebrew1) # NOT OK
#> <!DOCTYPE html>
#> <html><head><title>××¢×× ××§×¦×</title></head></html>
#> 
htmlParse(hebrew1, encoding = "UTF-8") # OK
#> <!DOCTYPE html>
#> <html><head><title>בעלי מקצו</title></head></html>
#> 

hebrew2 <- "
  <!doctype html>
  <meta charset=utf-8>
  <title>בעלי מקצו</title>
"
htmlParse(hebrew2) # OK
#> <!DOCTYPE html>
#> <html><head>
#> <meta charset="utf-8">
#> <title>בעלי מקצו</title>
#> </head></html>
#>
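
The garbled title in the first call is consistent with the UTF-8 bytes being read as a single-byte encoding such as Latin-1 when no encoding is declared. That is an interpretation, not something stated in the answer. The sketch below reproduces the same kind of garbling in base R only, without the XML package; the variable names are illustrative.

```r
# "בעלי" written with \u escapes so the script itself stays ASCII-only
hebrew <- "\u05d1\u05e2\u05dc\u05d9"

# In UTF-8, each Hebrew letter takes two bytes, the first of which is 0xd7
utf8_bytes <- charToRaw(enc2utf8(hebrew))
print(utf8_bytes)

# Assumption for illustration: pretend those bytes are Latin-1 text
misread <- iconv(rawToChar(utf8_bytes), from = "latin1", to = "UTF-8")

nchar(hebrew)   # 4 letters
nchar(misread)  # 8 characters: every byte became a character of its own

# The first misread character is the multiplication sign seen in the garbled output
substr(misread, 1, 1) == "\u00d7"
```

Passing encoding = "UTF-8" to htmlParse(), or declaring the charset in a meta tag as in hebrew2, avoids this misreading.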

Try directly from URL:

url <- "http://www.agora.co.il/toGet.asp?searchType=searchAll&dealType=1&dealStatus=1"
html <- htmlParse(url, encoding = "UTF-8")
XML::getNodeSet(html, "//a")[[1]]
#> <a href="/signIn.asp?source=signIn">התחבר/י</a>

Load from disk:

tmp <- tempfile()
download.file(url, tmp)
html <- htmlParse(tmp, encoding = "UTF-8")
XML::getNodeSet(html, "//a")[[1]]
#> <a href="/signIn.asp?source=signIn">התחבר/י</a>

Load from lines:

lines <- readLines(tmp)
html <- htmlParse(lines, encoding = "UTF-8")
XML::getNodeSet(html, "//a")[[1]]
#> <a href="/signIn.asp?source=signIn">התחבר/י</a>

Thanks @hadley, it worked. Yet I find it strange that the very simple HTML-5 example didn't work and garbled the text, while the others worked. I used exactly the code you presented. In addition, when I print the htmlParse() result from the other methods, the text is still garbled; only with `XML::getNodeSet(html,"//a")[[1]]` does the Hebrew display correctly. Does anyone have an idea why? — dof1985, Aug 26 '14 at 17:12
@dof1985 I suspect it's just a printing problem. The classes of the two objects are different, so they probably have different print methods, and one of them probably has a bug in it. — hadley, Aug 26 '14 at 20:40
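
If the remaining garbling is indeed a printing issue, one way to check is to pull the text out of the parsed document and compare it with the expected string, rather than judging by how the whole document prints. A minimal sketch, assuming the XML package is installed; the inline HTML and the variable names are made up for illustration.

```r
library(XML)

# Expected title, "בעלי מקצוע", written with \u escapes
expected <- "\u05d1\u05e2\u05dc\u05d9 \u05de\u05e7\u05e6\u05d5\u05e2"
page <- paste0("<!doctype html><title>", expected, "</title>")

doc <- htmlParse(page, asText = TRUE, encoding = "UTF-8")
title_node <- getNodeSet(doc, "//title")[[1]]

# The document and the node have different classes, hence different print methods
class(doc)
class(title_node)

# Extract the text itself and compare it, bypassing print()
title_text <- xmlValue(title_node)
identical(enc2utf8(title_text), enc2utf8(expected))
```

If the comparison returns TRUE while the printed document still looks garbled, the parsed data is intact and only the console output is affected.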
