I am trying to parse a Hebrew HTML web page and am having problems using the RCurl tools for that purpose. I have been reading the following:
- Getting htmlParse to work with hebrew
- Extracting a clean "UTF-8" text from a web page scraped with RCurl (this one works for Japanese, but the solution there does not work on my computer)
- R-Help: getting htmlParse to work with Hebrew (on Windows)
I have used the following R code:
library(XML)
library(RCurl)
url_get<-"http://www.agora.co.il/toGet.asp?searchType=searchAll&dealType=1&dealStatus=1"
download.file(url_get, "codes/tmp.html")
txt <- readLines("codes/tmp.html", encoding="UTF-8")
pagetree <- htmlParse(txt, useInternalNodes = TRUE, encoding="UTF-8")
readLines() produces proper Hebrew (בעלי מקצוע):
txt[345]
[1] "<a id=\"professionals\" href=\"/texts/midrag.asp?parameter=\" target=\"_blank\" title=\"בעלי מקצוע\">"
htmlParse(), however, garbles it (׳•׳— ׳—׳₪׳¦׳™ ׳™׳“ ׳©׳ ׳™׳” ׳׳׳¡׳™׳¨׳” ׳‘׳—׳™׳ ׳ ׳‘׳׳‘׳“):
<a href="http://shlah.agora.co.il/financial/financial1.html">׳׳¦׳׳× ׳׳”׳׳™׳ ׳•׳¡</a><br><br><span class="linkWords">׳׳•׳— ׳—׳₪׳¦׳™ ׳™׳“ ׳©׳ ׳™׳” ׳׳׳¡׳™׳¨׳” ׳‘׳—׳™׳ ׳ ׳‘׳׳‘׳“ -
Any ideas?
sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: i386-w64-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=Hebrew_Israel.1255 LC_CTYPE=Hebrew_Israel.1255 LC_MONETARY=Hebrew_Israel.1255
[4] LC_NUMERIC=C LC_TIME=Hebrew_Israel.1255
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] RCurl_1.95-4.3 bitops_1.0-6 XML_3.98-1.1
loaded via a namespace (and not attached):
[1] tools_3.1.1