
I'm trying to parse a Hebrew HTML webpage and am having problems using the RCurl tools for that purpose. I've been reading the following:

I have used the following R code:

library(XML)
library(RCurl)
url_get <- "http://www.agora.co.il/toGet.asp?searchType=searchAll&dealType=1&dealStatus=1"
download.file(url_get, "codes/tmp.html")
txt <- readLines("codes/tmp.html", encoding="UTF-8")
pagetree <- htmlParse(txt, useInternalNodes = TRUE, encoding="UTF-8")

While readLines() produces proper Hebrew (בעלי מקצוע);

 txt[345]
[1] "<a id=\"professionals\" href=\"/texts/midrag.asp?parameter=\" target=\"_blank\" title=\"בעלי מקצוע\">"

htmlParse() garbles it (׳•׳— ׳—׳₪׳¦׳™ ׳™׳“ ׳©׳ ׳™׳” ׳׳׳¡׳™׳¨׳” ׳‘׳—׳™׳ ׳ ׳‘׳׳‘׳“):

    <a href="http://shlah.agora.co.il/financial/financial1.html">׳׳¦׳׳× ׳׳”׳׳™׳ ׳•׳¡</a><br><br><span class="linkWords">׳׳•׳— ׳—׳₪׳¦׳™ ׳™׳“ ׳©׳ ׳™׳” ׳׳׳¡׳™׳¨׳” ׳‘׳—׳™׳ ׳ ׳‘׳׳‘׳“ -

Any ideas?

sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: i386-w64-mingw32/i386 (32-bit)

locale:
[1] LC_COLLATE=Hebrew_Israel.1255  LC_CTYPE=Hebrew_Israel.1255    LC_MONETARY=Hebrew_Israel.1255
[4] LC_NUMERIC=C                   LC_TIME=Hebrew_Israel.1255    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] RCurl_1.95-4.3 bitops_1.0-6   XML_3.98-1.1  

loaded via a namespace (and not attached):
[1] tools_3.1.1
dof1985

1 Answer


I can't reproduce your problem. Here are the steps I took:

  1. First try a very simple HTML 5 document:

    library(XML)
    
    # This is the simplest valid HTML-5
    # http://www.brucelawson.co.uk/2010/a-minimal-html5-document/
    hebrew1 <- "
      <!doctype html>
      <title>בעלי מקצו</title>
    "
    
    htmlParse(hebrew1) # NOT OK
    #> <!DOCTYPE html>
    #> <html><head><title>××¢×× ×קצ×</title></head></html>
    #> 
    htmlParse(hebrew1, encoding = "UTF-8") # OK
    #> <!DOCTYPE html>
    #> <html><head><title>בעלי מקצו</title></head></html>
    #> 
    
    hebrew2 <- "
      <!doctype html>
      <meta charset=utf-8>
      <title>בעלי מקצו</title>
    "
    htmlParse(hebrew2) # OK
    #> <!DOCTYPE html>
    #> <html><head>
    #> <meta charset="utf-8">
    #> <title>בעלי מקצו</title>
    #> </head></html>
    #> 
    
  2. Try directly from URL:

    url <- "http://www.agora.co.il/toGet.asp?searchType=searchAll&dealType=1&dealStatus=1"
    html <- htmlParse(url, encoding = "UTF-8")
    XML::getNodeSet(html, "//a")[[1]]
    #> <a href="/signIn.asp?source=signIn">התחבר/י</a>
    
  3. Load from disk:

    tmp <- tempfile()
    download.file(url, tmp)
    html <- htmlParse(tmp, encoding = "UTF-8")
    XML::getNodeSet(html, "//a")[[1]]
    #> <a href="/signIn.asp?source=signIn">התחבר/י</a>
    
  4. Load from lines:

    lines <- readLines(tmp)
    html <- htmlParse(lines, encoding = "UTF-8")
    XML::getNodeSet(html, "//a")[[1]]
    #> <a href="/signIn.asp?source=signIn">התחבר/י</a>
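
As a side note on step 4: the `encoding` argument of `readLines()` only marks the strings as UTF-8, it does not convert any bytes. The following is a minimal base-R sketch of that behaviour, not part of the steps above. The temporary file is a hypothetical stand-in for the downloaded page, and reading the bytes as Latin-1 is only an assumed illustration of how garbled output like the "NOT OK" case in step 1 can arise.

```r
# Base R only. The Hebrew text is written with \u escapes so the script
# does not depend on the encoding of the source file.
hebrew <- "\u05d1\u05e2\u05dc\u05d9"        # four Hebrew letters
utf8_bytes <- charToRaw(enc2utf8(hebrew))   # two bytes per letter in UTF-8

# Write the raw UTF-8 bytes to a temporary file
# (hypothetical stand-in for the downloaded page).
tmp <- tempfile(fileext = ".html")
writeBin(c(utf8_bytes, charToRaw("\n")), tmp)

# readLines(encoding = ) marks the strings; it does not re-encode them.
plain  <- readLines(tmp)
marked <- readLines(tmp, encoding = "UTF-8")
Encoding(plain)    # "unknown": bytes are interpreted in the native locale
Encoding(marked)   # "UTF-8"
identical(charToRaw(plain), charToRaw(marked))   # same bytes either way

# Assumed illustration: decoding the same bytes as Latin-1 turns every
# byte into its own character, doubling the number of characters.
misread <- iconv(rawToChar(utf8_bytes), from = "latin1", to = "UTF-8")
nchar(hebrew, type = "chars")    # 4
nchar(misread, type = "chars")   # 8
```

In a non-UTF-8 locale such as `Hebrew_Israel.1255`, unmarked strings are interpreted in the locale's code page, so checking `Encoding()` on the input before calling `htmlParse()` is a cheap way to see which interpretation R will use.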
    
hadley
  • Thanks @hadley, it worked. Yet I find it strange that the very simple HTML5 example didn't work and garbled the Hebrew, while the others worked. I used exactly the code you presented. In addition, when I print the parsed document returned by htmlParse() in the other methods, the text is still garbled; only with `XML::getNodeSet(html,"//a")[[1]]` does the Hebrew display correctly. Does anyone have an idea why? – dof1985 Aug 26 '14 at 17:12
  • @dof1985 I suspect it's just a printing problem. The two objects have different classes, so they probably have different print methods, and one of them probably has a bug. – hadley Aug 26 '14 at 20:40