
I would like htmlParse to work well with Hebrew, but it keeps scrambling the Hebrew text in pages I feed into it.

For example:

# why can't I parse the Hebrew correctly?
library(RCurl)
library(XML)
u = "http://humus101.com/?p=2737"
a = getURL(u) 
a # Here - the hebrew is fine.
a2 <- htmlParse(a)
a2 # Here it is a mess...

None of these seem to fix it:

htmlParse(a, encoding = "utf-8")
htmlParse(a, encoding = "iso8859-8")

This is my locale:

> Sys.getlocale()
[1] "LC_COLLATE=Hebrew_Israel.1255;LC_CTYPE=Hebrew_Israel.1255;LC_MONETARY=Hebrew_Israel.1255;LC_NUMERIC=C;LC_TIME=Hebrew_Israel.1255"
> 

Any suggestions?

Tal Galili
  • I suspect `htmlParse` returns something in the encoding specified, which results in garbage if it is different from the encoding used by R. I am in a UTF-8 locale, and `htmlParse(a, encoding = "utf-8")` works fine. – Vincent Zoonekynd Jan 30 '12 at 09:35
  • Hi Vincent, could you please write what locales you are using? – Tal Galili Jan 30 '12 at 12:23
  • I use en_GB.UTF-8: `LC_CTYPE=en_GB.UTF-8;LC_NUMERIC=C;LC_TIME=en_GB.UTF-8;LC_COLLATE=en_GB.UTF-8;LC_MONETARY=en_GB.UTF-8;LC_MESSAGES=en_GB.UTF-8;LC_PAPER=C;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_GB.UTF-8;LC_IDENTIFICATION=C`. – Vincent Zoonekynd Jan 30 '12 at 12:27
  • mmm... I can not do this, since I am using windows. I tried setting "English" and "Hebrew" but these also didn't work. Any other suggestions? – Tal Galili Jan 30 '12 at 12:35
  • The easiest is probably to convince your operating system of using UTF-8 (but I do not know Windows). If this does not work, you could try to use `iconv` to convert between UTF-8 and 1255. – Vincent Zoonekynd Jan 30 '12 at 12:51
  • Sadly, I can not seem to get it to work... – Tal Galili Jan 30 '12 at 13:07
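Vincent's `iconv` suggestion in the comments above can be sketched in base R. The sample string below is my own; `CP1255` is the Windows Hebrew code page matching the `Hebrew_Israel.1255` locale shown in the question:

```r
# A Hebrew string in UTF-8 (as getURL would return it with .encoding = "UTF-8")
x_utf8 <- "\u05e9\u05dc\u05d5\u05dd"  # "shalom"

# Convert UTF-8 -> Windows-1255, the native encoding of a Hebrew_Israel.1255 locale
x_1255 <- iconv(x_utf8, from = "UTF-8", to = "CP1255")

# And back again; the round trip is lossless for Hebrew letters
x_back <- iconv(x_1255, from = "CP1255", to = "UTF-8")
identical(x_back, x_utf8)  # TRUE on platforms where iconv knows CP1255
```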

1 Answer


Specify UTF-8 encoding in both the call to getURL and the call to htmlParse.

a <- getURL(u, .encoding = "UTF-8")
htmlParse(a, encoding = "UTF-8")

These locale issues are always a pain to get to the bottom of. When I type cat(a) (after specifying UTF-8 encoding in getURL) I see that the he.wordpress.org page claims to be UTF-8: <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />, but the Hebrew characters print as escaped Unicode code points, like <U+05D3><U+05E6><U+05DE><U+05D1><U+05E8>. So it could be a problem caused by mixed encodings on that web page.

Comparing several encodings, the only one that doesn't generate gibberish on my machine is UTF-8.

(trees <- lapply(c("UTF-8", "UTF-16", "latin1"), function(enc)
{
  a <- getURL(u, .encoding = enc)  # add .opts here if you need proxy settings
  htmlParse(a, encoding = enc)
}))

If you get desperate, pass iconvlist() to lapply in the above code and see if any of the available encodings works for you.
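Before looping over all of iconvlist(), it may help to narrow it down first; a sketch, where the grep pattern is my own guess at the Hebrew-relevant encoding names:

```r
# Candidate encodings worth trying for a Hebrew page: the ISO and Windows
# Hebrew code pages plus the Unicode transformation formats
encs <- grep("8859-8|1255|UTF", iconvlist(), value = TRUE, ignore.case = TRUE)
encs
# Then substitute encs for c("UTF-8", "UTF-16", "latin1") in the lapply above
```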

Richie Cotton