Encoding issues crawling non-English websites

Question

I'm trying to get the contents of a webpage as a string. I found this question on writing a basic web crawler, whose code claims to (and seems to) handle encoding. However, that code works for US/English websites but fails to handle other languages properly.

Here is a full Java class that demonstrates what I'm referring to:

import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;


public class I18NScraper
{
    static
    {
        System.setProperty("http.agent", "");
    }

    public static final String IE8_USER_AGENT = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; WOW64; Trident/4.0; SLCC1; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; InfoPath.2)";

    // https://stackoverflow.com/questions/1381617/simplest-way-to-correctly-load-html-from-web-page-into-a-string-in-java
    private static final Pattern CHARSET_PATTERN = Pattern.compile("text/html;\\s+charset=([^\\s]+)\\s*");
    public static String getPageContentsFromURL(String page) throws UnsupportedEncodingException, MalformedURLException, IOException {
        Reader r = null;
        try {
            URL url = new URL(page);
            HttpURLConnection con = (HttpURLConnection)url.openConnection();
            con.setRequestProperty("User-Agent", IE8_USER_AGENT);

            Matcher m = CHARSET_PATTERN.matcher(con.getContentType());
            /* If Content-Type doesn't match this pre-conception, choose default and 
             * hope for the best. */
            String charset = m.matches() ? m.group(1) : "ISO-8859-1";
            r = new InputStreamReader(con.getInputStream(),charset);
            StringBuilder buf = new StringBuilder();
            while (true) {
              int ch = r.read();
              if (ch < 0)
                break;
              buf.append((char) ch);
            }
            return buf.toString();
        } finally {
            if(r != null){
                r.close();
            }
        }
    }

    private static final Pattern TITLE_PATTERN = Pattern.compile("<title>([^<]*)</title>");
    public static String getDesc(String page){
        Matcher m = TITLE_PATTERN.matcher(page);
        if(m.find())
            return m.group(1);
        return page.contains("<title>")+"";
    }

    public static void main(String[] args) throws UnsupportedEncodingException, MalformedURLException, IOException{
        System.out.println(getDesc(getPageContentsFromURL("http://yandex.ru/yandsearch?text=%D0%A0%D0%B5%D0%B7%D1%83%D0%BB%D1%8C%D1%82%D0%B0%D1%82%D0%BE%D0%B2&lr=223")));
    }
}

Which outputs:

???????????&nbsp;&mdash; ??????: ??????? 360&nbsp;???&nbsp;???????

Though it ought to be:

Результатов&nbsp;&mdash; Яндекс: Нашлось 360&nbsp;млн&nbsp;ответов

Can you help me understand what I'm doing wrong? Forcing UTF-8 does not help, even though that is the charset listed in both the page source and the HTTP header.

Did you try with [Apache Http Client 4.x](http://hc.apache.org/httpcomponents-client-ga/)? I find it much more comfortable and stable to work with. It should take care of most of the encoding madness, too. Handling the `<meta>` element Joel mentioned below would still be up to you, but [EntityUtils](http://hc.apache.org/httpcomponents-core-ga/httpcore/apidocs/org/apache/http/util/EntityUtils.html) goes a long way. — Philipp Reichart, Sep 30 '11 at 21:39
The fact that you're getting '?' and not U+FFFD is telling here. Maybe there is an implicit interpretation of ISO-8859-1 going on. Many parts of the standard library default to this encoding. — wberry, Sep 30 '11 at 22:25
How do you know that the decoding is happening incorrectly, and not the encoding of the debug output? Before you return the string, you should print the numeric values of the characters and see what they are as a check. — erickson, Sep 30 '11 at 23:22
Well well well, looks like this is an OS specific issue. Running on my Mac it outputs ???? but running on my Linux machine works just fine. The first several characters are 10 1056 1077 1079 1091 1083 1100 1090 1072 1090 1086 1074 - not sure what to interpret from that, but they're not actually question marks. — dimo414, Oct 01 '11 at 01:26
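
The check erickson suggests can be sketched as follows. This is a minimal illustration, not code from the thread: the class name `CodePointDump` and its `dump` helper are hypothetical, and it assumes Java 8 or later for `String.codePoints()` (the thread itself predates that API). It prints the numeric code points of a string, so the result does not depend on how the console encodes Cyrillic text.

```java
import java.nio.charset.Charset;

// Hypothetical helper, not part of the original question's code.
final class CodePointDump {
    /** Returns the first {@code limit} code points of {@code s} as space-separated decimals. */
    static String dump(String s, int limit) {
        StringBuilder out = new StringBuilder();
        s.codePoints().limit(limit).forEach(cp -> {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(cp);
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // Written with escapes so the source file's own encoding cannot interfere.
        String title = "\u0420\u0435\u0437\u0443\u043B\u044C\u0442\u0430\u0442\u043E\u0432"; // "Результатов"
        System.out.println(dump(title, 20));
        // The charset the JVM uses by default when none is specified explicitly.
        System.out.println("default charset: " + Charset.defaultCharset());
    }
}
```

If the numbers fall in the Cyrillic range (1024 to 1279), as the values reported above do, the page was decoded correctly and the question marks are introduced later, when the string is written to the console.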

score 2 · Answer 1 · edited May 23 '17 at 12:12

Determining the right charset encoding can be tricky.

You need to use a combination of

a) the HTML META Content-Type tag:

<META http-equiv="Content-Type" content="text/html; charset=EUC-JP">

b) the HTTP response header:

Content-Type: text/html; charset=utf-8

c) Heuristics to detect charset from bytes (see this question)

The reason for using all three is:

- (a) and (b) might be missing
- the META Content-Type might be wrong (see this question)

What to do if (a) and (b) are both missing?

In that case you need to use some heuristics to determine the correct encoding - see this question.

I find this sequence to be the most reliable for robustly identifying the charset encoding of an HTML page:

1. Use the HTTP response header Content-Type (if it exists)
2. Use an encoding detector on the response content bytes
3. Use the HTML META Content-Type

but you might choose to swap 2 and 3.
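
One way to sketch that sequence in plain Java is shown below. The class `CharsetResolver` and its methods are hypothetical names chosen for illustration. The "encoding detector" step is replaced here by a much weaker stand-in, a strict UTF-8 validity check; a real detector library would cover many more encodings. The final fallback to ISO-8859-1 is also an assumption, mirroring the default in the question's code.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper illustrating the header -> detector -> META order.
final class CharsetResolver {
    // Finds charset=... in a header value or a META tag, with optional quotes.
    private static final Pattern CHARSET =
        Pattern.compile("charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    /** Steps 1 and 3: returns the declared charset, or null if absent or unknown to this JVM. */
    static Charset fromDeclaration(String text) {
        if (text == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(text);
        if (!m.find()) {
            return null;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) { // illegal or unsupported charset name
            return null;
        }
    }

    /** Step 2 stand-in: true if the bytes decode as UTF-8 without any error. */
    static boolean isValidUtf8(byte[] body) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(body));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    static Charset resolve(String contentTypeHeader, byte[] body) {
        Charset cs = fromDeclaration(contentTypeHeader);
        if (cs != null) {
            return cs;
        }
        if (isValidUtf8(body)) {
            return StandardCharsets.UTF_8;
        }
        // HTML markup is ASCII in most encodings, so ISO-8859-1 is enough to locate the META tag.
        String head = new String(body, 0, Math.min(body.length, 4096), StandardCharsets.ISO_8859_1);
        cs = fromDeclaration(head);
        return cs != null ? cs : StandardCharsets.ISO_8859_1; // assumed last-resort default
    }
}
```

Unlike the pattern in the question, this regex does not require whitespace after the semicolon, so a header such as `text/html;charset=utf-8` is also recognized, and a missing Content-Type header is tolerated.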

score 1 · Accepted Answer · answered Oct 01 '11 at 04:26

The problem you are seeing is that Java's default character encoding on your Mac doesn't support Cyrillic script. I'm not sure if it's true on an Oracle JVM, but when Apple was producing their own JVMs, the default character encoding for Java was MacRoman.

When you start your program, specify the file.encoding system property to set the character encoding to UTF-8 (which is what Mac OS X uses by default). Note that you have to set it at launch: java -Dfile.encoding=UTF-8 ...; if you set it programmatically (with a call to System.setProperty()), it's too late, and the setting will be ignored.

Whenever Java needs to encode characters to bytes—for example, when it's converting text to bytes to write to the standard output or error streams—it will use the default unless you explicitly specify a different one. If the default encoding can't encode a particular character, a suitable replacement character is substituted.

If the encoding can handle the Unicode replacement character, U+FFFD (�), that's used. Otherwise, a question mark (?) is a commonly used replacement character.
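
The substitution described above can be reproduced without a Mac. The sketch below is illustrative only: the class `ReplacementDemo` is a hypothetical name, and ISO-8859-1 stands in for MacRoman because it is guaranteed to exist on every JVM and likewise has no Cyrillic letters. It also shows one alternative to the `-Dfile.encoding` flag, namely wrapping `System.out` in a `PrintStream` with an explicit encoding; that only helps if the terminal itself interprets the bytes as UTF-8.

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original answer.
final class ReplacementDemo {
    static final String TITLE = "\u0420\u0435\u0437"; // "Рез"

    /** Encodes with a charset that lacks Cyrillic, then decodes again: each letter becomes '?'. */
    static String roundTripLatin1(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    /** Decodes a byte that is never valid in UTF-8: the result is U+FFFD. */
    static String decodeBadUtf8() {
        return new String(new byte[] {(byte) 0xFF}, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        System.out.println(roundTripLatin1(TITLE)); // prints ???

        // Explicit encoding for console output, independent of the JVM default.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println(TITLE);
    }
}
```

The first method shows where the question marks come from (encoding characters the target charset cannot represent), while the second shows the case that yields U+FFFD instead (decoding bytes that are invalid in the source charset).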

I tested with my iMac, and on java version "1.6.0_26", the default encoding is still "MacRoman". This is true even though my `LANG` is set to "en_US.UTF-8". — erickson, Oct 01 '11 at 14:41
Adding that system property flag output the following: –†–µ–∑—É–ª—å—Ç–∞—Ç–æ–≤ — –Ø–Ω–¥–µ–∫—Å: –ù–∞—à–ª–æ—Å—å 298 –º–ª–Ω –æ—Ç–≤–µ—Ç–æ–≤ — dimo414, Oct 03 '11 at 14:07
Here we go! I found http://www.ibm.com/developerworks/opensource/library/os-eclipse-osxjava/ which describes how to set UTF-8 in eclipse. The -D flag you mention works correctly at the command line. Thanks for your help. — dimo414, Oct 03 '11 at 14:51

score 0 · Answer 3 · answered Sep 30 '11 at 21:58

Apache Tika contains an implementation of what you want here, and many people use it for this. You could also look into Apache Nutch, in which case you wouldn't have to implement your own crawler at all.

bmargulies
