Encoding of umlaute in Jsoup with strange behaviour

Question

I have some problems with the encoding behaviour of the Jsoup library.

I want to parse the content of a webpage. To do that, I have to insert people's names into the search URL, and these names can contain German umlauts such as ä, ö, etc.

This is the code I am using:

doc = Jsoup.parse(new URL(searchURL).openStream(), "UTF-8", searchURL);

It parses the HTML of the respective webpage.

But when I look into the document, the ä is shown as follows:

KÃ¤se

What am I doing wrong with the encoding?

The webpage has the following header:

<!doctype html>
<html>
    <head lang="en"> 
    <title>KÃ¤se - Semantic Scholar</title> 
    <meta charset="utf-8"> 
</html>

Can someone help? Thanks :)

EDIT: I tried Stephan's answer and it worked for the webpage www.semanticscholar.org, but I am also parsing another webpage, http://www.authormapper.com/

The same code does not work for this webpage if the name of an author contains a German umlaut. Does anyone know why this is not working? It's very embarrassing for me not to know this...

I see it by setting a breakpoint on the line with the Jsoup.parse() call and inspecting the frame. There, the head contains this curious character sequence instead of ä. – tschens Jun 26 '16 at 20:55

Stephan · Accepted Answer · 2016-06-28T21:11:33.003

This is a known issue of Jsoup. Here are two options to load the content for Jsoup:

Option 1: JDK only

InputStream is = null;

try {
    // Connect to website
    URL tmp = new URL(url);
    HttpURLConnection connection = (HttpURLConnection) tmp.openConnection();
    connection.setReadTimeout(10000);
    connection.setConnectTimeout(10000);
    connection.setRequestMethod("GET");
    connection.connect();

    // Load content for Jsoup
    is = connection.getInputStream(); // We suppose connection.getResponseCode() == 200

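    // Decode the bytes explicitly as UTF-8 instead of relying on the platform default
    int n;
    char[] buffer = new char[4096];
    Reader r = new InputStreamReader(is, StandardCharsets.UTF_8); // java.nio.charset.StandardCharsets
    Writer w = new StringWriter(); // java.io.StringWriter, so this option really is JDK only
    while (-1 != (n = r.read(buffer))) {
        w.write(buffer, 0, n);
    }

    // Parse html; the second argument is the base URI used to resolve relative links
    String html = w.toString();
    Document doc = Jsoup.parse(html, url);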
} catch(IOException e) {
    // Handle exception ...
} finally {
    try {
        if (is != null) {
            is.close();
        }
    } catch (final IOException ioe) {
        // ignore
    }
}

Option 2: With Commons IO

InputStream is = null;

try {
    // Connect to website
    URL tmp = new URL(url);
    HttpURLConnection connection = (HttpURLConnection) tmp.openConnection();
    connection.setReadTimeout(10000);
    connection.setConnectTimeout(10000);
    connection.setRequestMethod("GET");
    connection.connect();

    // Load content for Jsoup
    is = connection.getInputStream(); // We suppose connection.getResponseCode() == 200
    String html = IOUtils.toString(is, "UTF-8"); // decode explicitly as UTF-8

    // Parse html; the second argument is the base URI used to resolve relative links
    Document doc = Jsoup.parse(html, url);
} catch(IOException e) {
    // Handle exception ...
} finally {
    IOUtils.closeQuietly(is);
}
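
Both options decode the response bytes explicitly as UTF-8. As a minimal sketch of why the decoding step matters, the snippet below reproduces the garbled title from the question: it takes the UTF-8 bytes of "Käse" and decodes them as ISO-8859-1 (windows-1252 maps these two bytes to the same characters). The class and method names are made up for illustration and are not part of Jsoup.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {

    /** Encodes text as UTF-8, then decodes the bytes with another charset, as a misconfigured reader would. */
    static String misdecode(String text, Charset wrongCharset) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, wrongCharset);
    }

    public static void main(String[] args) {
        // "K\u00e4se" is "Käse"; the escape keeps the source file independent of its own encoding
        String garbled = misdecode("K\u00e4se", StandardCharsets.ISO_8859_1);

        // "ä" is two bytes in UTF-8 (0xC3 0xA4); read one byte per character they become "Ã" and "¤"
        System.out.println(garbled.equals("K\u00c3\u00a4se")); // prints true
    }
}
```

Seeing "Ã¤" in place of "ä" is therefore a hint that UTF-8 bytes were decoded with a single-byte charset somewhere between the server and the place where the text is displayed.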

Final thoughts:

- Never rely on a website's declared encoding unless you have checked manually (when possible) which encoding is really in use.
- Never rely on Jsoup to somehow find the right encoding.
- You can [automate encoding guessing][2]. See the previous link for details.
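
As one possible starting point for checking the encoding, the sketch below reads the charset a server declares in its Content-Type header value. It is an illustration only: the class name, method name, and fallback parameter are assumptions, and the declared charset can still differ from the bytes actually sent.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class DeclaredCharset {

    /** Returns the charset named in a Content-Type header value, or the fallback if none is usable. */
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback; // header missing
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).replace("\"", "").trim();
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return fallback; // malformed or unsupported charset name
                }
            }
        }
        return fallback; // no charset parameter present
    }

    public static void main(String[] args) {
        System.out.println(fromContentType("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8));
        System.out.println(fromContentType("text/html", StandardCharsets.UTF_8));
    }
}
```

With the connection code from the options above, the header value would come from connection.getContentType(), and the returned charset could be passed to InputStreamReader in place of the hard-coded UTF-8.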

edited Jun 28 '16 at 21:11

answered Jun 27 '16 at 08:52


I tried both options, and both led to the following value for the html variable: KÃ¤se - Semantic Scholar. After applying Jsoup.parse(html), it is the same value as in the description: KÃ¤se - Semantic Scholar – tschens Jun 27 '16 at 17:30
To clarify: I am running this code in IntelliJ, and there it does not work. But when I package it as a JAR file and run it from the Windows command line, it works... – tschens Jun 27 '16 at 17:34
@tschens When running in IntelliJ, check the encoding of the JVM launched by IntelliJ. – Stephan Jun 27 '16 at 17:45
Man, I love you, thanks for this answer :D I changed the encoding to windows-1252, but can you explain why this is not possible with UTF-8? If I type the same words in the browser, the URL works and the page shows up... – tschens Jun 27 '16 at 18:16
@tschens What was the encoding used by IntelliJ for the JVM it launched? – Stephan Jun 27 '16 at 21:25
It was set to UTF-8, so I changed it to windows-1252. But meanwhile another issue occurred, which is very embarrassing for me: I am also using Jsoup to parse content of the AuthorMapper webpage (all of this is for my bachelor thesis). The code used to retrieve the Document from AuthorMapper is the same as in your answer. The encoding of IntelliJ is set to windows-1252, but then I get a strange encoding for the same word Käse when it is used in the URL for AuthorMapper... If I set the JVM encoding to UTF-8, it works... Does this depend on the website? – tschens Jun 27 '16 at 22:29
@tschens Well, a website can always provide an HTML page encoded in UTF-8 and still use characters outside this encoding. So, don't rely on Jsoup to find the right encoding; as of 1.9.2 this feature is still terrible. Don't rely on the websites either. Check each website's encoding manually (when possible) or try to automate the encoding guessing. See my final thoughts for details. – Stephan Jun 28 '16 at 21:14
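
The JVM encoding discussed in the comments above can be inspected from inside the program. A minimal sketch, assuming nothing beyond the JDK; the class and method names are made up for illustration:

```java
import java.nio.charset.Charset;

public class DefaultCharsetCheck {

    /** Describes the charset this JVM uses when no charset is passed explicitly. */
    static String describe() {
        return "default charset = " + Charset.defaultCharset().name()
                + ", file.encoding = " + System.getProperty("file.encoding");
    }

    public static void main(String[] args) {
        // Run once from the IDE and once from the command line, then compare the two lines
        System.out.println(describe());
    }
}
```

If the two environments report different values, any code path that omits an explicit charset will behave differently between them. Passing a VM option such as -Dfile.encoding=UTF-8 is a common way to change the value, though how that option is treated can differ between JDK versions.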
