I extract some information from the HTML source code of different pages with jsoup. Most of them are UTF-8 encoded. One of them is encoded with ISO-8859-1, which leads to a strange error (in my opinion).

The page that contains the error is: http://www.gudi.ch/armbanduhr-metall-wasserdicht-1280x960-megapixels-p-560.html

I read the needed String with the following piece of code:

Document doc = Jsoup.connect("http://www.gudi.ch/armbanduhr-metall-wasserdicht-1280x960-megapixels-p-560.html").userAgent("Mozilla").get();
String title = doc.getElementsByClass("products_name").first().text();

The problem is the hyphen in the string "HD Armbanduhr aus Metall 4GB Wasserdicht 1280X960 – 5 Megapixels". Normal umlauts like öäü are read correctly. Only this single character, which is not output as "&#45;", causes the problem.

I tried to override the (correctly set) page encoding with out.outputSettings().charset("ISO-8859-1"), but that didn't help either.

Next, I tried to change the encoding of the string manually with the Charset class, converting between UTF-8 and ISO-8859-1 in both directions. Also no luck.

Does anyone have a tip on what I can try to get the correct character after parsing the HTML document with jsoup?

Thanks

BalusC
Pascal Mathys
  • Hmm, it's an en dash (e2 80 93 in UTF-8), which under UTF-8 should be a valid character (I think). Is it possible that once it's read in as 8859-1 it's not possible to convert it back? Can you force-read it in as UTF-8? – Dave Newton Oct 10 '11 at 15:35
  • Yes, I can force it with out.outputSettings().charset("UTF-8"), but that doesn't really help. When I print the character codes, the result is char code 150, which should be valid according to this page: http://www.web-source.net/symbols.htm. With this, I realized that the character is not a hyphen or dash, which would be 45. Char code 150 is within the extended ASCII range. – Pascal Mathys Oct 10 '11 at 15:55

1 Answer

This is a mistake of the website itself. There are actually three mistakes:

  1. The page is served without any charset in the HTTP Content-Type response header. There's ISO-8859-1 in the HTML meta tag, but this is ignored when the page is served over HTTP! The average web browser will either try smart detection or use the platform default encoding to decode the webpage, which is CP1252 on Windows machines.

  2. The <meta> tag claims that the content is ISO-8859-1 encoded, but the actual character (U+2013 EN DASH) is not covered by that charset at all. It is, however, covered by the CP1252 charset as the byte 0x96.

  3. According to the webpage source code, the product name uses the literal character instead of the HTML entity &ndash; as spotted elsewhere on the same webpage.
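
The difference described in point 2 can be checked with the JDK's standard `Charset` API. The following is only a minimal sketch: the class name `DashDecoding` and the helper `decode` are made up for illustration, and it assumes the running JDK ships the `windows-1252` charset (usual, but not guaranteed by the Java specification). It decodes the single byte 0x96 (decimal 150, the char code mentioned in the comments on the question) with both charsets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of jsoup or the original answer.
class DashDecoding {
    // Decodes one raw byte with the given charset and returns the resulting code point.
    public static int decode(int rawByte, Charset charset) {
        String decoded = new String(new byte[] {(byte) rawByte}, charset);
        return decoded.codePointAt(0);
    }

    public static void main(String[] args) {
        // Assumption: windows-1252 (alias CP1252) is available in this JDK.
        Charset cp1252 = Charset.forName("windows-1252");
        // ISO-8859-1 maps 0x96 to a C1 control character, CP1252 maps it to the en dash.
        System.out.printf("ISO-8859-1: U+%04X%n", decode(0x96, StandardCharsets.ISO_8859_1));
        System.out.printf("CP1252:     U+%04X%n", decode(0x96, cp1252));
    }
}
```

Under ISO-8859-1 the byte becomes the invisible control character U+0096, while CP1252 yields U+2013, which is why the same bytes have to be decoded as CP1252 to get the dash back.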

Jsoup can fix many badly developed webpages transparently, but this one is really beyond what Jsoup can do on its own. You need to read the page in manually and then feed it to Jsoup as CP1252.

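String url = "http://www.gudi.ch/armbanduhr-metall-wasserdicht-1280x960-megapixels-p-560.html";
Document doc;
// Open the raw byte stream ourselves and close it when parsing is done.
try (InputStream input = new URL(url).openStream()) {
    // Force CP1252 decoding instead of the ISO-8859-1 declared in the <meta> tag.
    doc = Jsoup.parse(input, "CP1252", url);
}
String title = doc.select(".products_name").first().text();
// ...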
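
If a custom user agent is needed as well (the question sends `userAgent("Mozilla")`), one option is to fetch the bytes with plain `java.net.URLConnection` and decode them as CP1252 before parsing. This is a hedged sketch, not the answer's own method: the class `Cp1252Fetch` and its methods `decodeBody` and `fetch` are hypothetical names, and it again assumes the JDK provides `windows-1252`.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;

// Hypothetical helper class for illustration only.
class Cp1252Fetch {
    // Assumption: windows-1252 (alias CP1252) is available in this JDK.
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Decodes a raw response body as windows-1252, ignoring whatever the page declares.
    public static String decodeBody(byte[] body) {
        return new String(body, CP1252);
    }

    // Fetches the URL with a custom User-Agent header and returns the decoded HTML.
    public static String fetch(String url, String userAgent) throws IOException {
        URLConnection connection = new URL(url).openConnection();
        connection.setRequestProperty("User-Agent", userAgent);
        try (InputStream input = connection.getInputStream()) {
            return decodeBody(input.readAllBytes()); // readAllBytes needs Java 9+
        }
    }

    public static void main(String[] args) {
        // Offline check with a hand-built sample: 0x96 is the en dash in windows-1252.
        byte[] sample = {'9', '6', '0', ' ', (byte) 0x96, ' ', '5'};
        System.out.println(decodeBody(sample).equals("960 \u2013 5"));
    }
}
```

The string returned by `fetch(url, "Mozilla")` can then be handed to `Jsoup.parse(html, url)`, so the charset is already settled before Jsoup sees the markup.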
BalusC
  • It looks like browsers tend to show `0x96` as en-dash even if ISO-8859-1 is specified in `Content-Type` header. – axtavt Oct 10 '11 at 16:10
  • @axtavt: there's no charset in the content type header. The platform default charset will be used, which is CP1252 in Windows. See also point 1. – BalusC Oct 10 '11 at 16:12
  • Thanks for the clear explanation of this problem! With the manual encoding (which I tried the same way yesterday with ISO-8859-1), the content is correctly decoded. I will contact the website operator, hoping he can correct this by either switching the page to UTF-8 or setting the Content-Type header to ISO-8859-1. – Pascal Mathys Oct 10 '11 at 16:19
  • Not only that, the offending character must also be fixed. Depending on the source of the problem, it should be fixed by using UTF-8 to store data in the DB or by using `htmlentities()` to redisplay titles in HTML. It's a CP1252-specific character. Changing the content type charset to ISO-8859-1 or UTF-8 alone will fail, as this character won't be displayed as such at all then (which is exactly the problem you encountered yourself). – BalusC Oct 10 '11 at 16:21
  • What about the user agent? How can it be set in this case? – Luís Soares Dec 14 '15 at 17:07
  • @BalusC OK that solves it; thanks. Still, Jsoup does not offer that in a chainable way, if you need to define encoding (as in the original question) and user agent. – Luís Soares Dec 16 '15 at 17:46