I extract some information from the HTML source code of different pages with jsoup. Most of them are UTF-8 encoded. One of them is encoded with ISO-8859-1, which leads to a strange error (in my opinion).
The page that contains the error is: http://www.gudi.ch/armbanduhr-metall-wasserdicht-1280x960-megapixels-p-560.html
I read the needed String with the following piece of code:
Document doc = Jsoup.connect("http://www.gudi.ch/armbanduhr-metall-wasserdicht-1280x960-megapixels-p-560.html").userAgent("Mozilla").get();
String title = doc.getElementsByClass("products_name").first().text();
The problem is the dash in the String "HD Armbanduhr aus Metall 4GB Wasserdicht 1280X960 – 5 Megapixels". Normal umlauts like öäü are read correctly. Only this single character, which is not output as "& #45;", causes the problem.
I tried to override the (correctly set) page encoding with out.outputSettings().charset("ISO-8859-1"), but that didn't help either.
Next, I tried to change the encoding of the string manually with the Charset class, converting between UTF-8 and ISO-8859-1 in both directions. Also no luck.
Does anyone have a tip on what I can try to get the correct character after parsing the HTML document with jsoup?
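For illustration, a manual re-encoding of this kind can be sketched with the standard Charset classes. This is not the exact code from the attempt above: the class name ReencodeSketch, the helper reinterpret, and the sample byte 0x96 are assumptions made for the example. The sketch also assumes the dash arrives as a single byte that ISO-8859-1 maps to a control character, and that the "windows-1252" charset is available in the running JDK (it usually is, but it is not one of the charsets Java guarantees).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReencodeSketch {

    // Hypothetical helper: turn the text back into bytes using the charset it
    // was decoded with, then decode those bytes again with another charset.
    static String reinterpret(String text, Charset decodedWith, Charset decodeAs) {
        return new String(text.getBytes(decodedWith), decodeAs);
    }

    public static void main(String[] args) {
        // Assumption: the problematic dash is sent as the single byte 0x96.
        byte[] raw = {(byte) 0x96};

        // Decoding as ISO-8859-1 maps every byte to the code point of the same value.
        String asLatin1 = new String(raw, StandardCharsets.ISO_8859_1);

        // Re-interpreting the same byte as windows-1252 yields a different character.
        Charset cp1252 = Charset.forName("windows-1252");
        String asCp1252 = reinterpret(asLatin1, StandardCharsets.ISO_8859_1, cp1252);

        // Print the code points so invisible control characters become visible.
        System.out.printf("ISO-8859-1:   U+%04X%n", (int) asLatin1.charAt(0));
        System.out.printf("windows-1252: U+%04X%n", (int) asCp1252.charAt(0));
    }
}
```

Printing code points instead of the characters themselves makes it easier to see which character actually ended up in the String, independent of the console encoding.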
Thanks