
We are using jsoup, and it is excellent, thanks.

We may get HTML files with no http-equiv meta tag, and the charset may be something other than UTF-8. What is the best way to handle this, please? We could keep a list of encodings and try each one, but I am not sure how to tell programmatically whether something has gone wrong. Would jsoup throw an IOException?

1 Answer


jsoup tries to determine the encoding from the Content-Type header or the http-equiv meta tag; if neither is present, it falls back to UTF-8. I am not sure jsoup can do more for you here.

But you can try another approach:

Implement a class that reads the files for you and takes care of all encoding issues. That class should give you a properly decoded string, or at least the encoding used for your input.

(html input) --> [encoding class] --normalized encoding--> [jsoup] --> (whatever)   

Jsoup can now parse that input with a known encoding.
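
One way such an encoding class could look is sketched below, using only the JDK. The names `EncodingSniffer`, `tryDecode`, `decodeFirstMatch` and `Result` are made up for illustration and are not part of jsoup. The idea is that `new String(bytes, charset)` silently replaces bad bytes with U+FFFD, whereas a `CharsetDecoder` configured with `CodingErrorAction.REPORT` throws a `CharacterCodingException`, which gives you a programmatic signal that a candidate encoding does not fit. Treat this as a rough sketch, not a complete detector: a successful decode only proves the bytes are valid in that charset, not that it is the right one.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Hypothetical helper (not part of jsoup): tries candidate charsets in order. */
final class EncodingSniffer {

    /** The first charset that decoded cleanly, plus the decoded text. */
    static final class Result {
        final Charset charset;
        final String text;

        Result(Charset charset, String text) {
            this.charset = charset;
            this.text = text;
        }
    }

    /** Strictly decodes the bytes; returns empty if they are invalid in this charset. */
    static Optional<String> tryDecode(byte[] bytes, Charset charset) {
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return Optional.of(decoder.decode(ByteBuffer.wrap(bytes)).toString());
        } catch (CharacterCodingException e) {
            return Optional.empty(); // the bytes do not fit this charset
        }
    }

    /** Returns the first candidate that decodes without errors, if any. */
    static Optional<Result> decodeFirstMatch(byte[] bytes, List<Charset> candidates) {
        for (Charset candidate : candidates) {
            Optional<String> text = tryDecode(bytes, candidate);
            if (text.isPresent()) {
                return Optional.of(new Result(candidate, text.get()));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // 0xE9 is "e with acute" in ISO-8859-1 but an incomplete sequence in UTF-8.
        byte[] html = {'<', 'p', '>', (byte) 0xE9, '<', '/', 'p', '>'};
        // Put strict multi-byte charsets first; ISO-8859-1 accepts every byte
        // sequence, so it only makes sense as the last resort.
        List<Charset> candidates =
                Arrays.asList(StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        decodeFirstMatch(html, candidates)
                .ifPresent(r -> System.out.println(r.charset + ": " + r.text));
    }
}
```

The decoded `text` can then be handed to `Jsoup.parse(String html)`, so jsoup never has to guess. The order of the candidate list matters and is an assumption you have to make based on where your files come from.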

I guess changing whatever generates the HTML is not possible, is it?

Some further readings:

ollo
  • Thanks. Reading http://www.joelonsoftware.com/articles/Unicode.html has also helped. – user3319710 Mar 11 '14 at 10:11
  • also helped. Now I understand the difference between charset and encoding better. I have all settings set to UTF-8. I put some Chinese – user3319710 Mar 11 '14 at 11:33
  • Sorry for the garble. Googling brings up the suggestion that jsoup does not support Chinese and Japanese character sets. Is that the case, please? Everything else seems to work fine under UTF-8. – user3319710 Mar 11 '14 at 11:35
  • I have not tested jsoup with Asian character sets, but does it also fail if you use Java's own methods to handle the strings and then parse the result with jsoup? – ollo Mar 11 '14 at 13:41