4

I am using the HtmlCleaner library to parse/convert HTML files in Java.

It seems that it is not able to handle Spanish characters like 'ÁáÉéÍíÑñÓóÚúÜü'.

Is there any property which I can set in HtmlCleaner for handling this or any other solution? Here's the code I'm using to invoke it:

    CleanerProperties props = new CleanerProperties();
    props.setRecognizeUnicodeChars(true);
    java.io.File file = new java.io.File("C:\\example.html");
    TagNode tagNode = new HtmlCleaner(props).clean(file);
Rup
choop

2 Answers

2

HtmlCleaner uses the JVM's default character set unless you specify one. On Windows this will typically be Cp1252 (windows-1252), not UTF-8, which is probably where it's going wrong.

You can either

  • specify -Dfile.encoding=UTF-8 on your JVM start line
  • use the HtmlCleaner.clean() overload that accepts a character set

    TagNode tagNode = new HtmlCleaner(props).clean(file, "UTF-8");
    

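    (this overload takes the charset name as a String, so if you've got Google Guava in the project you can use Charsets.UTF_8.name() for the constant; on Java 7 and later, StandardCharsets.UTF_8.name() does the same without Guava)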

  • use the HtmlCleaner.clean() overload that accepts an InputStreamReader which you've already constructed with the correct character set.
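To see why the charset matters, here is a minimal sketch that uses only the JDK, with no HtmlCleaner on the classpath. It decodes the same UTF-8 bytes through an InputStreamReader twice: once as UTF-8 and once as windows-1252. The class name `CharsetDemo`, the `decode` helper and the sample word are made up for illustration, and an in-memory byte array stands in for the HTML file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetDemo {
    // Hypothetical helper: reads all characters from the bytes through an
    // InputStreamReader constructed with an explicit charset.
    static String decode(byte[] bytes, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Unicode escapes keep this sample independent of the source file's own encoding.
        String original = "\u00D1and\u00FA";
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);

        // Decoding with the charset the bytes were written in restores the text.
        String right = decode(utf8Bytes, StandardCharsets.UTF_8);
        // Decoding the same bytes as windows-1252 turns each accented letter into two characters.
        String wrong = decode(utf8Bytes, Charset.forName("windows-1252"));

        System.out.println(right.equals(original)); // true
        System.out.println(wrong.equals(original)); // false
        System.out.println(wrong.length());         // 7 (the original has 5 characters)
    }
}
```

With a real file you would wrap a FileInputStream in the InputStreamReader instead of the byte array and hand that reader to HtmlCleaner, as described in the last bullet above.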
Rup
0

You can change UTF-8 to UTF-16.

It will support the maximum number of characters.

Matthias
Azhar
  • But they're just encodings - that won't change the number of characters that's supported. This might help if HtmlCleaner is reading the file with the wrong encoding and UTF-16 is generated with a BOM that it detects correctly, but I doubt it would. – Rup Apr 25 '12 at 11:58
  • @Azhar can you explain, in your own words, why you think UTF-16 has more characters than UTF-8, and where you got the idea from? – Mr Lister Apr 25 '12 at 15:00
  • @MrLister When I started coding my first HTML, I had an issue with supported characters, so I had my senior help me out :). He told me UTF-16 will support more characters. Correct me if my senior or I am wrong. – Azhar Sep 15 '15 at 13:32
  • @Azhar Your senior was wrong. UTF-8 supports exactly the same character set as UTF-16: all Unicode code points from U+0000 to U+10FFFF. (An early draft of UTF-8 provided for even more characters, but they scrapped that idea in favour of compatibility with the UTF-16 range.) – Mr Lister Sep 15 '15 at 14:04
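
The point made in the comments above can be checked with a short sketch using only the JDK: the same strings survive a round trip through both UTF-8 and UTF-16, including the highest Unicode code point. The class name `EncodingRoundTrip` and the `roundTrips` helper are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingRoundTrip {
    // Hypothetical helper: encodes the text with the charset, decodes it again,
    // and reports whether the result equals the input.
    static boolean roundTrips(String text, Charset charset) {
        byte[] encoded = text.getBytes(charset);
        return new String(encoded, charset).equals(text);
    }

    public static void main(String[] args) {
        // A few of the Spanish letters from the question, written as escapes
        // so the sample does not depend on the source file's encoding.
        String spanish = "\u00C1\u00E1\u00D1\u00F1\u00DC\u00FC";
        // The highest Unicode code point, U+10FFFF, as a Java string.
        String highest = new String(Character.toChars(0x10FFFF));

        System.out.println(roundTrips(spanish, StandardCharsets.UTF_8));  // true
        System.out.println(roundTrips(spanish, StandardCharsets.UTF_16)); // true
        System.out.println(roundTrips(highest, StandardCharsets.UTF_8));  // true
        System.out.println(roundTrips(highest, StandardCharsets.UTF_16)); // true
    }
}
```

Both encodings cover the same characters; what matters is that the reader uses the encoding the file was actually written in.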