I'm using HTMLCleaner to clean an HTML file which has characters like '€' (ascii decimal 128), '™' (ascii decimal 153), etc. That is, chars from the extended ASCII table.

HTMLCleaner cannot handle those chars and replaces them with the character '?' (ascii decimal 63).

Is there any flag I can set in HTMLCleaner in order to process those chars?

Thanks in advance.

EDIT: The variable "encoding" is "iso-8859-1", just like the source file encoding.

    try {
        System.out.print("Parsing and cleaning:" + fileStr);
        URL url = new File(this.fileStr).toURI().toURL();
        // create an instance of HtmlCleaner
        HtmlCleaner cleaner = new HtmlCleaner();
        // default properties
        CleanerProperties props = cleaner.getProperties();
        // do parsing
        TagNode tagNode = new HtmlCleaner(props).clean(url);
        // serialize to XML file
        new PrettyXmlSerializer(props).writeToFile(tagNode, fileStr,
                encoding);
        System.out.println("Output: " + fileStr);
    } catch (MalformedURLException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }

I've just figured this out. The line:

    TagNode tagNode = new HtmlCleaner(props).clean(url);

should be replaced by:

    TagNode tagNode = new HtmlCleaner(props).clean(url, encoding);

where 'encoding' is the string name of the charset of the source URL.
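
To illustrate why the charset used for reading matters, here is a minimal sketch using only the JDK, not HtmlCleaner itself. It assumes the parser was decoding the file as UTF-8 when no encoding was passed, which is a guess on my part and not something confirmed above. The class `CharsetRoundTrip` and the helper `roundTrip` are hypothetical names made up for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetRoundTrip {

    // Decode raw bytes with the given charset, then re-encode the text as
    // ISO-8859-1. This mimics "read the file, then serialize it" in plain
    // JDK terms; it is not what HtmlCleaner does internally.
    public static byte[] roundTrip(byte[] input, Charset decodeWith) {
        String text = new String(input, decodeWith);
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Byte 128 (0x80), one of the values mentioned in the question.
        byte[] source = { (byte) 0x80 };

        // Assumed failure mode: a lone 0x80 is malformed UTF-8, so decoding
        // yields U+FFFD, which ISO-8859-1 cannot represent and writes as '?'.
        byte[] wrong = roundTrip(source, StandardCharsets.UTF_8);

        // Decoding with the file's real charset preserves the byte.
        byte[] right = roundTrip(source, StandardCharsets.ISO_8859_1);

        System.out.println("decoded as UTF-8:      " + (wrong[0] & 0xFF)); // 63
        System.out.println("decoded as ISO-8859-1: " + (right[0] & 0xFF)); // 128
    }
}
```

The point is that the damage happens while reading: once a byte has been decoded with the wrong charset, writing the output with the right one cannot recover it, so the encoding has to be given to both `clean` and `writeToFile`.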

Thank you!

anahnarciso
  • Possible duplicate: http://stackoverflow.com/questions/10299651/htmlcleaner-handle-spanish-characters – erikxiv May 16 '12 at 16:54
  • Yes, it was a similar problem, I checked that question but I didn't realize that it was an encoding problem. Thank you, you really helped me. – anahnarciso May 16 '12 at 17:19

1 Answer

Did you try setting the charset?

Has QUIT--Anony-Mousse
  • Yes, as you can see above. The generated HTML file has the same encoding as the source file. – anahnarciso May 16 '12 at 17:03
  • I'd only set the charset in: `new PrettyXmlSerializer(props).writeToFile(tagNode, fileStr, encoding);` but I missed the `TagNode tagNode = new HtmlCleaner(props).clean(url, encoding);`. Now it works, thank you. – anahnarciso May 16 '12 at 17:16