JTidy not handling some characters correctly

Question

Certain characters get mangled after I call Tidy.parse. Two examples are: ’ instead of ' and ∼ instead of ~

I'm guessing these came from Word or something similar, but Tidy handles them very badly. Specifically, it converts them to individual entity representations, which then get turned into meaningless junk later in my process. I'm sure there are others, but these are the ones I have found so far. Is there a known way to convert these characters beforehand, or to have Tidy leave them alone?

        Tidy tidy = new Tidy();
        tidy.setXHTML(true);
        tidy.setForceOutput(true);
        tidy.parse(inputStream, outputStream);

an answer to another bork replacement question shows how to see your current configuration for more clues .. https://stackoverflow.com/a/2608969/11287237 — ocæon, Apr 16 '19 at 15:38

score 1 · Accepted Answer · answered Apr 16 '19 at 15:49

After printing out the config, I could see that the input and output encodings were not set to UTF-8 as I had thought, so I just had to add this:

        tidy.setInputEncoding("UTF-8");
        tidy.setOutputEncoding("UTF-8");
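
To illustrate why the encoding setting matters, here is a minimal sketch that uses only the Java standard library, not JTidy itself. It assumes the input bytes are UTF-8 and that the reader decodes them with a single-byte Latin charset (ISO-8859-1 is used here as an assumed stand-in for whatever the Tidy configuration was actually set to). The class and method names are hypothetical and exist only for this demonstration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of JTidy.
class EncodingMismatchDemo {

    // Encode text with one charset, then decode the resulting bytes with another.
    static String reDecode(String text, Charset writtenAs, Charset readAs) {
        byte[] bytes = text.getBytes(writtenAs);
        return new String(bytes, readAs);
    }

    // Render each char as U+XXXX so the output does not depend on console encoding.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String quote = "\u2019"; // right single quotation mark, as typed by Word

        // Assumed mismatch: UTF-8 bytes read as ISO-8859-1.
        String wrong = reDecode(quote, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        // Matching charsets on both sides.
        String right = reDecode(quote, StandardCharsets.UTF_8, StandardCharsets.UTF_8);

        System.out.println(codePoints(wrong)); // U+00E2 U+0080 U+0099
        System.out.println(codePoints(right)); // U+2019
    }
}
```

With mismatched charsets, the single quote character becomes three unrelated characters, which is consistent with the mangled output described in the question. Once both sides agree on UTF-8, the character survives intact.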

ArcticDoom