
Certain characters get mangled after I call Tidy.parse. Two examples are: ’ instead of ' and ∼ instead of ~

I'm guessing these came from Word or something similar, but Tidy handles them very badly. Specifically, it converts them to individual entity representations for the diacritics, which then get converted to meaningless junk later in my process. I'm sure there are others, but these are the ones I have found so far. Is there a known way to convert these characters beforehand, or to have Tidy leave them alone?

        Tidy tidy = new Tidy();
        tidy.setXHTML(true);
        tidy.setForceOutput(true);
        tidy.parse(inputStream, outputStream);
ArcticDoom
  • An answer to another bork replacement question shows how to print your current configuration for more clues: https://stackoverflow.com/a/2608969/11287237 – ocæon Apr 16 '19 at 15:38
  • Thanks, that will help me at least see how it was set up. – ArcticDoom Apr 16 '19 at 15:44

1 Answer


After printing out the config, I could see that the input and output encodings were not set to UTF-8 as I had thought, so I just had to add this:

tidy.setInputEncoding("UTF-8");
tidy.setOutputEncoding("UTF-8");
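
For anyone wondering why a missing encoding setting produces this kind of junk, here is a minimal sketch using only the JDK, with no Tidy involved. It assumes the input was UTF-8 and was being read with a single-byte encoding; ISO-8859-1 is used purely for illustration, and the class and method names (EncodingMismatchDemo, decodeAs) are made up for this example. The actual default encoding depends on your Tidy configuration, which is why printing the config helps.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingMismatchDemo {

    // Decode raw bytes with the given charset, as a parser does when it reads a stream.
    static String decodeAs(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // U+2019 (right single quotation mark), written as an escape so the
        // example does not depend on the encoding of this source file.
        String original = "\u2019";

        // In UTF-8 this one character is three bytes: E2 80 99.
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);

        // A single-byte charset turns each byte into its own character,
        // so one quote becomes three unrelated characters.
        String wrong = decodeAs(utf8Bytes, StandardCharsets.ISO_8859_1);

        // Decoding with the charset that matches the bytes restores the original.
        String right = decodeAs(utf8Bytes, StandardCharsets.UTF_8);

        System.out.println("UTF-8 byte count: " + utf8Bytes.length);
        System.out.println("chars when read as ISO-8859-1: " + wrong.length());
        System.out.println("round-trips when read as UTF-8: " + right.equals(original));
    }
}
```

Once a multi-byte character has been split like this, each piece can end up escaped as its own entity, which would match the symptom in the question. Setting both the input and output encodings keeps each byte sequence together as one character.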
ArcticDoom