I have a (large) body of text that I'm working to try and convert from it's originally web-friendly format, to something 'slightly' more restrictive (epub -- and some readers are VERY picky about the HTML they take in).
HTML purifier is working wonderfully for one class of issues, which I'll call 'bad coding'. Things like missing closed parenthesis (which is technically legal HTML) and other annoyances that a browser automatically works around.
Where the HTML purifier is not working great is when it runs into an encoding issue. Many of the characters were saved in a Ӓ format, which (apparently?) HTML purifier doesn't care for. Maybe I just need to configure it better. Another issue is the bane of my existence: curly quotes, em-dashes, and the like. I've managed to do a mass search-and-replace on a number of those issues, but what concerns me is that I may have missed a character somewhere (as brought home by running into a case of deja vu spelled with the accent and grave marks included).
Is there any way to get HTML purifier to tell me that there was an issue with such characters, rather than silently stripping them? I'm trying to look through the code, but the software is very much designed for a different use-case scenario ('silently' handling user input, rather than a programmer doing mass-conversions on text bodies), and I'm just not seeing the data I'm looking for.