Can I use HTML purifier to find encoding issues instead of just stripping them?

Question

I have a (large) body of text that I'm trying to convert from its original web-friendly format to something 'slightly' more restrictive (epub -- and some readers are VERY picky about the HTML they take in).

HTML purifier is working wonderfully for one class of issues, which I'll call 'bad coding'. Things like missing closed parenthesis (which is technically legal HTML) and other annoyances that a browser automatically works around.

Where the HTML purifier is not working great is when it runs into an encoding issue. Many of the characters were saved in a &#1234; format (numeric character references), which (apparently?) HTML purifier doesn't care for. Maybe I just need to configure it better. Another issue is the bane of my existence: curly quotes, em-dashes, and the like. I've managed to do a mass search-and-replace on a number of those issues, but what concerns me is that I may have missed a character somewhere (as brought home by running into a case of deja vu spelled with the accent and grave marks included).

Is there any way to get HTML purifier to tell me that there was an issue with such characters, rather than silently stripping them? I'm trying to look through the code, but the software is very much designed for a different use-case scenario ('silently' handling user input, rather than a programmer doing mass-conversions on text bodies), and I'm just not seeing the data I'm looking for.
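
For illustration, here is a minimal PHP sketch of the kind of report described above. It does not use HTML Purifier at all: findSuspectCharacters and codePointOf are hypothetical helper names, and the definition of 'suspect' (anything other than tab and printable ASCII, plus numeric character references) is an assumption that would need adjusting to the text at hand. The idea is to run something like this over the text before any purifying step, so that nothing is silently dropped.

```php
<?php
declare(strict_types=1);

/** Decode one already-validated UTF-8 character to its code point. */
function codePointOf(string $char): int
{
    $bytes = array_values(unpack('C*', $char));
    $count = count($bytes);
    if ($count === 1) {
        return $bytes[0];
    }
    // The lead byte of an n-byte sequence carries (7 - n) payload bits.
    $codePoint = $bytes[0] & (0xFF >> ($count + 1));
    for ($i = 1; $i < $count; $i++) {
        $codePoint = ($codePoint << 6) | ($bytes[$i] & 0x3F);
    }
    return $codePoint;
}

/**
 * Hypothetical helper: report suspect characters instead of stripping them.
 * Returns a list of ['line' => int, 'type' => string, 'detail' => string].
 */
function findSuspectCharacters(string $text): array
{
    $findings = [];
    // Split on raw bytes so that invalid UTF-8 cannot make the split fail.
    $lines = preg_split('/\r\n|\r|\n/', $text);
    foreach ($lines as $index => $line) {
        $lineNo = $index + 1;
        // An empty pattern with /u only matches if the subject is valid UTF-8.
        if (preg_match('//u', $line) !== 1) {
            $findings[] = ['line' => $lineNo, 'type' => 'invalid-utf8', 'detail' => bin2hex($line)];
            continue;
        }
        // Anything other than tab and printable ASCII (assumed definition of "suspect").
        preg_match_all('/[^\t\x20-\x7E]/u', $line, $chars);
        foreach ($chars[0] as $char) {
            $findings[] = [
                'line' => $lineNo,
                'type' => 'unexpected-character',
                'detail' => sprintf('U+%04X', codePointOf($char)),
            ];
        }
        // Decimal or hexadecimal numeric character references such as &#1234; or &#x4D2;
        preg_match_all('/&#(?:[xX][0-9a-fA-F]+|[0-9]+);/', $line, $refs);
        foreach ($refs[0] as $ref) {
            $findings[] = ['line' => $lineNo, 'type' => 'numeric-reference', 'detail' => $ref];
        }
    }
    return $findings;
}

// Usage: curly quotes and accented letters on line 1, a numeric reference on line 3.
$sample = "He said \u{201C}d\u{00E9}j\u{00E0} vu\u{201D}\nplain line\nentity: &#1234;";
foreach (findSuspectCharacters($sample) as $finding) {
    printf("line %d: %-22s %s\n", $finding['line'], $finding['type'], $finding['detail']);
}
```

Each finding carries a line number, so every flagged character can be looked up and fixed by hand rather than replaced automatically.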

score 0 · Answer 1 · edited Apr 25 '23 at 13:27

I think the function mysql_real_escape_string($text) can be used for your problem:

$text="It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy. Various versions have evolved over the years, sometimes by accident, sometimes on purpose (injected humour and the like).";

$main = mysql_real_escape_string($text);

edited Apr 25 '23 at 13:27 by General Grievance

answered Apr 26 '13 at 06:15 by Jitesh

The mysql_* functions are deprecated; see http://stackoverflow.com/questions/12859942/why-shouldnt-i-use-mysql-functions-in-php/14110189#14110189 – NullPoiиteя Apr 26 '13 at 06:19
Not only are they deprecated, but I don't want a system to simply *decide* for me what the appropriate replacement is. I want to know there's an issue and fix it myself. – RonLugge Apr 26 '13 at 06:23
@jitesh your comment appears to have been truncated... all I'm seeing is a link to the reference page that proves NullPointer's point that the function is deprecated and should be avoided. Additionally, it's based on the character set -- since the character set itself is the issue, that makes it a complete no-go. – RonLugge Apr 26 '13 at 16:20
