2

I am having a problem with the Â character on my website.

I have a website where users can use a WYSIWYG editor (CKEditor) to fill out their profile. The content is run through HTML Purifier before being put into a database (for security reasons).

The database has all tables set up with the UTF-8 charset. I also call 'SET NAMES utf-8' at the beginning of script execution to prevent problems (which has worked for years, as I haven't had this problem in a long time). The webpage the text is displayed on declares a content-type of utf-8, and I also use the header() function to set the content-type and charset.

When displaying the text, all seemed fine until I tried running a regular expression on the content. html_entity_decode (called with the encoding param of 'utf-8') is removing or not showing the Â character for some reason, and it leaves something behind that causes all of my regexes to fail (it seems there is a character there, but I cannot see it in the source).

How can I prevent and/or remove this character so I can run the regular expression?

EDIT: I have decided to abandon ckeditor and go with the markdown format like this site uses to have more flexibility. I have hated wysiwyg editors for as long as I remember. Updating all the profiles to the new format will give me a chance to remove all of the offending text and give the site a clean start. Thanks for all the input.

animuson
kkeith29
  • what is your regular expression doing? – David Nguyen Apr 12 '12 at 17:29
  • It is removing empty paragraph tags. For some reason users like to add extra lines when they edit, which makes the website look horrible. It should remove paragraph tags that contain only whitespace and/or an &nbsp; entity. Example: http://dev.lovewichita.org/church/profile/25.html – kkeith29 Apr 12 '12 at 17:32
  • +1 for helping the church out – ANisus Apr 12 '12 at 18:39
  • Could you add the failing regexp? Then I can try to recreate the problem locally – ANisus Apr 12 '12 at 18:40
  • The regex is: `'#<p>([\s\r\n]*)(&nbsp;)?([\s\r\n]*)</p>#'`. I threw it together pretty quickly, so I know there is a better way to write it. I used to be good at the syntax, but it seems my memory is fading. – kkeith29 Apr 12 '12 at 19:08

2 Answers

1

You are probably facing a situation where the string is not actually properly UTF-8 encoded (you wrote that it is, but it ain't). html_entity_decode might then remove any invalid UTF-8 byte sequences (e.g. a single-byte-charset encoding of Â) or replace them with a substitution character.
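One way to test that suspicion is to ask the mbstring extension whether the bytes form valid UTF-8. This is only a minimal sketch: the sample strings are made up for illustration, and it assumes mbstring is installed.

```php
<?php
// Hypothetical sample: "é" stored as the single ISO-8859-1 byte E9,
// which is not a valid UTF-8 sequence on its own.
$latin1 = "caf\xE9";

// The same word in UTF-8, where "é" is the two bytes C3 A9.
$utf8 = "caf\xC3\xA9";

// mb_check_encoding() returns true only if the bytes are valid in the given encoding.
$latin1IsValid = mb_check_encoding($latin1, 'UTF-8');
$utf8IsValid   = mb_check_encoding($utf8, 'UTF-8');

var_dump($latin1IsValid); // bool(false)
var_dump($utf8IsValid);   // bool(true)
```

Note that a string can pass this check and still be wrong, for example when text was converted twice, so it narrows the problem down rather than settling it.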

Depending on the PHP version you're using, you have more control over how this is handled through the flags parameter (for example ENT_SUBSTITUTE, added in PHP 5.4).

Additionally, to find the character you can't see, create a hexdump of the string.
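A hexdump can be produced with bin2hex(). The sketch below uses a made-up string containing a UTF-8 non-breaking space (U+00A0), which is what &nbsp; decodes to; the variable names are illustrative only.

```php
<?php
// Hypothetical input: "a", a UTF-8 non-breaking space (bytes C2 A0), then "b".
$text = "a\xC2\xA0b";

// bin2hex() returns two hex digits per byte; chunk_split() appends a space
// after each pair, and trim() drops the trailing one.
$hex = trim(chunk_split(bin2hex($text), 2, ' '));

echo $hex, "\n"; // 61 c2 a0 62
```

If the dump shows the pair c2 a0, the invisible character is a non-breaking space. Read as ISO-8859-1, the byte C2 is Â and A0 is a non-breaking space, which would match a stray Â followed by what looks like a space.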

hakre
  • I copied and pasted from the older version of the website. Would the text not get converted to a format readable under the UTF-8 charset? – kkeith29 Apr 12 '12 at 21:03
  • @kkeith29: That depends. Using UTF-8 does not mean that magically everything works now, it's just a character encoding. I think it's most informative if you add the code you've got problems with to your question and the hexdump of the string you run into problems with. – hakre Apr 12 '12 at 21:46
  • The code that produces the text is spread throughout the framework (form class, controllers, models, and helpers), so it is hard to post here. Thank you for mentioning the hexdump; it made me do a lot of research into how that would help, and it greatly expanded my knowledge of how data is turned into text and how charsets play into that. Thanks to you, I confirmed it is a charset problem with that text (a space is the culprit; it is being displayed as two characters, Â and a space, due to multi-byte stuff from what I understand). – kkeith29 Apr 13 '12 at 05:53
  • It's actually kind of sad that, after 7 years, it took me until now to take the time to research that and understand it better. – kkeith29 Apr 13 '12 at 05:55
1

Since the character you are talking about exists in the ANSI (ISO-8859-1) charset, you can do this:

utf8_encode(preg_replace($match, $replace, utf8_decode($utf8_text)));

This will, however, destroy any Unicode character that does not exist in the ANSI charset (utf8_decode turns it into a question mark). To avoid this, you can try mb_ereg_replace, which has multibyte (Unicode) support:

string mb_ereg_replace ( string $pattern , string $replacement , string $string [, string $option = "msr" ] )
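As a rough illustration of how that call might be used for the empty-paragraph cleanup described in the comments, here is a sketch. The sample HTML and the pattern are assumptions made for this example, not the asker's actual code, and it requires the mbstring extension with regex support.

```php
<?php
// Tell the mb_ereg_* functions to treat patterns and subjects as UTF-8.
mb_regex_encoding('UTF-8');

// Hypothetical sample; \xC2\xA0 is U+00A0 (a decoded &nbsp;) in UTF-8.
$html = "<p>Hello</p><p> \xC2\xA0 </p><p>World</p>";

// Unlike preg_*, mb_ereg_* patterns are written without delimiters.
// The non-breaking space is listed explicitly next to \s so the match
// does not depend on how \s treats U+00A0.
$pattern = "<p>(\\s|\xC2\xA0)*</p>";

// Remove every paragraph that holds only whitespace and/or non-breaking spaces.
$clean = mb_ereg_replace($pattern, '', $html);

echo $clean, "\n"; // <p>Hello</p><p>World</p>
```

Because the subject is handled as UTF-8 throughout, characters outside ISO-8859-1 elsewhere in the text are left intact, unlike with the utf8_decode round trip.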

ANisus