
If I didn't cut my hair so short, I would have pulled it all out already because of this problem lol! Any help is greatly appreciated, really, I'm going crazy because of this!!

So I have a string of data coming from a latin1 table (not my choice) in a MySQL database, which looks like this:

 Hi! I'm a string of text .

That symbol at the end is a Unicode emoji character, U+1F61C (it's a wacky-looking smiley face). I couldn't figure out how to show it properly here in this question, but anyway, when I output the string to a browser in an HTML document (encoded as UTF-8) I'm able to see it just fine.

<html>
  <head>
    <meta charset='utf-8'>
  </head>
    <body>
      <?php echo $text; ?> <!-- outputs the string with the emoji showing correctly -->
    </body>
</html>

My basic problem is that I'm trying to remove this emoji symbol from the $text string. Or rather, I'm trying to remove any non-punctuation and non-alphanumeric characters from strings that I'm getting out of the database (my program just needs to grab the normal conversation text, and nothing else frilly).

Well, I figured I'd start by trying to remove just the emoji characters, so I looked around stackoverflow and found this example. Unfortunately, it doesn't work --- the emoticon doesn't get removed at all, and the string just stays the same.

// Outputs the original string
echo preg_replace( '/[\x{1F600}-\x{1F64F}]/u', '', $text );

I then figured, why not try to just remove all the non-punctuation and non-alphanumeric characters like I wanted to in the first place? So I looked around Stack Overflow and found this example. But oddly enough, it doesn't work either --- the string remains the same as before.

// Also outputs the original string
echo preg_replace( '/[^a-zA-Z0-9\s\p{P}]/', '', $text );

So I'm thinking, that's weird, it should have at least removed the punctuation, right? Maybe something is wonky with the string? So I tried running mb_detect_encoding() on it to see what PHP was detecting, and the output said "ASCII".

// Outputs "ASCII"
echo mb_detect_encoding( $text, mb_detect_order(''), true );

I guess I was wondering, does that seem like a strange result for it to return? If I understand correctly, isn't ASCII only a small set of characters that doesn't include the emoji unicode symbols? But maybe, the broader question might be why the punctuation removal code isn't working, and I figured maybe I was using preg_replace wrong. So I tried preg_replace again on a different set of characters to see:

// Outputs "Hi! I'm a text ."
echo preg_replace( '/string of/', '', $text );

...and that worked just fine. I'm puzzled!

So I'm figuring, I guess something's screwy with the data from the database, maybe I should try to force the string encoding to UTF-8? So I tried the following code, which also doesn't work. I'm guessing that's because PHP is already detecting the string as ASCII, so it doesn't do the conversion to UTF-8? I dunno.

// Still outputs "ASCII", and also the original string
$text = iconv( mb_detect_encoding( $text, mb_detect_order(''), true ), "UTF-8", $text );
echo mb_detect_encoding( $text, mb_detect_order(''), true );
echo preg_replace( '/[\x{1F600}-\x{1F64F}]/u', '', $text );

I even tried just a flat out utf8_encode() on the string (since I figured the data is coming from a latin1 database so maybe it's encoded in ISO-8859-1...maybe?) but with no luck either --- it's still the same string, and it's still saying that it's ASCII, which doesn't seem right.

Finally, I figured maybe something's wrong with the preg_replace function itself, but here's the odd part of it --- you remember that simple HTML document from up above? Well, I decided to create a simple form that sends the whole document (using JavaScript) over a POST variable to another PHP page (HTML tags, text and all). And when I'm on that next page and run mb_detect_encoding() on the POST data, it actually outputs UTF-8 --- and not just that, when I run the preg_replace code from above, it's working!

Does anyone have any thoughts on what might be going wrong? Any help with this would be greatly appreciated! I'm admittedly not good friends with character encoding, and I'm going bonkers trying to figure this all out!

Community
Zero Wing

1 Answer


One possible explanation:

The preg_replace calls would fail if the database does not contain the Unicode character itself, but only its HTML entity (&#128540; or &#x1f61c;). That would also explain how a Unicode character is surfacing in a latin1 table, and why the detected encoding is ASCII: the entity is made up of plain ASCII characters, and the browser only turns it into the emoji when it renders the page. Something like

echo str_replace( array('&#128540;','&#x1f61c;'), '', $text );

would work in that case.
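
If the entity theory holds, a more general approach might be to decode the entities first and then run the original filter with the /u modifier, so that any emoji stored this way gets caught, not just this one. This is only a minimal sketch under that assumption: strip_to_plain_text is a hypothetical helper name (not a PHP built-in), and the sample string is made up to mimic what the database might contain.

```php
<?php
// Hypothetical helper: decode HTML entities, then keep only ASCII letters,
// digits, whitespace, and punctuation.
function strip_to_plain_text(string $text): string
{
    // Turn entities such as &#128540; or &#x1f61c; into real UTF-8 characters.
    $decoded = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // The /u modifier makes PCRE treat the pattern and subject as UTF-8,
    // so \p{P} matches Unicode punctuation and the emoji counts as one character.
    $clean = preg_replace('/[^a-zA-Z0-9\s\p{P}]/u', '', $decoded);

    // preg_replace returns null on error (for example, invalid UTF-8 input);
    // fall back to the untouched input in that case.
    return $clean ?? $text;
}

// Assumed sample: the emoji is stored as a numeric entity, which is pure ASCII.
$sample = "Hi! I'm a string of text &#128540;.";
echo strip_to_plain_text($sample), PHP_EOL;
```

Note that decoding also converts entities like &amp;amp; back into their literal characters, so this is only appropriate if the result is treated as plain text and escaped again before being written into HTML.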

chiliNUT