I'm currently working on a regex to replace empty HTML elements. However, the strings in the database contain hidden characters. For example, I copied this string from the database:

<h3> </h3>

When I loop over it and convert each character into an integer with ord, I get the following output:

< => 60
h => 104
3 => 51
> => 62
(space) => 32
< => 60
/ => 47
h => 104
3 => 51
> => 62

However, when I read it from the database and put it into a variable directly, I get the following output:

< => 60
h => 104
3 => 51
> => 62
� => 194
� => 160
< => 60
/ => 47
h => 104
3 => 51
> => 62

I know the 160 is a non-breaking space, so I know this could be correct. However, what I don't get is why I get an extra char 194 (which is Â according to Google).

How can I get rid of the Â? The non-breaking space is understandable, but I don't understand where the Â comes from.

UPDATE:

The data in the database is stored as utf8_general_ci. I set the charset in the PDO connection to utf8.

UPDATE2:

I'm curious why I get an Â (char 194) to begin with. Between <h3> and </h3> in the database there's one character according to my cursor.

I want to remove <h3>[ONLY SPACES]</h3> but because it contains a random char 194 I cannot replace it correctly with regex since 194 isn't a space.

Joshua Bakker
  • Please, provide more details. How do you store value in the database and how do you read it? (character set, collations, etc). – Timurib Apr 13 '18 at 09:24
  • Looks like a [character encoding issue](https://stackoverflow.com/questions/279170/utf-8-all-the-way-through/279279) – CD001 Apr 13 '18 at 09:25
  • See also https://3v4l.org/b9JB8 – Timurib Apr 13 '18 at 09:27
  • *"The data in the database is stored as utf8_general_ci. I set the charset in the PDO connection to utf8."* - OK, what about the HTTP header, `header('Content-Type: text/html;charset=utf-8')` and `` ? – CD001 Apr 13 '18 at 09:38
  • Doesn't change, it still gives me back an Â – Joshua Bakker Apr 13 '18 at 09:39

3 Answers

PHP's ord() function, like most of PHP's built-in string functions, doesn't know anything about character encoding; it just sees the string as a series of bytes. All it does is look at a single byte of the string and tell you the value of that byte as a number between 0 and 255.

However, your text is in UTF-8, where some characters take more than one byte; so when you look through one byte at a time, any numbers higher than 127 are actually one part of a longer sequence. So, there is no "Â".

What's really there is the sequence of bytes "194, 160"; or expressed in hexadecimal "C2 A0". If you look that up in a conversion tool such as this one, you'll see that that sequence of bytes in UTF-8 represents Unicode code point A0, or 160, which you already found was a non-breaking space.

So that's it: your string is correctly encoded, but contains one character that you didn't see, because it's a special type of space.
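To make the byte view and the character view concrete, here is a minimal sketch. It assumes the stored value is exactly `<h3>`, one non-breaking space, `</h3>`; the variable names are only for illustration.

```php
<?php
// "\xC2\xA0" is the UTF-8 encoding of U+00A0 (non-breaking space).
$html = "<h3>\xC2\xA0</h3>";

// Byte view: what a loop over the string with ord() sees.
$bytes = array_map('ord', str_split($html));

// Character view: split on UTF-8 characters instead of single bytes.
$chars = preg_split('//u', $html, -1, PREG_SPLIT_NO_EMPTY);

echo strlen($html), " bytes, ", count($chars), " characters\n"; // 11 bytes, 10 characters
echo $bytes[4], " ", $bytes[5], "\n";                          // 194 160
echo bin2hex($chars[4]), "\n";                                 // c2a0
```

The fifth character is a single non-breaking space; it only looks like two "characters" when the string is read one byte at a time.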

IMSoP
  • So the 194 and 160 combined is one non-breaking space? And how do I include it in the regex then? – Joshua Bakker Apr 13 '18 at 10:34
  • @JoshuaBakker Yes. The regex is probably best left to a separate question, and you may be able to find an answer by searching for "php utf8 regex" or similar. I believe you have to add a flag to the regex to tell it to read the string as UTF-8 rather than single bytes. – IMSoP Apr 13 '18 at 10:39
  • I thought I could use this: `/<\w*>[\s\xA0]*<\/\w*>` where \xA0 is the right character it seems but it doesn't work. But if you suggest me to make another question I'll do so. – Joshua Bakker Apr 13 '18 at 10:41
  • Nevermind, seems like using the 'u' modifier in the regex works and the non-breaking space will be replaced correctly. Much thanks for your explanation. – Joshua Bakker Apr 13 '18 at 10:55
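As a rough sketch of the `u` modifier approach mentioned in the comments above: the pattern below is a variant of the one quoted there, not the asker's final regex. It assumes the opening and closing tag names should match (hence the backreference) and lists the non-breaking space explicitly as `\x{00A0}`.

```php
<?php
// Sample input: one heading holding a non-breaking space, one holding a
// regular space, and one paragraph with real content.
$html = "<h3>\xC2\xA0</h3><p>keep</p><h3> </h3>";

// The /u modifier makes PCRE treat pattern and subject as UTF-8, so
// \x{00A0} matches the two-byte sequence C2 A0 as a single character.
$pattern = '/<(\w+)>[\s\x{00A0}]*<\/\1>/u';

$clean = preg_replace($pattern, '', $html);
echo $clean, "\n"; // <p>keep</p>
```

Without the `u` modifier, `\xA0` matches only the single byte 160, so the preceding byte 194 is left unmatched and the replacement fails.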

Use PHP's iconv function in the loop to replace special characters coming from the database:

$text = "This is the Euro symbol '€'.";
// Transliterate characters that have no ISO-8859-1 equivalent.
$op = iconv("UTF-8", "ISO-8859-1//TRANSLIT", $text);
echo $op, PHP_EOL;

Ananth
  • I try to echo the characters with the `iconv` function, however this gives me some errors: `Notice: iconv(): Detected an incomplete multibyte character in input string` for 194 and `Notice: iconv(): Detected an illegal character in input string` for 160 – Joshua Bakker Apr 13 '18 at 09:32
  • @JoshuaBakker You were feeding `iconv` one byte of the string, and it was quite rightly telling you it needed both bytes at once to interpret them as UTF-8. – IMSoP Apr 13 '18 at 10:42
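The difference described in these comments can be sketched as follows. This assumes the iconv extension is available; the `@` operator is used only to silence the notice for the demonstration, and the exact notice text may vary between iconv implementations.

```php
<?php
$nbsp = "\xC2\xA0"; // the two UTF-8 bytes of U+00A0

// Whole character: both bytes are handed to iconv() together.
$latin1 = iconv("UTF-8", "ISO-8859-1", $nbsp);
echo bin2hex($latin1), "\n"; // a0

// Single byte: a lone lead byte is an incomplete UTF-8 sequence,
// so iconv() raises a notice and returns false.
$partial = @iconv("UTF-8", "ISO-8859-1", $nbsp[0]);
var_dump($partial);
```

Converting the whole string at once, rather than inside a byte-by-byte loop, avoids the notices.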

You can pass the text to the function below:


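function ConvertToUTF8($text)
{
    // Guess the current encoding using the configured detection order.
    $encoding = mb_detect_encoding($text, mb_detect_order(), false);

    // Detection can fail and return false; return the text unchanged in that case.
    if ($encoding === false) {
        return $text;
    }

    // Text detected as UTF-8 is re-encoded so invalid byte sequences are replaced.
    if ($encoding === "UTF-8") {
        $text = mb_convert_encoding($text, 'UTF-8', 'UTF-8');
    }

    // Convert to UTF-8, dropping characters that cannot be converted.
    return iconv($encoding, "UTF-8//IGNORE", $text);
}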
Ananth