PHP get rid of Â in database output

Question

I'm currently working on a regex to replace empty HTML elements. However, the strings in the database contain hidden chars. For example, in the database I copy this string:

<h3> </h3>

When I loop over it and convert each character into an integer with ord, I get the following output:

< => 60
h => 104
3 => 51
> => 62
=> 32
< => 60
/ => 47
h => 104
3 => 51
> => 62

However, when I read it from the database and put it into a variable directly, I get the following output:

< => 60
h => 104
3 => 51
> => 62
� => 194
� => 160
< => 60
/ => 47
h => 104
3 => 51
> => 62

I know the 160 is a non-breaking space, so I know this could be correct. However, what I don't get is why I get an extra char 194 (which is Â according to Google).

How can I get rid of the Â I get? The non-breaking space is understandable but I don't get the Â.

UPDATE:

The data in the database is stored as utf8_general_ci. I set the charset in the PDO connection to utf8.

UPDATE2:

I'm curious why I get an Â (char 194) to begin with. Between `<h3>` and `</h3>` in the database there's one character according to my cursor.

I want to remove <h3>[ONLY SPACES]</h3> but because it contains a random char 194 I cannot replace it correctly with regex since 194 isn't a space.

Please, provide more details. How do you store value in the database and how do you read it? (character set, collations, etc). — Timurib, Apr 13 '18 at 09:24
Looks like a [character encoding issue](https://stackoverflow.com/questions/279170/utf-8-all-the-way-through/279279) — CD001, Apr 13 '18 at 09:25
*"The data in the database is stored as utf8_general_ci. I set the charset in the PDO connection to utf8."* - OK, what about the HTTP header, `header('Content-Type: text/html;charset=utf-8')` and `` ? — CD001, Apr 13 '18 at 09:38

score 2 · Accepted Answer · answered Apr 13 '18 at 10:26


PHP's ord() function, like most of the core built-in string functions, doesn't know anything about character encoding; it just sees the string as a series of bytes. All it does is look at a single byte of the string and tell you the value of that byte as a number between 0 and 255.

However, your text is in UTF-8, where some characters take more than one byte; so when you look through one byte at a time, any numbers higher than 127 are actually one part of a longer sequence. So, there is no "Â".

What's really there is the sequence of bytes "194, 160", or expressed in hexadecimal, "C2 A0". If you look that up in a UTF-8 conversion tool, you'll see that this sequence of bytes represents Unicode code point U+00A0 (decimal 160), which you already found was a non-breaking space.
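The difference between the byte view and the character view can be checked directly. The following is a minimal sketch, assuming the mbstring extension is loaded and PHP 7.2 or later (for `mb_ord`); the sample string is an assumption standing in for the database value.

```php
<?php
// Sample value, assumed to match the row in the question: <h3>, U+00A0, </h3>.
$html = "<h3>\u{A0}</h3>";

// Byte view: this is what a loop with ord() over single bytes sees.
$byteValues = array_map('ord', str_split($html));

// Character view: split on UTF-8 character boundaries (the u modifier),
// then take the Unicode code point of each character.
$chars = preg_split('//u', $html, -1, PREG_SPLIT_NO_EMPTY);
$codePoints = array_map('mb_ord', $chars);

echo implode(' ', $byteValues), PHP_EOL;   // contains 194 160
echo implode(' ', $codePoints), PHP_EOL;   // contains a single 160
echo bin2hex("\u{A0}"), PHP_EOL;           // the two bytes in hexadecimal
```

The byte list has one more entry than the code point list, because the non-breaking space occupies two bytes but is one character.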

So that's it: your string is correctly encoded, but contains one character that you didn't see, because it's a special type of space.
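To see where the "Â" in the question comes from: it is what byte 194 looks like when the data is read as a single-byte encoding such as ISO-8859-1 instead of UTF-8. A small sketch of that misreading, assuming the mbstring extension is available:

```php
<?php
// The two bytes found between <h3> and </h3>.
$bytes = "\xC2\xA0";

// Read as UTF-8, the two bytes form exactly one character (U+00A0).
$utf8Length = mb_strlen($bytes, 'UTF-8');

// Read (wrongly) as ISO-8859-1, every byte is its own character,
// and byte 0xC2 is "Â" in that encoding.
$misread = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
$firstChar = mb_substr($misread, 0, 1, 'UTF-8');

echo $utf8Length, PHP_EOL;                    // characters when decoded as UTF-8
echo mb_strlen($misread, 'UTF-8'), PHP_EOL;   // characters when misread as Latin-1
echo $firstChar, PHP_EOL;                     // what a Latin-1 reader shows for byte 194
```

Any tool that maps the number 194 to "Â" is making the same single-byte assumption, which does not apply to a UTF-8 string.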

answered Apr 13 '18 at 10:26

IMSoP


So the 194 and 160 combined is one non-breaking space? And how do I include it in the regex then? – Joshua Bakker Apr 13 '18 at 10:34
@JoshuaBakker Yes. The regex is probably best left to a separate question, and you may be able to find an answer by searching for "php utf8 regex" or similar. I believe you have to add a flag to the regex to tell it to read the string as UTF-8 rather than single bytes. – IMSoP Apr 13 '18 at 10:39
I thought I could use this: `/<\w*>[\s\xA0]*<\/\w*>` where \xA0 is the right character it seems but it doesn't work. But if you suggest me to make another question I'll do so. – Joshua Bakker Apr 13 '18 at 10:41
Nevermind, seems like using the 'u' modifier in the regex works and the non-breaking space will be replaced correctly. Much thanks for your explanation. – Joshua Bakker Apr 13 '18 at 10:55
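The fix described in the comments above can be sketched as follows. This is an illustration rather than the asker's final code: the function name `removeEmptyElements` is hypothetical, and the pattern is tightened slightly compared with the one quoted above (a backreference requires the closing tag to match the opening one).

```php
<?php
// Hypothetical helper: removes elements whose content is only whitespace,
// including the non-breaking space U+00A0.
function removeEmptyElements(string $html): string
{
    // u       : treat pattern and subject as UTF-8, so \x{A0} is one character
    //           rather than the lone byte 0xA0
    // (\w+)   : the tag name, reused by \1 for the closing tag
    // preg_replace returns null on error (e.g. invalid UTF-8); fall back to the input.
    return preg_replace('/<(\w+)>[\s\x{A0}]*<\/\1>/u', '', $html) ?? $html;
}

echo removeEmptyElements("<h3>\u{A0}</h3><p>kept</p>"), PHP_EOL;
```

Without the `u` modifier the character class is matched byte by byte, so it can match byte 160 but not the preceding byte 194, which is why the original attempt failed.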

score 0 · Answer 2 · answered Apr 13 '18 at 09:27


Use PHP's iconv function in the loop to replace special characters from the database:

$text = "This is the Euro symbol '€'.";
// Transliterate characters that ISO-8859-1 cannot represent.
$op = iconv("UTF-8", "ISO-8859-1//TRANSLIT", $text);
echo $op, PHP_EOL;

answered Apr 13 '18 at 09:27

Ananth


I try to echo the characters with the `iconv` function, however this gives me some errors: `Notice: iconv(): Detected an incomplete multibyte character in input string` for 194 and `Notice: iconv(): Detected an illegal character in input string` for 160 – Joshua Bakker Apr 13 '18 at 09:32
@JoshuaBakker You were feeding `iconv` one byte of the string, and it was quite rightly telling you it needed both bytes at once to interpret them as UTF-8. – IMSoP Apr 13 '18 at 10:42
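The point about feeding `iconv` whole characters can be illustrated with a short sketch. It assumes the iconv extension is enabled; the exact notices and return values for invalid input may vary between iconv implementations.

```php
<?php
$nbsp = "\xC2\xA0";   // U+00A0 encoded as UTF-8 (two bytes)

// Whole character: iconv receives both bytes and can decode them together.
$latin1 = iconv('UTF-8', 'ISO-8859-1', $nbsp);
echo bin2hex($latin1), PHP_EOL;   // the single ISO-8859-1 byte for a non-breaking space

// One byte at a time: each call sees only a fragment of the character.
// @ suppresses the notices quoted above; the calls are expected to fail.
foreach (str_split($nbsp) as $byte) {
    var_dump(@iconv('UTF-8', 'ISO-8859-1', $byte));
}
```

Converting the complete string, rather than the individual bytes produced by the `ord` loop, avoids both notices.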

score 0 · Answer 3 · answered Apr 13 '18 at 09:42


You can pass the text to the function below:


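function ConvertToUTF8($text){
    // Guess the current encoding using mbstring's configured detection order.
    $encoding = mb_detect_encoding($text, mb_detect_order(), false);

    // If the text already looks like UTF-8, re-encode it to itself;
    // by default mbstring substitutes '?' for invalid byte sequences.
    if ($encoding == "UTF-8") {
        $text = mb_convert_encoding($text, 'UTF-8', 'UTF-8');
    }

    // Convert from the detected encoding to UTF-8, dropping unconvertible characters.
    $out = iconv(mb_detect_encoding($text, mb_detect_order(), false), "UTF-8//IGNORE", $text);

    return $out;
}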

answered Apr 13 '18 at 09:42

Ananth


It now gives me back 194 converted as a '?' – Joshua Bakker Apr 13 '18 at 09:44
I think this doesn't work in a Linux environment, so you can use str_replace(array("ā","ī"), array("a","i"), "your text here"); – Ananth Apr 13 '18 at 09:55
I'm using Windows. And these characters don't exist. Like the Â isn't displayed as that in the database. And it shouldn't add something. – Joshua Bakker Apr 13 '18 at 09:56
