I have a problem that I couldn't figure out so far and I'd appreciate any help.

I have the following simple code:

<?php
header("Content-Type: text/html; charset=utf-8");

$body = "begrüßen zu dürfen";

echo htmlentities($body);
echo htmlentities($body, ENT_COMPAT,'UTF-8');

?>

The first echo works while the second returns an empty string. Why does this happen?

The variable $body is a combination of a fixed string like "begrüßen zu dürfen" and some text that comes from a mysql database with UTF-8. If I want to display the text from the DB correctly, let's call it $data, I need to use htmlentities($data, ENT_COMPAT,'UTF-8');, so I was thinking that I can use htmlentities($body, ENT_COMPAT,'UTF-8') to display the whole combined text (partly from DB and partly from a fixed string). However, this does not work.

Any idea how to solve this?

Anirudh Ramanathan
  • Is your file saved in UTF-8? Since the fixed string is the one that's not working, it's very likely your PHP file isn't actually saved in UTF-8. – Esailija Jul 19 '12 at 12:47
  • No, it was in CP1252 (whatever this is :) ). If I keep it that way and use htmlentities($body, ENT_COMPAT,'UTF-8'), it now shows some strange characters such as ��. Also, if I change the file encoding to UTF-8, the Eclipse text editor shows the "ü" as strange characters. Is there a way to type ü as usual in Eclipse and still have it displayed correctly on the web page, including when it is concatenated with UTF-8 encoded data from the DB? Many thanks! – user1513073 Jul 19 '12 at 12:54
  • Listen to what we all say, convert your file to UTF8. That will fix your problems m8. – Peon Jul 19 '12 at 13:01

3 Answers

The second call returns an empty string because htmlentities encounters "invalid code unit sequences" in the input, i.e. bytes that are not valid in the charset you declared (UTF-8). The following does work and returns everything except the characters that were not valid UTF-8:

echo htmlentities($body, ENT_QUOTES | ENT_IGNORE ,"UTF-8");

ENT_IGNORE silently discards invalid code unit sequences instead of returning an empty string.

You are encountering invalid sequences because your PHP file is not saved in the encoding you pass to htmlentities: the string literal is stored as CP1252 bytes, not UTF-8.
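
As a minimal sketch of this behaviour: the byte strings below are illustrative and written with escape sequences, so the result does not depend on how the example file itself is saved.

```php
<?php
// "dürfen" with "ü" as the single CP1252/ISO-8859-1 byte 0xFC: not valid UTF-8.
$cp1252 = "d\xFCrfen";
// The same word with "ü" as the two-byte UTF-8 sequence 0xC3 0xBC.
$utf8 = "d\xC3\xBCrfen";

// Invalid input plus a declared UTF-8 charset yields an empty string.
$strict = htmlentities($cp1252, ENT_COMPAT, 'UTF-8');
// ENT_IGNORE drops the invalid byte instead of giving up.
$ignored = htmlentities($cp1252, ENT_COMPAT | ENT_IGNORE, 'UTF-8');
// Valid UTF-8 is converted to named entities as expected.
$valid = htmlentities($utf8, ENT_COMPAT, 'UTF-8');

var_dump($strict);  // ""
var_dump($ignored); // "drfen"
var_dump($valid);   // "d&uuml;rfen"
```

Note that ENT_IGNORE only hides the symptom: the "ü" is lost from the output, so fixing the file encoding is still the real solution.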

Anirudh Ramanathan

Check whether your file is saved as UTF-8. This works fine for me:

header( "Content-Type: text/html; charset=utf-8" );
$body = "begrüßen zu dürfen";
echo htmlentities( $body );

Output:

begrüßen zu dürfen
Peon

You must save your php file in UTF-8, not in CP1252.

To test you are doing it correctly, try:

<?php
header("Content-Type: text/plain; charset=utf-8");
die("öäå");

If this shows strange characters, the file in question was not saved in UTF-8 properly.

Note that if you are using UTF-8 properly, you do not need to use htmlentities.
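
The same check can also be done in code. The following is only a sketch: it assumes the mbstring extension is available, `to_utf8` is a hypothetical helper that is not part of PHP, and treating every non-UTF-8 string as Windows-1252 is an assumption that fits this question but not legacy data in general.

```php
<?php
// Hypothetical helper (not a PHP built-in): returns the text as UTF-8,
// converting from Windows-1252 only when it is not already valid UTF-8.
function to_utf8(string $text): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    return mb_convert_encoding($text, 'UTF-8', 'Windows-1252');
}

$fromLegacyFile = "d\xFCrfen";     // "dürfen" as CP1252 bytes
$fromUtf8File   = "d\xC3\xBCrfen"; // "dürfen" as UTF-8 bytes

var_dump(mb_check_encoding($fromLegacyFile, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($fromUtf8File, 'UTF-8'));   // bool(true)
var_dump(to_utf8($fromLegacyFile) === $fromUtf8File);  // bool(true)

// With valid UTF-8 and a UTF-8 response header, escaping the HTML
// metacharacters is enough; the umlaut itself can be sent as-is.
$escaped = htmlspecialchars('<b>' . $fromUtf8File . '</b>', ENT_QUOTES, 'UTF-8');
echo $escaped;
```

Converting at runtime is a workaround; saving the source file as UTF-8 in the first place avoids the need for it.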

Esailija
  • die works correctly but if I do htmlentities it does not work, neither with parameter 'UTF-8' (when it shows nothing), nor without (when it shows strange character) – user1513073 Jul 19 '12 at 13:03
  • @user1513073 if your motivation for using `htmlentities` is to escape HTML, you should be using `htmlspecialchars`. If you cannot support UTF-8 at all, you should use `htmlentities`, which can encode any Unicode code point with ASCII characters. But since you can use UTF-8, I do not see a reason for using htmlentities at all. – Esailija Jul 19 '12 at 13:04
  • The reason is that I use htmlentities because later I combine my fixed string with a text that I get from a database in utf-8. And this text is only displayed correctly when I use htmlentities(...,"UTF-8"). Sorry for my confusion... this UTF-8 stuff is just extremely confusing for me :) – user1513073 Jul 19 '12 at 13:07
  • @user1513073 if you have a UTF-8 fixed string (the file was saved in UTF-8), you combine it with a string from the database, and it doesn't work, then the database didn't actually give you a UTF-8 string. Or you are using PHP functions that don't understand UTF-8 and corrupt the string (such as `substr`). See http://stackoverflow.com/questions/279170/utf-8-all-the-way-through – Esailija Jul 19 '12 at 13:11
  • How can I figure out whether the DB gives me UTF-8? Before I run a query I call $mysql->exec("set names 'utf8'"), so I thought all data would arrive in that format. Notice that when I print such data I always get strange characters unless I pass 'UTF-8' to htmlentities or htmlspecialchars. The only string operation I use is concatenation, i.e., the `.` operator. The strange thing is that when I use the ENT_IGNORE option, it works all of a sudden. – user1513073 Jul 19 '12 at 13:16
  • @user1513073 do the same quick test, except `die($string_from_mysql_that_has_umlauts);` It's probably good thing to get to the bottom of this and understand the real cause. – Esailija Jul 19 '12 at 13:18
  • Yes, I think so too. I tried what you suggested with the die and it works fine, so I am still not sure what is the reason for this behavior. Also, why does the approach of DarkXphenomenon work? I'd appreciate a lot your suggestions. – user1513073 Jul 19 '12 at 14:32
  • @user1513073 if you verified using the `die` method that database sends you utf8 strings and your files are saved as utf8, then that leaves interaction with the strings. As I noted before, many php functions corrupt utf8 strings. You should see if you are doing anything with the strings. if you simply `echo $database_string.$hard_coded_string` it should work. His method works because he is making the function ignore invalid code sequences, but there should be no invalid code sequences in the first place. – Esailija Jul 19 '12 at 14:38
  • It was a str_replace apparently...thanks again for your valuable help! – user1513073 Jul 19 '12 at 21:23
  • @user1513073 yeah that one can corrupt utf-8 strings because it's not aware of utf-8. You should use mb extension in general with internal encoding set to utf8. Here's more http://php.net/manual/en/book.mbstring.php – Esailija Jul 19 '12 at 21:34
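
To illustrate the point made in the comments about byte-oriented functions, here is a small sketch using `substr`, the example named above. It assumes the mbstring extension is available, and the sample string is made up for illustration.

```php
<?php
$word = "f\xC3\xBCr"; // "für" as UTF-8: 4 bytes, 3 characters

// substr() counts bytes, so this cuts the two-byte "ü" in half.
$byteCut = substr($word, 0, 2);
// mb_substr() counts characters in the given encoding.
$charCut = mb_substr($word, 0, 2, 'UTF-8');

var_dump(mb_check_encoding($byteCut, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($charCut, 'UTF-8')); // bool(true)

// The truncated string is exactly the kind of input that makes
// htmlentities() return an empty string when UTF-8 is declared.
var_dump(htmlentities($byteCut, ENT_COMPAT, 'UTF-8')); // ""
var_dump(htmlentities($charCut, ENT_COMPAT, 'UTF-8')); // "f&uuml;"
```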