czech char 'ě' on php page script

Question

I can't get this character to display correctly on my web pages. The page uses the UTF-8 charset; do I have to use ISO-8859-2 instead? I read the string containing this character from a database, where it is saved as &#283;. My browser shows only the HTML code, not the character.

It's the only character (so far) that I can't display on my web page. I had a look at http://www.czech.cz and they use UTF-8.

Any suggestions?

take care! Andrea

http://www.fornacigrigolin.it/cz/ProdottiFamiglia.php?id_cat=2 — Andrea Girardi, Apr 26 '10 at 15:36

score 1 · Answer 1 · answered Apr 26 '10 at 15:26


Are you seeing the &#283; in the browser, or when you view source? If you're seeing it in the browser, then it's probably being double-encoded somewhere -- whatever outputs it to the page is probably detecting it as unencoded HTML and is trying to protect you from some kind of HTML-injection. You'll want to make it not do that. But you have an even deeper problem. If your page is served up in UTF-8, and your data is in UTF-8, there isn't any reason to turn it into an HTML entity in the first place. You should be passing through the UTF-8 data. You do not need to switch to a different character encoding.
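
A minimal sketch of passing UTF-8 data straight through, assuming the script file itself is saved as UTF-8. The `$text` value is a hypothetical stand-in for a row read from a UTF-8 database column; only the HTML-significant characters get escaped, so ě reaches the browser as a real character.

```php
<?php
// Hypothetical value standing in for text read from a UTF-8 database column.
$text = "Vrchní omítky: ě";

// Declare the encoding in the HTTP header so the browser decodes the bytes as UTF-8.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}

// Escape only &, <, > and quotes; non-ASCII characters such as ě are left untouched.
$safe = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

echo "<p>{$safe}</p>\n";
```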


rmeador


It's my browser that isn't able to translate the '&#283;' code. There is another problem too: I have an admin page that uploads the string to the DB, and before the DB update I call $text=htmlentities($text,ENT_QUOTES);. For all other languages everything is correct, but not for this char... – Andrea Girardi Apr 26 '10 at 15:38

Use `htmlspecialchars` **not** `htmlentities`. `htmlentities` tries to encode all non-ASCII characters, which is needless and will corrupt them if you don't tell it the right character set. It defaults to nasty old ISO-8859-1. – bobince Apr 26 '10 at 15:49

score 1 · Answer 2 · answered Apr 26 '10 at 15:36


First of all, yes, you really should be using UTF-8. But that doesn't mean the data you have is already UTF-8 encoded.

Secondly, it sounds like that character is HTML encoded in the database already. This is a problem, because it seems that whatever page is displaying this character also tries to HTML-encode the content as well. Here's an example of what I'm talking about.

Data from user: ě
Data HTML encoded (via htmlentities()) prior to going into DB: &#283;
Data stored in DB: &#283;
Data retrieved from DB: &#283;
Data HTML encoded before being printed to the page: &amp;#283;
Data as seen in the browser: &#283;

Do you see that? The character becomes double encoded, so that on the 2nd encoding step the ampersand character is converted into an entity itself.
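
The second encoding step can be reproduced in a few lines. This is only an illustrative sketch: `$stored` is a hypothetical value standing in for what the database holds, and `html_entity_decode()` is used to imitate the single decoding pass a browser performs when rendering.

```php
<?php
// Hypothetical value: the text was already HTML-encoded before it was stored.
$stored = '&#283;';

// The display page encodes it again, so the ampersand itself becomes an entity.
$printed = htmlspecialchars($stored, ENT_QUOTES, 'UTF-8');
echo $printed, "\n";   // &amp;#283;

// A browser decodes entities once, so the user sees the entity code, not ě.
$displayed = html_entity_decode($printed, ENT_QUOTES, 'UTF-8');
echo $displayed, "\n"; // &#283;
```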

This is the problem with HTML-encoding data before storing it in the database. That should only be done prior to displaying the content, not prior to storage.
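
As a rough sketch of that rule, the following keeps the stored text raw and escapes exactly once, at output time. The `$db` array and the `saveText()` and `renderText()` helpers are hypothetical names used for illustration; a real application would use its own database layer.

```php
<?php
// Hypothetical in-memory stand-in for the database table.
$db = [];

// Store the text exactly as submitted: no htmlentities(), no htmlspecialchars().
function saveText(array &$db, string $text): void
{
    $db[] = $text;
}

// Escape once, immediately before the text is placed into HTML.
function renderText(string $text): string
{
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

saveText($db, 'Tom & Jerry: ě <b>');
echo renderText($db[0]), "\n"; // Tom &amp; Jerry: ě &lt;b&gt;
```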


Peter Bailey


You are the man! That's exactly the problem... So I have to remove the htmlentities() call before the data goes into the DB, right? – Andrea Girardi Apr 26 '10 at 15:44
But it's not so clear why in the DB I have "Vrchn&iacute; om&iacute;tky a vyrovn&aacute;vac&ia." and it's shown correctly in my browser... – Andrea Girardi Apr 26 '10 at 15:46
You should use `htmlspecialchars` when outputting text into the HTML page, and `htmlentities` never. Don't HTML-escape content going into the database. – bobince Apr 26 '10 at 15:50
Ok, I've found the problem. The char is coded in the DB as &#283;. How can I prevent this? – Andrea Girardi Apr 26 '10 at 15:53
I've removed the htmlentities call and changed the charset to ISO-8859-1, and it works fine. – Andrea Girardi Apr 26 '10 at 16:03
It works only because your content is still HTML-encoded in the database. If you ever repair that data (or remove the step that encodes it) then ISO-8859-1 won't be sufficient. – Peter Bailey Apr 26 '10 at 16:05
I've suffered from this issue myself. I'm not sure yet what's to blame for it, but it happens when you need to store a character that is not allowed in the database character set. The only reasonable fix is to change the DB charset to one with a wider repertoire, such as UTF-8, or to reject the input data when it contains invalid chars. – Álvaro González Apr 26 '10 at 16:35
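
Repairing the already-encoded rows, as mentioned in the comments above, could look roughly like this. It is a sketch under assumptions: `decodeStoredText()` is a hypothetical helper, the sample strings are illustrative, and the repaired text is assumed to go back into a column that can hold UTF-8. It decodes a single level, so a doubly encoded row would need a second pass.

```php
<?php
// Hypothetical helper: turn one level of HTML entities back into real UTF-8 characters.
function decodeStoredText(string $stored): string
{
    return html_entity_decode($stored, ENT_QUOTES, 'UTF-8');
}

// Illustrative samples of entity-encoded text as it might sit in the DB.
echo decodeStoredText('&#283;'), "\n";        // ě
echo decodeStoredText('Vrchn&iacute;'), "\n"; // Vrchní
```

After such a repair the page has to be served as UTF-8 (or another charset that contains ě), since ISO-8859-1 cannot represent the decoded character.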
