Multilingual site - character encoding

Question

I know this problem is as old as the world and that thousands of answers exist on the web, but I still cannot find what the problem is in my case and why some characters show up as black question marks (�) :(

We have a multilingual site that currently supports 10 languages. Some characters are displayed incorrectly (ве��сией, 联合国��际). It happens with regular characters in non-Latin languages, while in other words on the same page the same characters are displayed correctly. In Latin languages, all special and regular characters are displayed correctly.

I tried playing with the encoding, but when a change fixes the problem in one place, the problem appears in another place.

Here is how my encodings are configured:
1) In MS SQL Server, we use an NVARCHAR(MAX) column with the SQL_Latin1_General_CP1_CI_AS collation.
2) In the web application, the web.config file contains: <globalization requestEncoding="utf-8" responseEncoding="utf-8" />.
3) On the page itself, we have <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />.

In response headers, Chrome shows: Content-Type:text/html; charset=utf-8.

What am I missing? Why do I still see those black question marks? What should I check or change in order to display all characters correctly?

Thanks

UPDATE

I found the problem, and it is not related to transport encoding at all. I thought the problem was in how the text passes DB -> ASP.NET -> browser, but after a lot of debugging I found that it is in how the output is written to HttpContext.Current.Response.Filter. We have our own custom filter, and the buffer (byte[]) passed to the filter's Write method holds a corrupted byte array of the Unicode string, so sometimes the last character of the string, in bytes, was translated as gibberish. I have not yet found how to solve it correctly, but for now I can disable our filter and there are no more question marks.
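The symptom described in this update matches what happens when a multi-byte UTF-8 sequence is split across two buffers and each buffer is decoded on its own. Below is a minimal Python sketch of that mechanism, not the ASP.NET filter itself. The function names and the split position are assumptions chosen for illustration; the sample word is the one from the question.

```python
import codecs


def decode_chunks_naively(chunks):
    """Decode every chunk independently, as a stateless filter would."""
    return "".join(chunk.decode("utf-8", errors="replace") for chunk in chunks)


def decode_chunks_incrementally(chunks):
    """Decode with a stateful decoder that carries incomplete sequences over."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    parts = [decoder.decode(chunk) for chunk in chunks]
    parts.append(decoder.decode(b"", final=True))  # flush any leftover bytes
    return "".join(parts)


text = "версией"
data = text.encode("utf-8")  # each Cyrillic letter here takes 2 bytes

# Hypothetical buffer boundary: cut in the middle of the third letter,
# "р" (U+0440, bytes D1 80), so D1 ends chunk 1 and 80 starts chunk 2.
chunks = [data[:5], data[5:]]

print(decode_chunks_naively(chunks))        # ве��сией
print(decode_chunks_incrementally(chunks))  # версией
```

The naive version produces two replacement characters for one split letter, which is the same pattern as in the question. In .NET, a `System.Text.Decoder` obtained from `Encoding.UTF8.GetDecoder()` keeps this kind of state between calls, so it may be worth trying in the filter's Write method instead of converting each buffer independently.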

Thanks to all.

score 0 · Answer 1 · answered Apr 22 '13 at 10:29


I don't know about MS SQL server, but have you tried having it use UTF-8 encoding instead of latin-1? A quick Google search shows:

DEFAULT CHARACTER SET utf8;
DEFAULT COLLATE utf8_general_ci;

I would think that that would be a better option to use than SQL_Latin1_General_CP1_CI_AS.

interestinglythere


I didn't find anything about UTF-8 collations in MS SQL. I found a suggestion to use a "BIN2" collation, but it didn't help. – Alex Dn Apr 22 '13 at 10:32
I found [this](http://stackoverflow.com/questions/12512687/sql-server-utf8-howto) and now have even more reason to never touch MSSQL, even with a 10' barge pole. ick. – Quentin Apr 22 '13 at 10:36
@Quentin actually I don't think the problem is in the DB, since the same character sometimes appears correctly...the problem is that if one character is OK, then another one is a question mark...in addition, when I use the content of the DB in places other than the web, I don't see any problems :( – Alex Dn Apr 22 '13 at 10:40

tripleee · Answer 2 · 2013-04-22T11:09:42.657

0

If the page renders in a font which lacks those glyphs, they will be rendered with placeholders.

For example, on my phone, several of the examples you say are displaying correctly for you are shown to me with placeholders for some of the text.

edited Apr 22 '13 at 11:09

answered Apr 22 '13 at 10:41

tripleee


We are using an "arial,tahoma" font definition through CSS...as far as I know these are pretty standard, web-safe fonts. The strange thing is that the same letter/character looks good in one word, but in another word it's a question mark. – Alex Dn Apr 22 '13 at 10:46
The placeholders that you see are question marks for me...so it's displayed incorrectly for both me and you. The difference is which incorrect character each of us sees...I am almost sure the problem is somewhere between the web server and the browser, but I can't find what exactly I should change :( – Alex Dn Apr 22 '13 at 10:51
E.g. ве��сией displays with placeholders on my desktop workstation as well. What are the two characters after "ве" supposed to be? Unicode code points would be a good, unambiguous representation. (For example, в is [U+0432](http://www.fileformat.info/info/unicode/char/432/index.htm) if I am not mistaken.) – tripleee Apr 22 '13 at 11:12
Try http://software.hixie.ch/utilities/cgi/unicode-decoder/utf8-decoder for decoding / decomposing if you don't know what you have. – tripleee Apr 22 '13 at 11:19
After "ве" there is supposed to be a single character "р" - U+0440. After I played with the encoding in the DB and the META tag, this word displayed correctly, but then "ОДНОГ��" looked bad. The "placeholder" in this case is supposed to be the single character U+041E - Russian capital О. – Alex Dn Apr 22 '13 at 11:26
Do you know if a left-to-right mark (U+200E) can cause such problems? I found that when I remove that mark from the problem word, it becomes valid... – Alex Dn Apr 22 '13 at 11:47
Again, the actual bytes would be the only sane starting point for any troubleshooting effort. – tripleee Apr 22 '13 at 12:24
Where can I get those bytes? In the DB I see the content without any problems. While debugging, in the line that inserts the content into div.innerHTML, I see the content without any problem...the problem doesn't happen with every text, but where it does happen the content is big, so finding these specific bytes will not be an easy task :( I can try to catch the bytes in End_Request, but I don't know if it's the correct place. – Alex Dn Apr 22 '13 at 12:48
Ok, I found the problem...it's not related to encoding at all. See update in my question :) – Alex Dn Apr 22 '13 at 15:24
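For anyone following the advice in these comments to look at code points and actual bytes, here is a small Python sketch of one way to do it. The helper names `describe` and `find_replacement_chars` are hypothetical, and the sample string is the damaged word from the question.

```python
def describe(text):
    """Return one line per character: code point, UTF-8 bytes, character."""
    lines = []
    for ch in text:
        utf8_hex = " ".join(f"{b:02X}" for b in ch.encode("utf-8"))
        lines.append(f"U+{ord(ch):04X}  {utf8_hex}  {ch}")
    return lines


def find_replacement_chars(text):
    """Return the indexes of U+FFFD, the 'black question mark' character."""
    return [i for i, ch in enumerate(text) if ch == "\ufffd"]


damaged = "ве\ufffd\ufffdсией"  # the word as it appears on the broken page

for line in describe(damaged):
    print(line)

print(find_replacement_chars(damaged))  # [2, 3]
```

If U+FFFD is already present in the text received by the browser, the original bytes were lost before that point, so the dump has to be taken earlier in the pipeline to see what was actually sent.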
