When I attempt to write data to my SQL Server db from this website:

http://kgd.gov.kz/en/wanted?page=0

The fields display like this:

удоÑтоверение личноÑти â„–024806818 от 19.01.2010 МЮ РК

My server's collation is set to `SQL_Latin1_General_CP1_CI_AS`. The fields are set to `nvarchar(max)`.

How can I display these characters correctly? I also pull data from another site that appears to be in Russian (I think these are Russian characters?), and those display correctly.

I'm on SQL Server 2012. I have researched this on Stack Overflow, and the only two threads I found did not offer a valid solution.

Thanks.

Stpete111
  • My guess is you are sticking this data into a varchar datatype. You need to use nvarchar to accommodate the extended character set. – Sean Lange Mar 28 '17 at 14:01
  • This is a problem outside of SQL: the website pushes strings with an uncontrolled and/or unsupported encoding. So you have to prepare your database to store _unicode?_ and modify your website so that it always pushes a guaranteed encoding like _utf8?_. Right now this is utf8 saved as cp1252 (so SeanLange is right about `varchar` type (which is ascii) usage instead of `nvarchar` for unicode support). – Ivan Starostin Mar 28 '17 at 14:07
  • Answer on "what is a collation" question: http://stackoverflow.com/questions/4538732/what-does-collation-mean It has nothing to do with your problem. – Ivan Starostin Mar 28 '17 at 14:08
  • @SeanLange I edited my post to include that my fields are in fact nvarchar(max) – Stpete111 Mar 28 '17 at 14:30
  • It's not a problem with SQL Server. You used ASCII text instead of Unicode and tried to read it with a *different* codepage than the one that was used to store it. – Panagiotis Kanavos Mar 28 '17 at 14:32
  • @IvanStarostin ok, just to make sure I understand correctly: the SQL_Latin1 collation setting of my db has nothing to do with the problem I'm seeing? – Stpete111 Mar 28 '17 at 14:32
  • @Stpete111 Unicode data isn't affected by codepages/collations. Either the field isn't Unicode or the Unicode text is converted to ASCII before it's displayed. How did you store the text anyway? What application did you use? Did you try storing UTF8 instead of UTF16 perhaps? – Panagiotis Kanavos Mar 28 '17 at 14:33
  • Ehmmmm, that page shows up OK. Did you hard-code an encoding on the *web page* perhaps? – Panagiotis Kanavos Mar 28 '17 at 14:35
  • No repro on Edge, Chrome, IE11 as long as the encoding is left to its default value. The `charset` meta tag is set to UTF8. Did you switch your **browser's** encoding to Latin1? That's the only way the page will appear with mangled text. – Panagiotis Kanavos Mar 28 '17 at 14:36
  • @Stpete111 correct. Your problem is about storing and encoding, not sorting/comparing strings. – Ivan Starostin Mar 28 '17 at 14:46
  • @IvanStarostin this isn't about storing. The page displays correctly unless one modifies the encoding from the browser's menu – Panagiotis Kanavos Mar 28 '17 at 14:48
  • @PanagiotisKanavos That looks to me like a "robustness" made by two mirror bugs. And you described how to make it fail. – Ivan Starostin Mar 28 '17 at 15:00
  • @IvanStarostin no, I just clicked on the link. The page displays without problems, eg: `Wanted person full name: КАСКАТАЕВ ТАИР ЕРКИНОВИЧ`. The only way to get garbled text is to actually go to the browser's menu and change the encoding to `Western Europe`. That produces `Wanted person full name: КÐСКÐТÐЕВ ТÐИРЕРКИÐОВИЧ`. Either the OP changed the browser's encoding without realizing it, or the link isn't the actual page – Panagiotis Kanavos Mar 28 '17 at 15:03
  • @IvanStarostin the actual text seems to be the `Identity Documents` section. `Identity documents: Удостворение личности №039344681 от 02.12.2015 вылан МВД РК` with UTF8 appears as `dentity documents: УдоÑтворение личноÑти â„–039344681 от 02.12.2015 вылан МВД РК` if encoding is set to Western Europe – Panagiotis Kanavos Mar 28 '17 at 15:07
  • @PanagiotisKanavos yes, just as I said: uncontrolled conversion from multibyte string to single-byte at first and uncontrolled displaying and conversion of broken string to _whatever it is_ in the end. It does look like correct behaviour when **you guess** the right encoding. – Ivan Starostin Mar 28 '17 at 15:35
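
The diagnosis in the comments (UTF-8 bytes read back through a Western European single-byte code page) can be reproduced outside of SQL Server. The following is a minimal Python sketch, not the asker's actual pipeline, which the question does not describe. The helper names `misread_utf8_as_cp1252` and `repair_mojibake` are hypothetical, and falling back to Latin-1 for the five bytes that Python's `cp1252` codec leaves unassigned is an assumption meant to mimic how browsers treat those bytes.

```python
def misread_utf8_as_cp1252(text: str) -> str:
    """Encode text as UTF-8, then decode every byte as Windows-1252."""
    chars = []
    for byte in text.encode("utf-8"):
        raw = bytes([byte])
        try:
            chars.append(raw.decode("cp1252"))
        except UnicodeDecodeError:
            # Assumption: bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D have no
            # cp1252 mapping in Python, so map them to C1 control characters.
            chars.append(raw.decode("latin-1"))
    return "".join(chars)


def repair_mojibake(garbled: str) -> str:
    """Reverse the misreading: recover the original bytes, decode as UTF-8."""
    raw = bytearray()
    for ch in garbled:
        try:
            raw += ch.encode("cp1252")
        except UnicodeEncodeError:
            # Mirror of the fallback above for the C1 control characters.
            raw += ch.encode("latin-1")
    return raw.decode("utf-8")


original = "№024806818 от 19.01.2010 МЮ РК"
garbled = misread_utf8_as_cp1252(original)
print(garbled)                    # begins with: â„–024806818
print(repair_mojibake(garbled))   # same text as `original`
```

A repair like this can only work while every garbled character is still intact, which an `nvarchar` column should preserve. It is a stopgap at best: decoding the page as UTF-8 before the text ever reaches the database avoids the problem entirely.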

0 Answers