
I have the following two fields in a Sql Server table:

When I add some test data with accented characters into the field, it actually stores them! I thought I had to change the column from VARCHAR to NVARCHAR to accept accented characters, etc?

Basically, I thought:

  • VARCHAR = ASCII
  • NVARCHAR = Unicode

So is this a case where façade etc. are actually ASCII, while some other characters would error (if VARCHAR)?

I can see the ç and é characters in the extended ASCII chart (link above) .. so does this mean ASCII includes 0->127 or 0->255?

(Side thought: I guess I'm happy with accepting 0->255 and stripping out anything else.)

Edit

  • DB collation: Latin1_General_CI_AS
  • Server Version: 12.0.5223.6
  • Server Collation: SQL_Latin1_General_CP1_CI_AS
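
Here's a minimal sketch of the kind of test I'm running, in case it helps; the table and column names are just placeholders, not my real schema:

```sql
-- Placeholder table; my real one has different names.
-- Database collation is Latin1_General_CI_AS (code page 1252).
CREATE TABLE dbo.CollationTest
(
    SomeVarchar  VARCHAR(100),
    SomeNVarchar NVARCHAR(100)
);

INSERT INTO dbo.CollationTest (SomeVarchar, SomeNVarchar)
VALUES ('façade', N'façade');

-- Both columns come back with the accents intact, which is what surprised me.
SELECT SomeVarchar, SomeNVarchar FROM dbo.CollationTest;
```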
Pure.Krome
  • You need to use the NVARCHAR datatype, and while saving the data you need to prefix the literal with N' to avoid data corruption. Use N'This home is in prime location...' in the insert and check. – Ketan Kotak Sep 05 '19 at 07:09
  • 1
    Thanks for the comment but I don't think you read my question properly? Currently, the datatype is `VARCHAR` and currently, the db _is storing_ those accented characters. Please re-read the post to see the details and the question I'm asking. Thanks for reading my post, though :) – Pure.Krome Sep 05 '19 at 07:19
  • Sorry, I didn't read it exactly. In my regional language, if I save with the same varchar it doesn't work. I need to read some more SQL now :) – Ketan Kotak Sep 05 '19 at 09:39
  • If you're storing non-ASCII data in the database you should definitely use `nvarchar` column types. When using `varchar` columns any characters in the 128-255 range are subject to the [`COLLATION`](https://learn.microsoft.com/en-us/sql/relational-databases/collations/collation-and-unicode-support?view=sql-server-2017) settings applied to the database, table or specific columns and may be mangled when being inserted/updated. – AlwaysLearning Sep 05 '19 at 09:43
  • @AlwaysLearning thanks for the reply! Can you please provide some links about the 128-255 range characters that are subject to COLLATION settings _with respect to storing_. I read the doc for the link you supplied and I think it's applying to SORTING and ORDERING? – Pure.Krome Sep 05 '19 at 12:26
  • Possible duplicate of [Can the French and Spanish special chars be held in a varchar?](https://stackoverflow.com/questions/7182275/can-the-french-and-spanish-special-chars-be-held-in-a-varchar) – Cameron MacFarland Sep 05 '19 at 13:30
  • 1
    @Pure.Krome the links you ask for are the Wikipedia articles about codepages, ASCII and Unicode. No, `ç` isn't ASCII, it doesn't exist in the 7-bit US-ASCII codepage. It exists in *other* codepages like Latin 1. The collation does specify the codepage used to handle the stored text. If you use the wrong codepage while reading you'll get garbled text. That's very well documented. I suspect your database uses a Latin 1 collation, which is why you can store French characters. You won't be able to store Greek or Cyrillic characters though. – Panagiotis Kanavos Sep 05 '19 at 15:49
  • @Pure.Krome the reason you can type `façade` in this question and I can type `Αυτό Εδώ` is because StackOverflow is an ASP.NET application that stores text in nvarchar fields. Both ASP.NET and Windows use Unicode natively. – Panagiotis Kanavos Sep 05 '19 at 15:55
  • So, you found what you thought was ASCII isn't ASCII. You can skip that step by assuming that anywhere something is thought or said to be ASCII, it is unknown and probably not ASCII unless someone can cite a standard that says so. – Tom Blodget Sep 05 '19 at 22:12

2 Answers


First the details of what Sql Server is doing.

VARCHAR stores single-byte characters using a specific collation. ASCII only uses 7 bits, or half of the possible values in a byte. A collation references a specific code page (along with sorting and comparison rules) to use the other half of the possible values in each byte. These code pages often include support for a limited, specific set of accented characters. If the code page used for your data supports an accented character, you can store it; if it doesn't, you see weird results (unprintable "box" or ? characters). You can even output data stored in one collation as if it had been stored in another, and get really weird results that way (but don't do this).
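
As a sketch of that code-page dependence (assuming the question's Latin1_General_CI_AS database collation, which maps to code page 1252; the variable names are made up):

```sql
-- Under a Latin1_General collation (code page 1252), varchar can hold Western
-- European accented characters, but anything outside that code page is
-- silently replaced with '?' when converted to varchar.
DECLARE @french VARCHAR(20) = 'façade',      -- representable in code page 1252
        @greek  VARCHAR(20) = N'Αυτό Εδώ';   -- Greek letters are not in code page 1252

SELECT @french AS FrenchInVarchar,  -- 'façade'
       @greek  AS GreekInVarchar;   -- '???? ???'
```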

NVARCHAR is Unicode, but there is still some reliance on collations. In most situations you will end up with UTF-16, which allows for the full range of Unicode characters. Certain collations will instead give you UCS-2 behaviour, which is slightly more limited. See the nchar/nvarchar documentation for more information.
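
A small sketch of that UTF-16 vs UCS-2 distinction, using a supplementary character (an emoji) and a made-up table name; the `_SC` suffix marks the supplementary-character-aware collations:

```sql
-- The emoji U+1F600 is stored as a UTF-16 surrogate pair in both columns.
-- With a non-SC collation the string functions count it as two UCS-2 code
-- units; with an *_SC collation it is counted as one character.
CREATE TABLE dbo.SupplementaryDemo
(
    NonSC NVARCHAR(10) COLLATE Latin1_General_CI_AS,
    SC    NVARCHAR(10) COLLATE Latin1_General_100_CI_AS_SC
);

INSERT INTO dbo.SupplementaryDemo VALUES (N'😀', N'😀');

SELECT LEN(NonSC) AS LenNonSC,  -- 2
       LEN(SC)    AS LenSC      -- 1
FROM dbo.SupplementaryDemo;
```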

As an additional quirk, the upcoming Sql Server 2019 will include support for UTF-8 in char and varchar types when using the correct collation.
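
A sketch of what that will look like, using one of the announced `_UTF8` collations (SQL Server 2019 onwards; the table name is made up):

```sql
-- With a *_UTF8 collation, varchar holds the full Unicode range encoded as
-- UTF-8 instead of a single-byte code page. Note the declared length is in
-- bytes, so multi-byte characters consume more of it.
CREATE TABLE dbo.Utf8Demo
(
    Txt VARCHAR(100) COLLATE Latin1_General_100_CI_AS_SC_UTF8
);

-- N'' literals keep the characters intact regardless of the database's default code page.
INSERT INTO dbo.Utf8Demo VALUES (N'façade'), (N'Αυτό Εδώ'), (N'日本語');

SELECT Txt, DATALENGTH(Txt) AS Bytes FROM dbo.Utf8Demo;  -- variable byte lengths
```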


Now to answer the question.

In some rare cases, where you are sure your data only needs to support accented characters originating from a single specific (usually local) culture, and only those specific accented characters, you can get by with the varchar type.

But be very careful making this determination. In an increasingly global and diverse world, where even small businesses want to take advantage of the internet to increase their reach, even within their own community, using an insufficient encoding can easily result in bugs and even security vulnerabilities. The majority of situations where it seems like a varchar encoding might be good enough are really not safe anymore.

Personally, about the only place I use varchar today is mnemonic code strings that are never shown to or provided by an end user; things that might be enum values in procedural code. Even then, this tends to be legacy code, and given the option I'll use integer values instead, for faster joins and more efficient memory use. However, the upcoming UTF-8 support may change this.

Joel Coehoorn
  • This is a great answer @JoelCoehoorn - ta! Excluding the new SqlServer 2019 + UTF-8 stuff, this really answers my question : go and stick with `NVARCHAR` for user input. My users are global, so I should respect that. Cheers, much appreciated for the detailed answer! – Pure.Krome Sep 05 '19 at 22:51

VARCHAR is ASCII using the current system code page - so the set of characters you can save depends on which code page that is.

NVARCHAR is Unicode, so you can store all the characters.

Merrion
  • This is close, but not quite right. The code page is determined by the collation on the table, not the operating system. – Joel Coehoorn Sep 05 '19 at 16:00
  • @JoelCoehoorn And can be overridden on the field level too, which is handy for localised fields. – Luaan Sep 05 '19 at 16:17
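
To illustrate what the comments above describe, here's a sketch (with made-up table and column names) of setting the collation, and therefore the code page used for `varchar` data, per column:

```sql
-- Each varchar column gets its own collation and therefore its own
-- single-byte code page; the nvarchar column stores UTF-16 regardless.
CREATE TABLE dbo.PerColumnCollation
(
    WesternText VARCHAR(100)  COLLATE Latin1_General_CI_AS,  -- code page 1252
    GreekText   VARCHAR(100)  COLLATE Greek_CI_AS,           -- code page 1253
    AnyText     NVARCHAR(100)                                -- UTF-16
);

INSERT INTO dbo.PerColumnCollation
VALUES (N'façade', N'Αυτό Εδώ', N'façade Αυτό Εδώ');

-- Each varchar column keeps only what its own code page can represent;
-- the nvarchar column keeps everything.
SELECT * FROM dbo.PerColumnCollation;
```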