46
  1. VARCHAR does not store Unicode characters.
  2. NVARCHAR does store Unicode characters.
  3. Today's applications should always be Unicode compatible.
  4. NVARCHAR takes twice as much space to store (see the measurement below).
  5. Point 4 doesn't matter because storage space is extremely inexpensive.

Ergo: When designing SQL Server databases today, one should always use NVARCHAR.

Is this sound reasoning? Does anyone disagree with any of the premises? Are there any reasons to choose VARCHAR over NVARCHAR today?
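For concreteness, premises 1, 2, and 4 can be measured directly. A minimal sketch, assuming a default non-UTF-8 collation such as Latin1_General:

    -- DATALENGTH returns bytes, not characters: the VARCHAR literal stores
    -- one byte per character, the NVARCHAR literal two bytes per character.
    SELECT DATALENGTH('résumé')  AS varchar_bytes,   -- 6
           DATALENGTH(N'résumé') AS nvarchar_bytes;  -- 12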

dkretz
Edward Tanguay
  • See this too http://stackoverflow.com/q/35366/27535 – gbn Sep 08 '11 at 17:51
  • This is _not_ sound reasoning, primarily due to invalid premises. Item 3 is too broad of a statement. Item 4 is partially obsolete since SQL Server 2008 introduced PAGE and ROW Compression, and 2008 R2 added (automatic / behind the scenes) Unicode Compression (but Compression is only available in Enterprise Edition). Item 5 is absurdly incorrect. Please see my answer here for details: http://stackoverflow.com/a/32871477/577765 – Solomon Rutzky Sep 30 '15 at 17:36

14 Answers

51

You match the datatype to the data that will be stored in the column. By a similar argument, you could ask why not store all data in NVARCHAR columns, since numbers and dates can also be represented as strings of digits.

If the best match for the data that will be stored in the column is VARCHAR, then use it.

dkretz
42

Point 4 doesn't matter because storage space is extremely inexpensive.

It is not just storage: it is also bandwidth, CPU, memory, backup, recovery, and transfer. Conserve.

Sam
  • Links to this in my answer here: http://stackoverflow.com/questions/35366/varchar-vs-nvarchar-performance/198753#198753 – gbn Sep 08 '11 at 17:52
  • Storage as "NVARCHAR" in a database doesn't mean the data is sent "over the wire" as UCS-2 encoded Unicode. It may go over the wire AND be represented in application memory as UTF-8... which is "one-byte per char most of the time". – Dan H Aug 29 '12 at 03:15
27

I'd say that there are still valid reasons to not use nvarchar.

  • Storage space is at a premium (e.g., on a shared host), or the database is really huge.
  • Performance is critical.
  • Brownfield development (i.e. the database has existing tables that use varchar).
  • You are integrating with another older system that only understands single byte characters and/or varchar.

However, new development should probably use nvarchar, especially since 64-bit systems are becoming the norm. Also, companies (even small ones) are now more commonly global.

Booji Boy
  • What does 64 bit have to do with nvarchar? – Jeremy May 09 '11 at 22:38
  • 2
    double-wide chars take up twice as much memory, but this is much less of a concern on 64-bit systems, since they can address much more RAM than 32-bit systems. 32-bit SQL Server on 32-bit Windows (still fairly common in '08) could only use 2 GB RAM (w/o jumping through hoops) – Booji Boy May 10 '11 at 12:36
19

You should choose VARCHAR over NVARCHAR for many different types of columns, and the choice would be on a per-column basis.

Typical columns which would not require the extra overhead NVARCHAR incurs would be:

ID-type columns: License plates, SSNs, Patient Chart identifiers etc.

Code columns: International currency codes (USD, UKP, etc.), ISO country codes (US, UK, etc), Language codes (en-us, etc), accounting segment codes, etc

Postal code and zip code columns.
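A minimal sketch of that per-column approach (the table and column names here are hypothetical):

    -- Fixed-alphabet codes stay CHAR/VARCHAR; free-form names get NVARCHAR.
    CREATE TABLE dbo.Customer
    (
        CustomerId   INT IDENTITY PRIMARY KEY,
        CountryCode  CHAR(2)       NOT NULL,  -- ISO 3166, e.g. 'US', 'GB'
        CurrencyCode CHAR(3)       NOT NULL,  -- ISO 4217, e.g. 'USD', 'GBP'
        PostalCode   VARCHAR(10)   NULL,
        FullName     NVARCHAR(200) NOT NULL   -- people's names need Unicode
    );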

Cade Roux
11

I believe that comparing nvarchars is more costly than comparing varchars, so varchar is perfectly valid and even preferred in places where you really don't need Unicode capabilities, e.g., for some internal IDs.

And storage cost still does matter. If you have billions of rows then those "small" differences get big pretty fast.
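One place the comparison cost bites in practice is when the two types get mixed. A hedged sketch with a hypothetical table: comparing a VARCHAR column against an NVARCHAR literal forces an implicit conversion of the column (NVARCHAR has higher data-type precedence), which, particularly under SQL collations, can turn an index seek into a scan.

    -- dbo.Orders and OrderCode are hypothetical; OrderCode is an indexed VARCHAR.
    SELECT OrderId
    FROM dbo.Orders
    WHERE OrderCode = N'ABC123';  -- implicit CONVERT on the column; the seek may be lost

    SELECT OrderId
    FROM dbo.Orders
    WHERE OrderCode = 'ABC123';   -- literal matches the column type; seek preserved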

PiRX
5

As others have pointed out, it's not just the cost of the storage.

The length of a column will affect the number of rows per page. Having fewer rows per page means that fewer rows fit into your caches, which drops performance. I assume that in MSSQL, an NVARCHAR column that is indexed will use up more space in the index. That means fewer index entries per block, therefore more blocks in the index, therefore more seeks when scanning (or searching) indexes, which slows down indexed access too.

So it loses you performance on every single front. If you genuinely don't care (or can measure the performance and are happy with it, of course), then that's fine. But if you have a genuine requirement to store unicode characters, of course, use NVARCHAR.

It may be that the maintainability gained by using NVARCHAR throughout your database outweighs any performance cost.
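If you want to measure the rows-per-page effect rather than estimate it, a sketch along these lines (the two table names are hypothetical, and DETAILED mode is needed or the record sizes come back NULL) reports the average record size and page count at the leaf level of each table:

    SELECT OBJECT_NAME(ps.object_id) AS table_name,
           ps.avg_record_size_in_bytes,
           ps.page_count
    FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'DETAILED') AS ps
    WHERE ps.object_id IN (OBJECT_ID('dbo.NamesVarchar'), OBJECT_ID('dbo.NamesNvarchar'))
      AND ps.index_level = 0;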

MarkR
5

These sorts of questions always have the same answer: it depends. There is no magical rule that you should follow blindly. Even the use of GOTO in modern programming languages can be justified: Is it ever advantageous to use 'goto' in a language that supports loops and functions? If so, why?

So the answer is: use your head and think about the particular situation. In this particular instance keep in mind that you can always convert from varchar to nvarchar in the database if it turns out your requirements change.
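The conversion itself is one statement, though not free: a hedged sketch with a hypothetical table and column (any index or constraint on the column may have to be dropped and recreated first, and the data is rewritten, so plan for it on large tables):

    ALTER TABLE dbo.Customer
        ALTER COLUMN FullName NVARCHAR(200) NOT NULL;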

Brannon
4

I have seen nvarchar columns converted to varchar for two reasons:

  1. The application is using MSSQL Express Edition, which has a 4 GB database size limit. Switching to MSSQL Standard Edition would be too expensive if there are many database deployments, as would be the case in single-tenant webapps or applications with an embedded DBMS. The cheaper SQL2008 Web Edition could help here.

  2. nvarchar(4000) is not enough but you don't want an ntext column. So you convert to varchar(8000). However, in most cases you probably should convert to nvarchar(max).
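For point 2, the two options look roughly like this (the column names are hypothetical):

    CREATE TABLE dbo.Notes
    (
        NoteId INT IDENTITY PRIMARY KEY,
        BodyA  VARCHAR(8000),  -- single code page, hard 8000-byte cap
        BodyB  NVARCHAR(MAX)   -- Unicode, up to 2 GB per value
    );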

mika
3

Your point 3 is invalid. Systems designed only for a single country's use don't have to worry about Unicode, and some languages/products in use support Unicode only partially or not at all. For example, TurboTax is U.S.-only (and even the Canadian version, with its French, is still just LATIN-1), so they wouldn't need to worry about Unicode and probably don't support it (I don't know whether they do, but even if they do, it's just an example).

"Today's applications should always be Unicode compatible."

is probably more valid expressed as:

"Today's applications should always be Unicode compatible if nothing special needs to occur to handle Unicode properly, and a previously existing codebase or any other piece of the application does not need to be updated specifically to support it"

MetroidFan2002
  • 1
    I think I would always give more weight to the pain of a potential upgrade to Unicode than the pain of perhaps using too much storage space. – Edward Tanguay Nov 23 '08 at 06:29
  • @Edward, that would be a business decision rather than a technical one. Our company (and it's BIG) still ships some English-only database-using applications because that's our desired market. – paxdiablo Nov 23 '08 at 07:28
  • 1
    As member of non-English speaking country (yes there are quite some of them out there), where language contains diacritics I can say, that applications should be Unicode compatible. – PiRX Nov 23 '08 at 07:35
  • 1
    I don't see why TurboTax has to be English-only, even if it were US only... – Neil Williams Nov 23 '08 at 09:53
  • LATIN-1 handles French and Spanish as well, so there is very little use for Unicode in that product. And PiRX, Unicode may very well be the easiest to use with your situation, but if there was something like LATIN-1 for your common languages, then Unicode would not be required either. – MetroidFan2002 Nov 24 '08 at 01:22
  • 2
    That seems like a flawed assumption. What happens if your US-only application has to store the name of a US citizen born in another country, with a non-ASCII character in their name? It is silly to assume that just because the app is only used in the US or maybe Canada, no unicode will be needed – jalf Jan 23 '09 at 03:05
  • Then obviously that character won't be inputted. To turn your statement around, it is silly to assume that just because there may be infrequent outliers that existing products and applications should be architected to address extremes instead of likelihoods. In this case, it would be YAGNI. – MetroidFan2002 Jan 28 '09 at 21:28
  • Why would I add Unicode support to an application used by two people in my office? 50 hours of work, 0 benefit. Not everyone makes boxed software. – Jeff Davis May 12 '10 at 13:53
2

My leaning is "use NVARCHAR" as a default... but @CadeRoux has a good point: if you are SURE the data will never hold anything but ASCII -- like a US license plate -- VARCHAR might save you a tiny bit of cost.

I'd say the flip side of his well-put statement is "DO use NVARCHAR" for anything that will have names (people, streets, places) or natural-language text (email, chat, articles, blog postings, photo captions). Otherwise, your "firstname" column may not be able to encode "François" or "José" correctly, and your text columns may not allow text with "foreign" diacritical marks, or, for that matter, very common US characters like the cent mark "¢", the paragraph mark "¶", or a bullet "•". (None of those are ASCII characters, and whether they survive in a VARCHAR column depends entirely on its code page; there is no good, standard way to put them into a VARCHAR field portably. Trust me: you'll hurt yourself.)

On ANY project I've worked on, I've NEVER been scolded for using NVARCHAR because I was "squandering too much company money on disk space". And if I had to rework code or the DB schema (especially on a live, production system), the cost spent in the re-fit would EASILY outweigh the "savings" from buying a disk that was 50% smaller.

To really understand this question you really have to understand ASCII, Unicode, and Unicode's typical encodings (like UCS-2 and UTF-8).
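A hedged illustration of the problem described above, assuming a default Latin1_General collation (the variable names are just for the demo):

    -- Characters outside the VARCHAR code page are silently replaced with '?';
    -- the NVARCHAR copy keeps them.
    DECLARE @v VARCHAR(20)  = N'日本語',
            @n NVARCHAR(20) = N'日本語';

    SELECT @v AS varchar_value,   -- '???'
           @n AS nvarchar_value;  -- N'日本語'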

Dan H
  • An NVARCHAR(12) will take 24 bytes, and can hold any 12 characters in the BMP, or any 6 characters outside it. An 8-bit-transparent VARCHAR(24), used with suitable access methods, could hold 24 ASCII characters, any 12 characters in the lowest part of the BMP, any 8 in the BMP, or any 6 outside the BMP; alternatively, it could be used to hold any combination of 8 characters using an encoding which stores 3 bytes per character, e.g. with the MSB set on the first byte of each triplet and cleared on the other two. – supercat Dec 13 '14 at 20:32
2

Storage is less expensive than it's ever been historically, but still if you can store twice as much data on a given hard drive, that's attractive, isn't it?

Also there's RAM for caching, and solid-state drives, which are both a lot more expensive than hard drives. It's beneficial to use more compact data formats when you have millions of rows.
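For a rough sense of scale: 100 million rows with an average of 50 characters in one column is on the order of 5 GB as VARCHAR versus 10 GB as NVARCHAR for that column alone, and the difference is multiplied again in every index, backup, and cache that touches it.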

Bill Karwin
2

Is there a way for your database server to use UTF-8 as an encoding? You then get the benefits of low storage for mostly ASCII loads, and the ability to store anything in the range of Unicode so that expansion is possible.

I would ask your database vendor to support UTF-8 as an encoding for the VARCHAR SQL type, as well. I don't know how other DB servers do it, but I do know that you can use UTF-8 in VARCHAR and TEXT fields in at least MySQL and PostgreSQL.

All that having been said, though, the only reason not to use a UTF-16 encoded field is if you have to interact with applications that will break on UTF-16 input. That mostly means legacy applications designed around ASCII or ISO-8859 text encodings, which would cope better with UTF-8.
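For reference, here is roughly what that looks like in the two databases named above (a sketch; the object names are hypothetical):

    -- MySQL: the character set is chosen per column, so a VARCHAR can hold UTF-8.
    CREATE TABLE people (
        name VARCHAR(100) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci
    );

    -- PostgreSQL: the encoding is set per database; plain VARCHAR/TEXT then stores UTF-8.
    CREATE DATABASE appdb ENCODING 'UTF8';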

Michael Trausch
  • MS SQL Server doesn't support UTF8. It uses UCS-2, which is roughly equivalent to UTF-16 for characters in the basic multilingual plane (BMP). I don't know if a hack exists to get it to support UTF-8, but I doubt it. – Triynko Apr 25 '11 at 20:10
  • In that case, it'd likely be best to wrap access to the database with a character set converter, so that it will return UTF-8 values to the application and send UTF-16 values to the database. At least, that's what I'd do if I were working with a system where I had to worry about character encoding myself. If you do not have to worry about character encoding yourself (e.g., Python 3 or something that takes care of it transparently) then I suppose it doesn't truly matter... – Michael Trausch May 24 '11 at 14:57
1

I'm no expert on the subject. But any reason why you couldn't use UTF-8 to get a combination of small space and unicode?

Evan Teran
  • Microsoft SQL Server (at least 2000 and 2005) doesn't support storing character data in UTF-8. – Mike Henry Nov 23 '08 at 06:05
  • Isn't UTF-8 pretty much just ASCII anyways? – Booji Boy Nov 23 '08 at 06:20
  • Only for code points that fall within the ASCII range - otherwise it is completely different – 1800 INFORMATION Nov 23 '08 at 06:21
  • 1
    To be exact, UTF-8 code points 0-127 are identical to ASCII. – Powerlord Nov 24 '08 at 05:50
  • A database COULD use UTF-8 as a way to store Unicode. I don't see why it would be a performance hit (for searching, let's say). However, the question is about NVARCHAR... which is what MS SQL calls a UCS-2 Unicode string (which is a fixed two-bytes per Unicode character). – Dan H Aug 29 '12 at 03:32
  • @MikeHenry: What does it do with characters where the high bit of a byte is set? One of the design criteria for UTF-8 was that the only way the sequence of bytes which represents a legitimate UTF-8 string can appear within the sequence of bytes for another string is if the second string contains the first. – supercat Dec 13 '14 at 20:27
  • @BoojiBoy No, UTF-8, UCS-2/UTF-16, and UTF-32 are just 3 different ways of encoding the same spectrum of Unicode code points (over 1 million addressable, currently just over 120k mapped). ASCII (proper) is just values 0 - 127 and ASCII Extended (via Code Pages) holds those plus a varying set of characters for values 128 - 255. – Solomon Rutzky Sep 30 '15 at 17:16
  • @DanH Some databases can use UTF-8, but SQL Server stores all Unicode data in UCS-2/UTF-16. This covers the XML and `N`-prefixed types. VARCHAR is specifically 8-bit ASCII (but can do 3 different Double-Byte Character Sets). This is different than Oracle which has NVARCHAR as "Unicode-_only_" and VARCHAR as having the ability to do UTF-8. – Solomon Rutzky Sep 30 '15 at 17:19
1

I've seen some databases where the indices (indexes? ... a different debate) were larger than the data. If you can get away with half the storage demands (varchar) within the index, that roughly doubles the hit density of a given page and makes fill factors more efficient, leading to faster data retrieval, writing, and locking, as well as lower storage requirements (already mentioned).