248

I'm working on a database for a small web app at my school using SQL Server 2005.
I see a couple of schools of thought on the issue of varchar vs nvarchar:

  1. Use varchar unless you deal with a lot of internationalized data, then use nvarchar.
  2. Just use nvarchar for everything.

I'm beginning to see the merits of view 2. I know that nvarchar does take up twice as much space, but that isn't necessarily a huge deal since this is only going to store data for a few hundred students. To me it seems like it would be easiest not to worry about it and just allow everything to use nvarchar. Or is there something I'm missing?

Jason Baker
  • 6
    reference: a much more extensive thread which came to the opposite conclusion. http://stackoverflow.com/questions/312170/is-varchar-like-totally-1990s – dkretz Jan 30 '09 at 01:04
  • similar question here: http://stackoverflow.com/questions/312170/is-varchar-like-totally-1990s EDIT by le dorfier: which interestingly came to exactly the opposite conclusion. – Booji Boy Jan 30 '09 at 00:52
  • 3
    Jason: I hope this isn't an inappropriate request, but can you please consider changing the accepted answer to [gbn's](https://stackoverflow.com/a/198753/577765). JoeBarone's answer is horribly wrong for many reasons. Having it be "accepted" misleads novices into making bad choices. It is unnecessary and wasteful to "always use `NVARCHAR`", and it can have very negative impacts on performance and hardware costs / budgets. A few rows, even a few thousand, won't matter. But systems grow more rapidly than people expect, so the current accepted answer is a disservice to the community. Thank you. – Solomon Rutzky Jun 28 '18 at 02:37

14 Answers

236

Disk space is not the issue... but memory and performance will be: double the page reads, double the index size, strange LIKE and = constant behaviour, etc.
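
To see the "constant" gotcha concretely, here is a minimal sketch (table and index names are made up, not from the answer):

    -- Hypothetical table: LastName is VARCHAR and indexed
    CREATE TABLE dbo.Students
    (
        StudentID INT IDENTITY(1,1) PRIMARY KEY,
        LastName  VARCHAR(50) NOT NULL
    );
    CREATE INDEX IX_Students_LastName ON dbo.Students (LastName);

    -- Literal matches the column type: a plain index seek
    SELECT StudentID FROM dbo.Students WHERE LastName = 'Smith';

    -- N'Smith' is NVARCHAR, which has higher type precedence, so LastName is
    -- implicitly converted; depending on the collation this can degrade the
    -- seek to a scan
    SELECT StudentID FROM dbo.Students WHERE LastName = N'Smith';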

Do you need to store Chinese or other non-Latin scripts? Yes or no...

And from MS BOL "Storage and Performance Effects of Unicode"

Edit:

Recent SO question highlighting how bad nvarchar performance can be...

SQL Server uses high CPU when searching inside nvarchar strings

gbn
  • 20
    +1, if your app goes international, you'll have many other issues to worry about than a search/replace to nvarchar: multilingual text/messages, time zones, units of measure and currency – KM. Jun 04 '09 at 18:08
  • 2
    But what if you need to store a foreign name sometimes, like José or Bjørn? – Qwertie Jan 27 '12 at 20:41
  • 7
    @Qwertie: then you use nvarchar. What you don't do is use it unnecessarily. Those 2 names fit into varchar anyway IIRC – gbn Jan 27 '12 at 21:15
  • @Qwertie & gbn - Correct. Those names are covered by the Latin-1 code page and work fine for languages across Europe; it's other languages, such as Arabic or Mandarin, that need Unicode – Chris Wood Jan 21 '14 at 15:32
  • 1
    @gbn I do not understand this `strange LIKE and = constant behaviour` part. Could you add some details about what you mean by that? – surfmuggle Jan 24 '14 at 09:17
  • 14
    Saying disk space is not an issue is not true for everyone. We've naively used nvarchar unnecessarily in a large banking application with billions of records stored over many years. With expensive SAN based storage with replication, backup and disaster recovery this can actually translate to millions of dollars in costs for nvarchar vs varchar. Not to mention there is a large (100%) performance impact to have to read twice as many bytes from disk for every read. – codemonkey Aug 21 '14 at 18:37
  • 3
    @codemonkey , et al: I did what I could to address the issue of wasted space holistically in the following article: [Disk Is Cheap! ORLY?](http://www.sqlservercentral.com/articles/data-modeling/71725/) (free registration is required, though). The article is intended to help prevent the situation that codemonkey ran into regarding expensive, enterprise-level storage. – Solomon Rutzky Sep 30 '15 at 15:07
  • 2
    @surfmuggle He means that not using N'string' format could cause performance issues. Now every query needs to be correct based on the type and you have to keep track of them all. Not a huge deal, but a potential gotcha. – PRMan Jul 24 '20 at 18:12
  • I had a field that was used for doing equals string lookups and changing to VARCHAR made a HUGE difference in performance (MS-SQL) when using C# Linq. – shawad Jan 07 '22 at 17:22
132

Always use nvarchar.

Most applications will never need double-byte characters. However, if you need to support double-byte languages and your database schema only has single-byte support, it's really expensive to go back and modify it throughout your application.

The cost of migrating one application from varchar to nvarchar will be much more than the little bit of extra disk space you'll use in most applications.
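
For what it's worth, the mechanical part of that migration is a one-liner (sketch with made-up names); the cost is in rewriting every row, rebuilding dependent indexes, and retesting the application:

    -- Hypothetical: any index containing LastName must be dropped first and
    -- recreated afterwards; every row in the table gets rewritten.
    ALTER TABLE dbo.Students ALTER COLUMN LastName NVARCHAR(50) NOT NULL;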

Joe Barone
  • 4
    it is far harder to go back and add support for multilingual text/messages, time zones, units of measure and currency, so everyone MUST always code these in their application from day one, ALWAYS (even if it is only on your home page web app)! – KM. Jun 04 '09 at 18:08
  • 87
    What about index size, memory usage etc? I assume you always use int when you could use tinyint too "just in case"? – gbn Jul 09 '10 at 07:02
  • @KM your point doesn't bear directly on the question and answer, but it is still something very important to consider. – Eduardo Xavier Sep 04 '10 at 13:58
  • 3
    Adding support for Unicode is not that difficult if .NET is in use. It could be as simple as changing the varchar columns to nvarchar, since you won't lose data converting from a code page to Unicode. Existing application code may involve character constraints that you'd simply have to relax or remove. .NET character data is already Unicode, and SQL parameter types are often automatically determined or are converted implicitly by the database to the column type. SQL query string constants may need to be prepended with an 'N', and some string encoding calls may need to use UTF8 or Unicode. – Triynko Apr 25 '11 at 20:01
  • 107
    Always coding/planning for a multi-lingual site (when you have no inkling that you will ever need it) is like telling all young adults they should buy a big 8-seat, gas-guzzling SUV for their first car... after all, they might get married some day and might have 6 kids. I'd rather enjoy the performance and efficiency while I can and pay the price for the upgrade when/if I need it. – E.J. Brennan Jun 26 '11 at 12:54
  • @gbn your point on always using int instead of tinyint isn't the same comparison to using nvarchar vs varchar. If you have a field for, say, status registers (1=active, 0=deleted, etc) and you KNOW you would never have more than 256 values, then tinyint would be fine. But do you ever KNOW that your app won't use Arabic characters? Maybe, maybe not. But I assume not and use nvarchar. – cbmeeks Mar 07 '12 at 15:22
  • 4
    @cbmeeks: I *don't* code for what I *don't* know. But if you can use it with no noticeable performance hit, then your databases aren't big enough for it to matter... – gbn Mar 07 '12 at 15:48
  • 77
    Usually when people start their answer with the word "Always" then you should ignore everything that comes after that. (Notice I started that statement out with the word "usually" :) – Brandon Moore May 04 '12 at 20:07
  • 5
    NVARCHAR is Unicode stored as UCS-2, for which SQL Server uses 2x the storage space compared to just plain VARCHAR. This means you can cache half as many rows in memory, and your storage IO takes twice as long. – Peter Gfader Dec 03 '14 at 13:38
  • 7
    This is dreadfully horrible advice. _Always_ use `NVARCHAR`? You don't work for EMC or Oracle, do you? ;-) – Solomon Rutzky Sep 30 '15 at 15:18
  • 3
    Terrible advice. If you don't anticipate needing double byte support why pick up the storage and processing overheads. – davehay Aug 09 '17 at 06:39
  • 13
    I know this is old, but this is _bad advice_ and should **NOT** be the accepted answer. – shuniar Jun 20 '18 at 14:59
  • Clearly, Microsoft has messed up by allowing us to still use varchar whenever we like. What are they thinking?!.... – Lee Taylor Nov 05 '19 at 20:56
62

Be consistent! JOIN-ing a VARCHAR to NVARCHAR has a big performance hit.
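
A sketch of the mismatch (made-up tables; the point is the differing string types, not the names):

    -- Made-up tables: the only point is the type mismatch on CustomerCode
    CREATE TABLE dbo.Customers (CustomerCode VARCHAR(20) PRIMARY KEY);
    CREATE TABLE dbo.Orders (OrderID INT IDENTITY(1,1) PRIMARY KEY,
                             CustomerCode NVARCHAR(20) NOT NULL);

    -- NVARCHAR has higher type precedence, so Customers.CustomerCode is
    -- implicitly converted on every row, which can prevent an index seek
    SELECT o.OrderID
    FROM   dbo.Orders    AS o
    JOIN   dbo.Customers AS c
      ON   o.CustomerCode = c.CustomerCode;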

Thomas Harlan
  • 124
    If you're doing joins on character fields then your database probably has worse issues than whether to use nvarchar or varchar, generally speaking. – Brandon Moore May 04 '12 at 20:02
  • @Thomas Harlan A simple test demonstrates to me that there is no tangible difference between joining `nvarchar` to `varchar` vs converting `nvarchar` to `varchar` and joining to `varchar`. Unless of course you meant being consistent in column datatypes, not in joining. – ajeh Apr 12 '17 at 18:53
  • 2
    @ajeh and Thomas: 1) "simple" tests are often misleading as they don't cover variations that cause differences in behavior. 2) If one sees a drastic performance hit when mixing `VARCHAR` and `NVARCHAR`, that should be due to indexing of the `VARCHAR` column along with the type of Collation used for that column (and hence the index). I cover this topic in detail in the following blog post: [Impact on Indexes When Mixing VARCHAR and NVARCHAR Types](http://sqlquantumleap.com/2017/07/10/impact-on-indexes-when-mixing-varchar-and-nvarchar-types/). – Solomon Rutzky Aug 05 '17 at 16:16
  • Data point on joins: we have 3 views under 10k rows that perform fine with these joins, but after adding a view with 22k rows the query takes 2 min, while the 10k views run in under 5 sec. The domain is financial (equities), and changing stock symbols for surrogate keys would be an even bigger hit. – yzorg Dec 03 '21 at 18:30
  • There seems to be a "tipping point" when the type coercion is pushed to all rows, and query time falls off a cliff. – yzorg Dec 03 '21 at 18:31
49

nvarchar is going to have significant overhead in memory, storage, working set and indexing, so if the specs dictate that it really will never be necessary, don't bother.

I would not have a hard and fast "always nvarchar" rule because it can be a complete waste in many situations - particularly ETL from ASCII/EBCDIC or identifiers and code columns which are often keys and foreign keys.

On the other hand, there are plenty of kinds of columns where I would be sure to ask this question early, and if I didn't get a hard and fast answer immediately, I would make the column nvarchar.

Cade Roux
48

I hesitate to add yet another answer here as there are already quite a few, but a few points need to be made that have either not been made or not been made clearly.

First: Do not always use NVARCHAR. That is a very dangerous, and often costly, attitude / approach. It is no better than saying "Never use cursors," since cursors are sometimes the most efficient means of solving a particular problem, and the common work-around of doing a WHILE loop will almost always be slower than a properly done Cursor.

The only time you should use the term "always" is when advising to "always do what is best for the situation". Granted that is often difficult to determine, especially when trying to balance short-term gains in development time (manager: "we need this feature -- that you didn't know about until just now -- a week ago!") with long-term maintenance costs (manager who initially pressured team to complete a 3-month project in a 3-week sprint: "why are we having these performance problems? How could we have possibly done X which has no flexibility? We can't afford a sprint or two to fix this. What can we get done in a week so we can get back to our priority items? And we definitely need to spend more time in design so this doesn't keep happening!").

Second: @gbn's answer touches on some very important points to consider when making certain data modeling decisions when the path isn't 100% clear. But there is even more to consider:

  • size of transaction log files
  • time it takes to replicate (if using replication)
  • time it takes to ETL (if ETLing)
  • time it takes to ship logs to a remote system and restore (if using Log Shipping)
  • size of backups
  • length of time it takes to complete the backup
  • length of time it takes to do a restore (this might be important some day ;-)
  • size needed for tempdb
  • performance of triggers (for inserted and deleted tables that are stored in tempdb)
  • performance of row versioning (if using SNAPSHOT ISOLATION, since the version store is in tempdb)
  • ability to get new disk space when the CFO says that they just spent $1 million on a SAN last year and so they will not authorize another $250k for additional storage
  • length of time it takes to do INSERT and UPDATE operations
  • length of time it takes to do index maintenance
  • etc, etc, etc.

Wasting space has a huge cascade effect on the entire system. I wrote an article going into explicit detail on this topic: Disk Is Cheap! ORLY? (free registration required; sorry I don't control that policy).

Third: While some answers are incorrectly focusing on the "this is a small app" aspect, and some are correctly suggesting to "use what is appropriate", none of the answers have provided real guidance to the O.P. An important detail mentioned in the Question is that this is a web page for their school. Great! So we can suggest that:

  • Fields for Student and/or Faculty names should probably be NVARCHAR since, over time, it is only getting more likely that names from other cultures will be showing up in those places.
  • But for street address and city names? The purpose of the app was not stated (it would have been helpful), but assuming the address records, if any, pertain just to a particular geographical region (i.e. a single language / culture), then use VARCHAR with the appropriate Code Page (which is determined from the Collation of the field).
  • If storing State and/or Country ISO codes (no need to store INT / TINYINT since ISO codes are fixed length, human readable, and well, standard :) use CHAR(2) for two letter codes and CHAR(3) if using 3 letter codes. And consider using a binary Collation such as Latin1_General_100_BIN2.
  • If storing postal codes (i.e. zip codes), use VARCHAR since it is an international standard to never use any letter outside of A-Z. And yes, still use VARCHAR even if only storing US zip codes and not INT since zip codes are not numbers, they are strings, and some of them have a leading "0". And consider using a binary Collation such as Latin1_General_100_BIN2.
  • If storing email addresses and/or URLs, use NVARCHAR since both of those can now contain Unicode characters.
  • and so on (a schema sketch pulling these suggestions together follows below)....
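
Pulling those suggestions together, a hypothetical table for this kind of school app might look like the following (names and sizes are illustrative assumptions only):

    CREATE TABLE dbo.Student
    (
        StudentID   INT IDENTITY(1,1) PRIMARY KEY,
        FullName    NVARCHAR(100) NOT NULL,  -- names: expect other cultures over time
        Street      VARCHAR(100)  NULL,      -- single-region address data
        City        VARCHAR(60)   NULL,
        CountryCode CHAR(2) COLLATE Latin1_General_100_BIN2 NULL,    -- ISO 3166-1 alpha-2
        PostalCode  VARCHAR(10) COLLATE Latin1_General_100_BIN2 NULL,
        Email       NVARCHAR(320) NULL       -- email addresses can contain Unicode
    );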

Fourth: Now that you have NVARCHAR data taking up twice as much space as it needs to for data that fits nicely into VARCHAR ("fits nicely" = doesn't turn into "?"), and somehow, as if by magic, the application did grow and now there are millions of records in at least one of these fields where most rows are standard ASCII but some contain Unicode characters so you have to keep NVARCHAR, consider the following:

  1. If you are using SQL Server 2008 - 2016 RTM and are on Enterprise Edition, OR if using SQL Server 2016 SP1 (which made Data Compression available in all editions) or newer, then you can enable Data Compression. Data Compression can (but won't "always") compress Unicode data in NCHAR and NVARCHAR fields. The determining factors are:

     • NCHAR(1 - 4000) and NVARCHAR(1 - 4000) use the Standard Compression Scheme for Unicode, but only starting in SQL Server 2008 R2, AND only for IN ROW data, not OVERFLOW! This appears to be better than the regular ROW / PAGE compression algorithm.

     • NVARCHAR(MAX) and XML (and I guess also VARBINARY(MAX), TEXT, and NTEXT) data that is IN ROW (not off row in LOB or OVERFLOW pages) can at least be PAGE compressed, but not ROW compressed. Of course, PAGE compression depends on the size of the in-row value: I tested with VARCHAR(MAX) and saw that 6000 character/byte rows would not compress, but 4000 character/byte rows did.

     • Any OFF ROW data, LOB or OVERFLOW = No Compression For You!

  2. If using SQL Server 2005, or 2008 - 2016 RTM and not on Enterprise Edition, you can have two fields: one VARCHAR and one NVARCHAR. For example, let's say you are storing URLs which are mostly all base ASCII characters (values 0 - 127) and hence fit into VARCHAR, but sometimes have Unicode characters. Your schema can include the following 3 fields:

       ...
       URLa VARCHAR(2048) NULL,
       URLu NVARCHAR(2048) NULL,
       URL AS (ISNULL(CONVERT(NVARCHAR(2048), [URLa]), [URLu])),
       CONSTRAINT [CK_TableName_OneUrlMax] CHECK (
                         ([URLa] IS NOT NULL OR [URLu] IS NOT NULL)
                     AND ([URLa] IS NULL OR [URLu] IS NULL))
     );
    

    In this model you only SELECT from the [URL] computed column. For inserting and updating, you determine which field to use by seeing if converting alters the incoming value, which has to be of NVARCHAR type:

     INSERT INTO TableName (..., URLa, URLu)
     VALUES (...,
             IIF (CONVERT(VARCHAR(2048), @URL) = @URL, @URL, NULL),
             IIF (CONVERT(VARCHAR(2048), @URL) <> @URL, NULL, @URL)
            );
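
    (Note: IIF requires SQL Server 2012 or newer; on older versions, including the 2005 scenario this item covers, use an equivalent CASE expression.)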
    
  3. You can GZIP incoming values into VARBINARY(MAX) and then unzip on the way out:

    • For SQL Server 2005 - 2014: you can use SQLCLR. SQL# (a SQLCLR library that I wrote) comes with Util_GZip and Util_GUnzip in the Free version
    • For SQL Server 2016 and newer: you can use the built-in COMPRESS and DECOMPRESS functions, which are also GZip.
  4. If using SQL Server 2017 or newer, you can look into making the table a Clustered Columnstore Index.

  5. While this is not a viable option yet, SQL Server 2019 introduces native support for UTF-8 in VARCHAR / CHAR datatypes. There are currently too many bugs with it for it to be used, but if they are fixed, then this is an option for some scenarios. Please see my post, "Native UTF-8 Support in SQL Server 2019: Savior or False Prophet?", for a detailed analysis of this new feature.
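
For reference, hedged sketches of items 1 and 3 above (object names are placeholders):

    -- Item 1: data compression on an existing table (2008+ Enterprise Edition,
    -- or any edition from 2016 SP1 onward)
    ALTER TABLE dbo.BigTable REBUILD WITH (DATA_COMPRESSION = PAGE);

    -- Item 3: the built-in GZip functions (SQL Server 2016 and newer)
    DECLARE @Compressed VARBINARY(MAX) = COMPRESS(N'A long Unicode string ...');
    SELECT CAST(DECOMPRESS(@Compressed) AS NVARCHAR(MAX)) AS OriginalValue;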

Solomon Rutzky
  • 14
    **Slow Clap.** Simply amazed that "always use nvarchar" got 140 votes and this did not. Great work on this post. – schizoid04 Nov 23 '17 at 07:20
  • 1
    @schizoid04 Thanks. To be fair, the accepted answer was posted 7 years before mine, so there's a lot of traffic that voted on it (and/or various others) that never came back to reevaluate. Still, it provides a very solid counterpoint to the "wisdom of the crowd" theory that drives vote-based forums. There's too much misinformation out there. For example, [this](https://dba.stackexchange.com/a/191619/30859) on DBA.SE. The other answer, accepted before I posted mine, is "correct" by the narrowest of definitions, misleading, and contains info that I disprove in mine, yet it still outpaces mine. – Solomon Rutzky Nov 24 '17 at 18:46
  • 1
    This is a very, very good post except for one thing. I highly doubt that they are going to allow names in Chinese, Arabic or Sanskrit text anytime soon. A lot of people think accents make something Unicode. They don't. – PRMan Jul 24 '20 at 19:45
23

For your application, nvarchar is fine because the database size is small. Saying "always use nvarchar" is a vast oversimplification. If you're not required to store things like Kanji or other crazy characters, use VARCHAR; it'll use a lot less space. My predecessor at my current job designed something using NVARCHAR when it wasn't needed. We recently switched it to VARCHAR and saved 15 GB on just that table (it was heavily written to). Furthermore, if you then have an index on that table and you want to include that column or make a composite index, you've just made your index file size larger.

Just be thoughtful in your decision; in SQL development and data definitions there rarely seems to be a "default answer" (other than avoiding cursors at all costs, of course).

WebMasterP
11

Since your application is small, there is essentially no appreciable cost increase to using nvarchar over varchar, and you save yourself potential headaches down the road if you have a need to store unicode data.

tbreffni
8

Generally speaking: start out with the most expensive datatype that has the least constraints. Put it in production. If performance starts to be an issue, find out what's actually being stored in those nvarchar columns. Are there any characters in there that wouldn't fit into varchar? If not, switch to varchar. Don't try to pre-optimize before you know where the pain is. My guess is that the choice between nvarchar/varchar is not what's going to slow down your application in the foreseeable future. There will be other parts of the application where performance tuning will give you much more bang for the buck.
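
A hedged way to run that check (table and column names are placeholders): characters that don't fit the column's code page turn into "?" when converted, so a round-trip conversion flags the rows that genuinely need nvarchar.

    -- Rows whose values would be damaged by a switch to VARCHAR; the binary
    -- COLLATE keeps an accent-insensitive collation from masking differences
    SELECT SomeColumn
    FROM   dbo.SomeTable
    WHERE  SomeColumn COLLATE Latin1_General_BIN
           <> CONVERT(NVARCHAR(4000), CONVERT(VARCHAR(4000), SomeColumn));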

Kjetil Klaussen
  • Yikes. I really don't like this approach from an application developer's perspective. If code is written to expect Type A and you change it to Type B you have to do testing top to bottom all over again. My vote is to do the best you can to identify your expected data and go from there. – mateoc15 Feb 24 '22 at 15:45
7

For the last few years all of our projects have used NVARCHAR for everything, since all of these projects are multilingual. Imported data from external sources (e.g. an ASCII file, etc.) is up-converted to Unicode before being inserted into the database.

I've yet to encounter any performance-related issues from the larger indexes, etc. The indexes do use more memory, but memory is cheap.

Whether you use stored procedures or construct SQL on the fly, ensure that all string constants are prefixed with N (e.g. SET @foo = N'Hello world.';) so the constant is also Unicode. This avoids any string type conversion at runtime.

YMMV.

devstuff
  • 4
    You probably don't have several hundred million records in the tables you're working with. I agree that for most apps defaulting to nvarchar is fine, but not all. – Brandon Moore May 04 '12 at 19:57
7

I can speak from experience on this: beware of nvarchar. Unless you absolutely require it, this data type can destroy performance on a larger database. I inherited a database that was hurting in terms of performance and space. We were able to reduce the size of a 30GB database by 70%! There were some other modifications made to help with performance, but I'm sure the varchars helped out significantly with that as well. If your database has the potential for growing tables to a million+ records, stay away from nvarchar at all costs.

J.A
4

I deal with this question at work often:

  • FTP feeds of inventory and pricing - Item descriptions and other text were in nvarchar when varchar worked fine. Converting these to varchar reduced file size almost in half and really helped with uploads.

  • The above scenario worked fine until someone put a special character in the item description (maybe trademark, can't remember)

I still do not use nvarchar every time over varchar. If there is any doubt or potential for special characters, I use nvarchar. I find I use varchar mostly when I am in 100% control of what is populating the field.

K Richard
3

Why, in all this discussion, has there been no mention of UTF-8? Being able to store the full Unicode span of characters does not mean one always has to allocate two bytes per character (or "code point", to use the Unicode term). All of ASCII is valid UTF-8. Does SQL Server check VARCHAR() fields to ensure that the text is strict ASCII (i.e. top bit of each byte zero)? I would hope not.

If then you want to store unicode and want compatibility with older ASCII-only applications, I would think using VARCHAR() and UTF-8 would be the magic bullet: It only uses more space when it needs to.

For those of you unfamiliar with UTF-8, might I recommend a primer.
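
As the longer answer above notes, SQL Server 2019 did eventually add UTF-8 support to VARCHAR via dedicated collations; a sketch (not applicable to the SQL Server 2005 in the question):

    -- SQL Server 2019+: VARCHAR column storing UTF-8
    CREATE TABLE dbo.Demo
    (
        Txt VARCHAR(100) COLLATE Latin1_General_100_CI_AS_SC_UTF8 NULL
    );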

Tevya
  • 2
    What you are suggesting might work for some applications, but one must also consider the impact of an extra encoding layer on the way SQL text is processed. In particular, collations, searching, and pattern matching will be affected. And if reports are run against the database, standard reporting tools will not interpret the multi-byte characters correctly. And bulk imports and exports may be affected. I think that, over the long term, this scheme may be more trouble than it's worth. – Jeffrey L Whitledge Dec 10 '09 at 17:58
  • 1
    It is not possible to store UTF-8 in VARCHAR columns. MSSQL will *always* convert your UTF-8 data to the column collation. If you mess up the collation (like trying to store CP1252 in Latin_1) the conversion will not work and you'll end up with extra bytes in your data. It may *appear* to work fine when you convert latin_1 to UTF-8 (on the app side) and back again to latin_1 (db side) but it is just an illusion. You can sneak by the DB auto converting to your column collation by using freetds and setting the protocol to something less than 7, but you lose the ability to query nvarchar. – chugadie Aug 29 '13 at 16:10
  • 1
    @chugadie and Tevya: this answer is a bit nonsensical. SQL Server only uses UCS-2/UTF-16 to store Unicode data (i.e. XML and `N`-prefixed types). You don't get a choice of using UTF-8. Also, Unicode encodings (UTF-8, UCS-2/UTF-16, and UTF-32) can't be applied to VARCHAR fields. – Solomon Rutzky Sep 30 '15 at 15:34
2

There'll be exceptional instances when you'll want to deliberately restrict the data type to ensure it doesn't contain characters from a certain set. For example, I had a scenario where I needed to store the domain name in a database. Internationalisation for domain names wasn't reliable at the time so it was better to restrict the input at the base level, and help to avoid any potential issues.
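
A sketch of that kind of base-level restriction (hypothetical table and pattern, written for pre-IDN domain names):

    -- Reject anything outside letters, digits, dots and hyphens
    ALTER TABLE dbo.Sites ADD CONSTRAINT CK_Sites_DomainName
        CHECK (DomainName NOT LIKE '%[^A-Za-z0-9.-]%' COLLATE Latin1_General_BIN);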

Chris Halcrow
1

If you are using NVARCHAR just because a system stored procedure requires it - the most frequent occurrence being, inexplicably, sp_executesql - and your dynamic SQL is very long, you would be better off from a performance perspective doing all string manipulations (concatenation, replacement etc.) in VARCHAR and then converting the end result to NVARCHAR and feeding it into the proc parameter. So no, do not always use NVARCHAR!
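
A minimal sketch of that pattern (the query text is a stand-in):

    DECLARE @sqlA VARCHAR(MAX) = 'SELECT ... FROM ... WHERE ...';
    -- ...build the statement with cheap single-byte concatenations/REPLACEs...
    DECLARE @sql NVARCHAR(MAX) = CONVERT(NVARCHAR(MAX), @sqlA);  -- convert once
    EXEC sys.sp_executesql @sql;  -- the parameter must be NVARCHAR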

ajeh