13

French and Spanish have special chars in them that are not used in normal English (accented vowels and such).

Are those chars supported in a varchar? Or do I need a nvarchar for them?

(NOTE: I do NOT want a discussion on if I should use nvarchar or varchar.)

Vaccano
  • 78,325
  • 149
  • 468
  • 850

5 Answers5

16

What SQL Implementation(s) are you talking about?

I can speak about Microsoft Sql Server; other SQL implementations, not so much.

For Microsoft SQL Server, the default collation is SQL_Latin1_General_CP1_CI_AS (Latin 1 General, case-preserving, case-insensitive, accent-sensitive). It allows the round-trip representation of most western European languages in single-byte form (varchar) rather than double-byte form (nvarchar).

It's built on the "Windows 1252" code page. That code page is effectively ISO-8859-1 with the code point range 0x80–0x9F being represented by an alternate set of glyphs, including the Euro symbol at 0x80. ISO-8859-1 specifies that code point range as control characters, which have no graphical representation.

ISO-8859-1 consists of the first 256 characters of Unicodes Basic Multilinigual Plane, covering the entire domain of an 8-bit character (0x00–0xFF). For details and comparison see

Western European languages that will have a hard time with this collating sequence include (but aren't necessarily limited to) Latvian, Lithuanian, Polich, Czech and Slovak. If you need to support those, you'll either need to use a different collation (SQL Server offers a plethora of collations), or move to using nvarchar.

One should note that mixing collations within a database tends to cause problems. Deviating from the default collation should be done only when necessary and with an understanding of how you can shoot yourself in the foot with it.

I suspect Oracle and DB2 provide similar support. I don't know about MySQL or other implementations.

Nicholas Carey
  • 71,308
  • 16
  • 93
  • 135
  • This statement is wrong: “the "Windows 1252" code page, which is ISO-8859-1”. No it’s not! **Windows-1252 is not ISO-8859-1. Windows-1252 is not ISO-8859-1. Windows-1252 is not ISO-8859-1.** That which I tell you three times is true. – tchrist Aug 25 '11 at 13:05
  • 2
    Then perhaps you should actually **read** the sentence as written: *It's built on the "Windows 1252" code page, **which is ISO-8859-1, with the code point range 0x80-0x9F...replaced by alternate glyphs**.* Which is a true statement. That is the delta between ISO-8859-1 and Windows 1252. – Nicholas Carey Aug 25 '11 at 16:50
  • 2
    This is the “Eat Shoots and Leaves” problem: a misplaced comma changes the sense of the sentence altogether. By putting the comma in front of the *with*, you changed it into something it is not. It’s just like the difference between “I’ll ask my friend, who [just happens] to eat brocolli” and “I’ll ask my friend [the one] who eats brocolli.” See? It would be a lot clearer if it read “It’s built on Windows-1252, which is like ISO-8859-1 except that the codepoint range…” Note that “code page” is a foreign term to many of us and certainly carries no added information, and so should be omitted. – tchrist Aug 25 '11 at 17:32
  • It’s also very evil to pretend that Latin1 means Windows-1252. It does not! **Latin1 is an IANA-registered alias for ISO-8859-1**, *not* for Windows-1252. This is just another foul example of Microsoft hoodwinking its users by corrupting established formal standards and calling things by misleadingly inappropriate — and incorrect — names. – tchrist Aug 25 '11 at 17:35
  • 5
    I think you need to lay off any more caffiene this morning. Please take your editorializing elsewhere. – Nicholas Carey Aug 25 '11 at 18:41
  • Perhaps worth mentioning, the default collation can be determined by using `SELECT SERVERPROPERTY('Collation')` – Pooven Aug 23 '12 at 16:33
  • Maybe nitpicking, but Latvian, Lithuanian, Polish, Czech and Slovak aren't Western European languages, and in particular not in the context of character sets. Latvian and Lithuanian are 'North European', Latin-4 / ISO-8859-4, and Polish, Czech and Slovak are 'Central European', Latin-2 / ISO-8859-2. – stevek_mcc Jan 20 '15 at 14:30
5

You have to use nvarchar.

http://theniceweb.com/archives/156

Most of the characters will fit in varchar but some won't, why take the risk.

Related Question

When must we use NVARCHAR/NCHAR instead of VARCHAR/CHAR in SQL Server?

Community
  • 1
  • 1
msarchet
  • 15,104
  • 2
  • 43
  • 66
5

The characters that can be stored in a varchar field depend entirely on what code page is defined for that particular field. If there are specific characters that you want to store, then you can choose a code page that will store those characters, and it should work. Badly.

My advice is to always use nvarchar to store strings in a SQL database. In fact, I would consider non-Unicode character encodings to be a bug, whether it is in a database or anywhere else.

Your operating system uses Unicode internally (whether Windows, Mac, Linux, or whatever). The JVM and the .NET Framework use Unicode internally. There is simply no point to doing code page conversions every time you query a database. There is no point to doing code page conversions every time you write to a database. Just use an nvarchar column, and your strings will go straight from your application to the database untouched—no character conversion lookups, no fallback encoding error handlers, no wierd characters or unexpected question marks.

By using nvarchar for all of your string data in your databases—and Unicode in general everywhere—you can stop concerning yourself with encodings and focus on the core functionality of your application, now and forever.

Today is the day to abandon legacy character encodings.

Do it for the maintainers who are coming after you. Do it for your children. Do it for yourself.

Jeffrey L Whitledge
  • 58,241
  • 9
  • 71
  • 99
  • Alas, one of the primary reasons for this data base is to talk to a medical system that filters out all Unicode. Because of that, any communications I have with that system need to be in non-unicode. Thus I need to force non-unicode for any data I gather. This is just one of those times that the situation makes unicode un-needed. – Vaccano Aug 25 '11 at 18:04
  • 2
    @Vaccano - Consider this: you could store the data in your system correctly, and apply the filter when communicating with the other (legacy) system. Then when the other system finally upgrades, your system and all of your data will be ready. After all, your application (which uses Unicode internally if it is built with any modern platform) will have *many* more points of communication with your data than the external system will. And furthermore, just for the principle of the thing, people’s names should be spelled correctly! – Jeffrey L Whitledge Aug 26 '11 at 04:47
  • Oh, and another very important point! varchar columns can be set to a variety of code pages. If you decide to stick with varchar against my advice, then make sure the code page you choose matches whatever code page is being used by the other system that you are communicating with. Then at least the errors will be consistent, and you won't be creating any additional errors. If you take my advice and go full-on Unicode everywhere, then at the points of communication with the other system you will have to do an encoding conversion, and it is likewise vital that code page matches the other system. – Jeffrey L Whitledge Aug 26 '11 at 04:52
2

Some excellent information, particularly from Nicholas Carey, but nobody directly gave a yes/no answer to your question...

Yes, you can use varchar to handle a mix of French and Spanish, providing your character set is Windows-1252 (or a similar modern superset of ISO-8859-1 with a few extra characters like the Euro symbol). In SQL Server, the character set is chosen by setting the collation (server-wide, per database or per column): Windows-1252 is used by the *Latin1* collations. In MySQL, Windows-1252 is called Latin1.

Note that if you try to store a character outside the repertoire of the chosen character set, the system may throw an error, or silently munge the character into a similar one from its repertoire. E.g. SQL Server will munge a Polish Ł to a simple L, but throw an error for a Japanese character.

stevek_mcc
  • 360
  • 4
  • 9
2

I'm not sure but one of these collations may fit both Spanish and French, this would have to be researched though.

http://dev.mysql.com/doc/refman/5.5/en/charset-charsets.html

Caimen
  • 2,623
  • 2
  • 26
  • 43