
A recent whitehat scan made me aware of SQL Server's "best fit" unicode transformations. This means that when a string containing unicode characters is converted to a non-unicode string, SQL Server does a best-fit replacement on the characters it can, rather than trashing your data with question marks. For example:

SELECT 'ŤĘŞŤ'

Outputs "TEST"

Each character is replaced with a "similar" ASCII equivalent. The same behavior can be seen on a single character, where unicode character 65308 (＜, the fullwidth less-than sign) is converted into ASCII character 60 (<).

SELECT ascii(NCHAR(65308))

Outputs "60"

The main question is: where the heck is this documented? I have Googled all sorts of phrases and read the Microsoft docs, but all I can find are people looking to do manual conversions, and nothing that documents SQL Server's apparent automatic best-fit unicode transformations. Furthermore, can this be turned off or configured?

While the behavior is convenient for apps that do not store strings as unicode and probably goes completely unnoticed in most scenarios, penetration tests report this as a "high" vuln, since unicode transformations can be used to circumvent validation routines and lead to vulns such as XSS.
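To make the attack concrete, here is a small sketch of the bypass pattern. The validator below is hypothetical, and NFKC normalization merely simulates a downstream best-fit style conversion (it is not SQL Server's mechanism); the point is that characters which pass validation can turn into dangerous characters after conversion:

```python
import unicodedata

def naive_validator(s: str) -> bool:
    """Hypothetical input filter that blocks literal angle brackets
    to "prevent" XSS."""
    return '<' not in s and '>' not in s

# Fullwidth angle brackets (U+FF1C / U+FF1E) sail past the filter...
payload = '\uFF1Cscript\uFF1E'
assert naive_validator(payload)

# ...but a best-fit style conversion (simulated here with NFKC)
# turns them back into exactly what the filter tried to catch.
assert unicodedata.normalize('NFKC', payload) == '<script>'
print("validator bypassed")
```

This is why validation should happen after (or in the same encoding context as) any such conversion, and why whitelisting beats blacklisting.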

Brad Wood
    A friend pointed out this which I did not find in my initial Googling. It helps with the "why" but not with the documentation or modification of the behavior. http://dba.stackexchange.com/questions/76781/converting-a-unicode-value-to-a-non-unicode-value-sql-server – Brad Wood Sep 22 '15 at 01:39
  • This starts to dig pretty deep into the "how" and "why" of collation. It'll make you pull your hair out. Plus, simply changing the collation on your database can potentially change what SELECT ascii(NCHAR(65308)) returns for you. The OWASP preso in your link is a pretty good one for the "why". It talks about SQL Smuggling SQLi with homoglyphic transformations and such, and it demonstrates how sneaky some people can be to Bobby Tables you. Or worse. http://www.it-docs.net/ddata/4954.pdf << The whitepaper that preso came from. I wonder if CF/Java's canonicalize operations catch these. – Shawn Sep 22 '15 at 02:53
  • Looks like it might be a little deeper than I thought. This goes all the way to the Unicode specification (http://unicode.org/) and the different vendors methods of supporting that spec. And now I know more about http://www.unicode.org/faq/normalization.html than I ever knew I wanted to know. :-) – Shawn Sep 22 '15 at 03:22
  • Shawn, CFML's canonicalize() function (from the ESAPI lib) does not affect Unicode characters since they are already in their canonical form, meaning they are not escaped in any way. – Brad Wood Sep 22 '15 at 04:06
  • What encoding is that database using? You definitely want to make sure if you insert `'ŤĘŞŤ'` the database actually stores `'ŤĘŞŤ'` (not `'????'`, or some approximation like `'TEST'`). – roeland Sep 22 '15 at 06:39
  • Roeland, the default collation is SQL_Latin1_General_CP1_CI_AS, but please read the post again. The approximation is what's being stored, but that's not always the desirable behavior. – Brad Wood Sep 22 '15 at 16:35
  • Sorry, Brad. I wasn't clear and was really more just thinking out loud. I know canonicalize() breaks things down to their simplest forms and that Unicode characters are already a simple form, but I was more wondering aloud about how the secondary operations of Java and CF's canonicalize() would handle these in double/multi encoding. If nothing else, this really shows 1) why you should ALWAYS validate your input and 2) why blacklisting is hopeless and whitelisting is a better way to go. :-) – Shawn Sep 22 '15 at 22:07
  • Hmmm... http://owasp-esapi-java.googlecode.com/svn/trunk_doc/latest/org/owasp/esapi/reference/DefaultEncoder.html#canonicalize(java.lang.String, boolean, boolean) http://owasp-esapi-java.googlecode.com/svn/trunk_doc/latest/org/owasp/esapi/Encoder.html – Shawn Sep 22 '15 at 22:07
  • And back to your original question, I still think this gets into the default implementation of Unicode on your Windows server and how it handles homoglyphic transformations. I'm honestly not sure how or where you could change that. – Shawn Sep 22 '15 at 22:08
  • @Shawn and Brad: please see my answer for the location of the complete list of transformations. I see that Brad referenced the question on DBA.SE that I originally documented this in, but it seems that Brad looked at the question in Sept and I answered a few months later ;-). And no, I can't imagine that these mappings are configurable, at least not without editing some system DLLs. – Solomon Rutzky Feb 24 '16 at 19:09

1 Answer


(the following is an excerpt from my answer to the related question on DBA.StackExchange: Automatic Translation when Converting Unicode to non-Unicode / NVARCHAR to VARCHAR)

These "best fit" mappings are documented, just not in the easiest of places to find. If you go to the following URL you will see a list of several files, each one named for the Code Page that it maps Unicode characters to:

ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/

Most of the files were last updated (or at least placed there) on 2006-10-04, and one of them was updated on 2012-03-14. The first part of those files maps ASCII codes into an equivalent Unicode Code Point. But the second part of each file maps the Unicode characters into their ASCII "equivalents".

I wrote a test script that uses the Code Page 1252 mappings to check if SQL Server is truly using those mappings. That can be determined by answering these two questions:

  1. For all mapped Code Points, does SQL Server convert them into the specified mappings?
  2. For all unmapped Code Points, does SQL Server convert any of them into a non-"?" character?

The test script is too long to place here, so I posted it on Pastebin at:

Unicode to Code Page mappings in SQL Server

Running the script will show that the answer to the first question above is "Yes" (meaning that all of the provided mappings are adhered to). It will also show that the answer to the second question is "No" (meaning, none of the unmapped Code Points convert into anything but the character for "unknown"). Hence, that mapping file is very accurate :-).

Furthermore, can this be turned off or configured?

I do not believe so, but that doesn't mean it is impossible to do one or both. However, it should be noted that these mappings are "Microsoft" mappings, and hence work with Windows and SQL Server; they are not SQL Server-specific. So even if it is possible to find where this stuff is configured, it would probably be a bad idea to change it, since it would affect everything running on the OS.

Solomon Rutzky