A recent whitehat scan made me aware of SQL Server's best-fit Unicode transformations. When a string containing Unicode characters is converted to a non-Unicode string, SQL Server does a best-fit replacement on the characters it can, rather than trashing your data with question marks. For example:
SELECT 'ŤĘŞŤ'
Outputs "TEST"
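Python's `unicodedata` module isn't SQL Server's best-fit table, but Unicode compatibility decomposition (NFKD) illustrates the same idea for this particular string: the accented letters decompose into ASCII base letters plus combining marks, and dropping the marks leaves the "similar" plain characters. A rough sketch:

```python
import unicodedata

# "ŤĘŞŤ" decomposes into base letters plus combining marks under NFKD;
# stripping the combining marks leaves only the ASCII base letters.
# Note: this approximates the effect, not SQL Server's actual mapping table.
s = "ŤĘŞŤ"
decomposed = unicodedata.normalize("NFKD", s)
stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
print(stripped)  # TEST
```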
Each character is replaced with a "similar" ASCII equivalent. The same thing happens at the single-character level, where Unicode character 65308 (U+FF1C, the fullwidth less-than sign) is converted into ASCII character 60 ('<').
SELECT ASCII(NCHAR(65308))
Outputs "60"
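For this specific character, Unicode compatibility composition (NFKC) happens to perform the same fold, which shows the mapping outside of SQL Server (again, NFKC is a related Unicode mechanism, not SQL Server's actual best-fit table):

```python
import unicodedata

ch = chr(65308)                    # U+FF1C FULLWIDTH LESS-THAN SIGN
folded = unicodedata.normalize("NFKC", ch)
print(ord(ch), "->", ord(folded))  # 65308 -> 60
```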
The main question is: where the heck is this documented? I have Googled all sorts of phrases and read the Microsoft docs, but all I can find are people looking to do manual conversions, and nothing that documents SQL Server's apparent automatic best-fit Unicode transformations. Furthermore, can this behavior be turned off or configured?
While the behavior is convenient for apps that do not store strings as Unicode, and probably goes completely unnoticed in most scenarios, penetration tests flag it as a "high" finding, since these transformations can be used to circumvent validation routines and lead to vulnerabilities such as XSS.
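To make the attack concrete, here is a sketch of the bypass pattern. The `is_safe` validator is hypothetical, and NFKC normalization stands in for the best-fit conversion that SQL Server would apply on insert into a varchar column: the payload contains no literal angle brackets when it is validated, but they appear after the conversion.

```python
import unicodedata

def is_safe(s: str) -> bool:
    """Hypothetical input validator that blocks raw angle brackets."""
    return "<" not in s and ">" not in s

# Fullwidth < (U+FF1C) and > (U+FF1E) sail past the check...
payload = "\uFF1Cscript\uFF1E"
assert is_safe(payload)

# ...but a best-fit-style conversion (NFKC used here as a stand-in)
# folds them back into the ASCII characters the validator tried to block.
stored = unicodedata.normalize("NFKC", payload)
print(stored)  # <script>
assert not is_safe(stored)
```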