Why does SQL Server not behave consistently when dealing with half-emojis?

Question

When an NVARCHAR attribute in a SQL Server database contains an emoji, string functions and operators behave in different ways (default database collation is SQL_Latin1_General_CP1_CI_AS).

Behavior of string functions

Functions like LEFT() and LEN() see the emoji as multiple, separate characters (LEFT will cut the emoji in half and return a partial value).

DECLARE @YourString NVARCHAR(12) = N'Thank you '
SELECT @YourString, LEFT(@YourString, 11), LEN(@YourString)

The return values are "Thank you ", "Thank you �", and 12.

Behavior of operators

Operators such as UNION, INTERSECT, and EXISTS treat both the emoji and the half-emoji as a single, identical value, though they return different results depending on the order the values arrive in.

Behavior of the UNION operator

UNION treats these as identical records (since it only returns one value), but will nevertheless yield a different result depending on the order, with the bottom record being returned.

SELECT @YourString UNION SELECT LEFT(@YourString, 11) returns "Thank you �" (the value from the bottom half)

SELECT LEFT(@YourString, 11) UNION SELECT @YourString returns "Thank you " (the value from the top half)

Behavior of the INTERSECT and EXISTS operators

INTERSECT and EXISTS also treat these as identical values, but will return the value from the top record (which makes sense given the purpose of those operators, but nonetheless feels weird after seeing UNION do the opposite).

SELECT @YourString INTERSECT SELECT LEFT(@YourString, 11) returns "Thank you ".

SELECT LEFT(@YourString, 11) INTERSECT SELECT @YourString returns "Thank you �".

SELECT @YourString EXCEPT SELECT LEFT(@YourString, 11) returns no result.

Summary

String functions such as LEFT() and LEN() treat these characters as separate values.
The UNION operator treats these as identical values, but will nevertheless return varying results (with preference given to the value from the bottom half of the operator)
The EXISTS and INTERSECT operator treats these as identical values, but will return the value from the top half

Question

Why would two different values be treated as a single, identical character by some operators (UNION, INTERSECT, EXCEPT), but be interpreted as two different values by string functions (LEFT, LEN)?

Bonus Question

Why would the UNION operator be returning the second value it sees?

supplementary character (4 bytes)..use a _SC collation (considered as 1 char) `left( N'Thank you ' collate Latin1_General_100_CI_AS_SC, 11)`. https://learn.microsoft.com/en-us/sql/relational-databases/collations/collation-and-unicode-support?view=sql-server-ver16#Supplementary_Characters `When using SC collations, the integer_expression parameter counts a UTF-16 surrogate pair as one character ` https://learn.microsoft.com/en-us/sql/t-sql/functions/left-transact-sql?view=sql-server-ver16#arguments — lptr, Nov 18 '22 at 21:10
Generally characters in `nvarchar` take 2 bytes each. `SELECT CAST(N'' AS VARBINARY(10))` returns `0x3DD803DE` so is 4 bytes. See this answer for more about that https://stackoverflow.com/a/63716308/73226. — Martin Smith, Nov 18 '22 at 21:57
@MartinSmith The OP appears to know that, having called them half-emojis. The question is why some operators treat `@YourString` and `LEFT(@YourString, 11)` as equal and some do not. — GSerg, Nov 18 '22 at 21:58
@GSerg - there are multiple questions. The issue as to why `LEFT` counts code points rather than characters is very well addressed in the answer I link — Martin Smith, Nov 18 '22 at 21:59
And the issue around odd equality semantics will be at least partially down to using antiquated collations https://stackoverflow.com/a/47551803/73226 — Martin Smith, Nov 18 '22 at 22:01
@MartinSmith `at least paritially` - no, `@YourString collate Latin1_General_100_CI_AS` fixes all of it. — GSerg, Nov 18 '22 at 22:15
There seems to be some mix-up in your examples: (1) If `INTERSECT` returns a result, doesn't that imply that the values *are* identical? (2) Your second and fourth example SQLs are exactly the same, but you claim that they yield different results. — Heinzi, Nov 18 '22 at 22:22
Indeed, the examples are mixed up. There is no inconsistency. Under `SQL_Latin1_General_CP1_CI_AS` all operators (`=`, `union`, `except`, `intersect`) [see the strings as equal](https://dbfiddle.uk/TUH3PpsE). Under `Latin1_General_100_CI_AS` [they all don't](https://dbfiddle.uk/vSEz-IYk). — GSerg, Nov 18 '22 at 22:43
Thanks to those who pointed out the problem in my sample queries - I'd apparently gone cross-eyed when trying to copy my code into the question and format things correctly. I've revised the question to remove the incorrect examples, and clarified that my question is more about the differences between how string functions (such as `LEFT`) and operators treat the same text. — AHiggins, Nov 21 '22 at 14:56
In reading over the various links and answers provided, the consensus answers to my (revised and clarified questions) seem to be that the collation settings in my database are driving all of the operator behavior (whether they do or do not see the two values as different), but not driving the string function behavior (which, regardless of collation, are seeing the two values as different). If I want the operators to behave the same way as the string functions, I'd need to change collation (with all the various changes that implies) — AHiggins, Nov 21 '22 at 15:01

Heinzi · Answer 1 · 2022-11-19T06:14:05.877

I'm afraid I don't see any inconsistence here: Using SQL_Latin1_General_CP1_CI_AS, all set and comparison operations treat the two values as equal (fiddle):

CREATE TABLE t1 (s1 NVARCHAR(12) COLLATE SQL_Latin1_General_CP1_CI_AS);
CREATE TABLE t2 (s2 NVARCHAR(12) COLLATE SQL_Latin1_General_CP1_CI_AS);
INSERT INTO t1 VALUES (N'Thank you ');
INSERT INTO t2 SELECT LEFT(s1, 11) FROM t1;

SELECT * FROM t1;
SELECT * FROM t2;

SELECT s1 FROM t1 UNION SELECT s2 FROM t2;     -- only s1
SELECT s1 FROM t1 EXCEPT SELECT s2 FROM t2;    -- none
SELECT s2 FROM t2 EXCEPT SELECT s1 FROM t1;    -- none
SELECT s1 FROM t1 INTERSECT SELECT s2 FROM t2; -- only s1

SELECT * FROM t1 INNER JOIN t2 ON s1 = s2;     -- one row with s1 and s2

As soon as we change the collation to Latin1_General_100_CI_AS, all operations treat the two values as not equal (fiddle):

CREATE TABLE t1 (s1 NVARCHAR(12) COLLATE Latin1_General_100_CI_AS);
CREATE TABLE t2 (s2 NVARCHAR(12) COLLATE Latin1_General_100_CI_AS);
INSERT INTO t1 VALUES (N'Thank you ');
INSERT INTO t2 SELECT LEFT(s1, 11) FROM t1;

SELECT * FROM t1;
SELECT * FROM t2;

SELECT s1 FROM t1 UNION SELECT s2 FROM t2;     -- both
SELECT s1 FROM t1 EXCEPT SELECT s2 FROM t2;    -- only s1
SELECT s2 FROM t2 EXCEPT SELECT s1 FROM t1;    -- only s2
SELECT s1 FROM t1 INTERSECT SELECT s2 FROM t2; -- none

SELECT * FROM t1 INNER JOIN t2 ON s1 = s2;     -- no result

The reason why the legacy encoding considers those two values equal can be found in this question:

SQL Query Where Column = '' returning Emoji characters 🎃 and 🍰

You probably already know this, but I'd like to explicitly point out that the fact that a = b evaluates to true for two values with different byte contents and different lengths is in itself not exceptional: All case-insensitive collations treat 'A' and 'a' as equal, and even case-sensitive collations treat 'A' (length 1) and 'A ' (length 2) as equal, since SQL Server ignores trailing spaces.

Very helpful explanation of the operator behavior; I've edited my question based on the feedback to clarify the exact behaviors I described as "inconsistent", since I had originally worded it in a way that was confusing. — AHiggins, Nov 21 '22 at 14:53