This issue has nothing to do with UTF-8, given that the data being transmitted, as shown in the transmission packet (bottom of the question), is UTF-16 Little Endian (just as SQL Server would be expecting). And it is perfectly good UTF-16LE, all except for the missing final byte, just as the error message implies.
The problem is most likely a minor bug in freetds that incorrectly applies logic meant to strip trailing spaces from variable-length string fields. There are no trailing spaces, you say? Well, if the final byte hadn't gotten chopped off then it would be a little clearer (though if it hadn't gotten chopped off there wouldn't be this error). So, let's look at the packet to see if we can reconstruct it.
The error in the data is probably being overlooked because the packet contains an even number of bytes. But not all fields are double-byte, so the total doesn't need to be an even number. If we know what the good data is (prior to the error), then we can find a starting point in the data and move forwards. It is best to start with Ţ, as it is hopefully above the 255 / FF value and hence uses both bytes; anything below has a 00 for the high byte, and many of the characters here have a 00 on one side or the other. While we should be able to assume Little Endian encoding, it is best to know for certain. To that end, we need at least one character whose two bytes are both non-00 and different from each other (one of the characters, ā, is 01 for both bytes, and that does not help determine the ordering). The first character of this string field, Ţ, confirms this: it is Code Point 0162 yet shows up as 62 01 in the packet.
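A quick Python sketch (not part of the original tooling) makes the byte-order check concrete: U+0162 serialized little-endian matches the 62 01 seen in the packet, while big-endian would not.

```python
# Sanity check: how does U+0162 serialize in each byte order?
ch = "\u0162"  # Ţ, LATIN CAPITAL LETTER T WITH CEDILLA

print(ch.encode("utf-16-le").hex(" "))  # 62 01 -> matches the packet: Little Endian
print(ch.encode("utf-16-be").hex(" "))  # 01 62 -> what Big Endian would look like
```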
Below are the bytes of the string field, in the same order as in the packet. The first character's byte sequence of 62 01 gives us our starting point, so we can ignore the initial 00 13 00 of line 0040 (they have been removed in the copy below for readability). Please note that the "translation" shown to the right does not interpret Unicode, so the 2-byte sequence of 62 01 is displayed as 62 by itself (i.e. lower-case Latin "b") and 01 by itself (i.e. a non-printable character, displayed as ".").
0040 xx xx xx 62 01 61 00 77-00 2b 01 20 00 52 00 69 |...b.a.w .+. .R.i|
0050 00 66 00 01 01 18 ?? - |.f....|
As you can see, the last character is really 18 20 (i.e. Code Point 2018 -- the ‘ character -- byte-swapped due to the Little Endian encoding), not 01 18 as it might appear if reading the packet starting at the end. Somehow, the final byte -- hex 20 -- is missing, hence the "Unicode data is odd byte size" error.
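To make the reconstruction concrete, here is a small Python sketch that rebuilds the field's bytes from the dump (with the missing final 0x20 restored) and shows that dropping that byte leaves an odd-sized field:

```python
# The 20 bytes of the string field in UTF-16 LE, reconstructed from the dump,
# with the missing final 0x20 byte restored at the end.
good = bytes.fromhex("6201610077002b01200052006900660001011820")

print(good.decode("utf-16-le"))   # Ţawī Rifā‘
truncated = good[:-1]             # what actually arrived: the final 0x20 is gone
print(len(good), len(truncated))  # 20 19 -> the truncated field has an odd byte count
```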
Now, 20 by itself, or followed by 00, is a space. This would explain why @GordThompson was able to get it working by adding an additional character to the end (the final character was no longer trimmable). This could be further proven by ending with another character that is a U+20xx Code Point. For example, if I am correct about this, then ending with ⁄ -- Fraction Slash U+2044 -- would produce the same error, while ending with ⅄ -- Turned Sans-Serif Capital Y U+2144 -- even with the ‘ just before it, should work just fine (@GordThompson was kind enough to prove that ending with ⅄ did work, and that ending with ⁄ resulted in the same error).
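The suspected behavior can be modeled in a few lines of Python (a hypothetical sketch only; the actual freetds rtrim is C code operating on the raw buffer): a byte-oriented trim that strips trailing 0x20 bytes, ignoring UTF-16 code-unit boundaries, eats the high byte of any trailing U+20xx character.

```python
def naive_rtrim(data: bytes) -> bytes:
    """Hypothetical model of a byte-wise space trim that ignores UTF-16 boundaries."""
    while data.endswith(b"\x20"):
        data = data[:-1]
    return data

# U+2018 (‘) ends in byte 0x20, so its high byte gets stripped -> odd length:
print(len(naive_rtrim("Ţawī Rifā‘".encode("utf-16-le"))))  # 19 (was 20)
# U+2044 (⁄) also ends in 0x20 and breaks; U+2144 (⅄) ends in 0x21 and survives:
print(len(naive_rtrim("⁄".encode("utf-16-le"))))  # 1 (was 2)
print(len(naive_rtrim("⅄".encode("utf-16-le"))))  # 2 (intact)
```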
If the input file is null (i.e. 00) terminated, then it could simply be the 20 00 ending sequence that does it, in which case ending with a newline might fix it. This can also be proven by testing a file with two lines: line 1 is the existing row from bad.txt, and line 2 is a line that should work. For example:
291054 Ţawī Rifā‘
999999 test row, yo!
If the two-line file shown directly above works, that proves that it is the combination of a U+20xx Code Point and that Code Point being the last character (of the transmission more than of the file) that exposes the bug. BUT, if this two-line file also gets the error, then it proves that having a U+20xx Code Point as the last character of a string field is the issue (and it would be reasonable to assume that this error would happen even if the string field were not the final field of the row, since the null terminator for the transmission has already been ruled out in this case).
It seems like either this is a bug in freetds / freebcp, or perhaps there is a configuration option to tell it not to attempt trimming trailing spaces, or maybe a way to get it to see this field as being NCHAR instead of NVARCHAR.
UPDATE
Both @GordThompson and the O.P. (@NeilMcGuigan) have tested and confirmed that this issue exists regardless of where the string field is in the file: in the middle of a row, at the end of the row, on the last row, and not on the last row. Hence it is a general issue.
And in fact, I found the source code and it makes sense that the issue would happen, since there is no consideration for multi-byte character sets. I will file an Issue on the GitHub repository. The source for the rtrim function is here:
https://github.com/FreeTDS/freetds/blob/master/src/dblib/bcp.c#L2267
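For contrast, a trim that respects 16-bit code units would only ever remove the full 20 00 sequence of an actual space. This is just an illustrative Python sketch of the idea, not the actual fix for the C code:

```python
def utf16le_rtrim(data: bytes) -> bytes:
    """Trim trailing spaces by whole UTF-16 LE code units (20 00), not raw bytes."""
    while len(data) >= 2 and data[-2:] == b"\x20\x00":
        data = data[:-2]
    return data

print(utf16le_rtrim("ab  ".encode("utf-16-le")).decode("utf-16-le"))  # ab
# U+2018 at the end is untouched, since 18 20 is not the space code unit 20 00:
print(len(utf16le_rtrim("Ţawī Rifā‘".encode("utf-16-le"))))  # 20 (intact)
```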
Regarding this statement:
The SQL Server uses UTF-16LE (though TDS starts with UCS-2LE and switches over I believe)
From an encoding standpoint, there is really no difference between UCS-2 and UTF-16: the byte sequences are identical. The only difference is in the interpretation of Surrogate Pairs (i.e. Code Points above U+FFFF / 65535). UCS-2 has the Code Points used to construct Surrogate Pairs reserved, but at that time there was no implementation of any Surrogate Pairs; UTF-16 simply added the implementation of Surrogate Pairs in order to create Supplementary Characters. Hence, SQL Server stores and retrieves UTF-16 LE data without a problem. The only issue is that the built-in functions don't know how to interpret Surrogate Pairs unless the Collation ends with _SC (for Supplementary Characters), and those Collations were introduced in SQL Server 2012.
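To illustrate the Surrogate Pair point with a concrete example (a Python sketch; the character chosen here is just an arbitrary Supplementary Character):

```python
# U+1F600 is above U+FFFF, so UTF-16 encodes it as the surrogate pair D83D DE00.
emoji = "\U0001F600"

print(emoji.encode("utf-16-le").hex(" "))  # 3d d8 00 de
# UCS-2 sees two reserved code points (D83D, DE00); UTF-16 interprets the pair
# as one Supplementary Character. The bytes stored are identical either way.
```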