
This file works fine (UTF-8):

$ cat ok.txt
291054  Ţawī Rifā

This file causes an error (UTF-8):

$ cat bad.txt
291054  Ţawī Rifā‘

Here's the message:

$ freebcp 'DB.dbo.table' in bad.txt ... -c
Starting copy...
Msg 20050, Level 4
Attempt to convert data stopped by syntax error in source field

Msg 4895, Level 16, State 2
Server '...', Line 1
    Unicode data is odd byte size for column 2. Should be even byte size.
Msg 20018, Level 16
General SQL Server error: Check messages from the SQL Server

The only difference is the last character, which is Unicode code point U+2018 (left single quotation mark)

Any idea what is causing this error?

The SQL Server uses UTF-16LE (though TDS starts with UCS-2LE and switches over I believe)

The column in question is nvarchar(200)

Here's the packet sent right before the error:

packet.c:741:Sending packet
0000 07 01 00 56 00 00 01 00-81 02 00 00 00 00 00 08 |...V.... ........|
0010 00 38 09 67 00 65 00 6f-00 6e 00 61 00 6d 00 65 |.8.g.e.o .n.a.m.e|
0020 00 69 00 64 00 00 00 00-00 09 00 e7 90 01 09 04 |.i.d.... ...ç....|
0030 d0 00 34 04 6e 00 61 00-6d 00 65 00 d1 ee 70 04 |Ð.4.n.a. m.e.Ñîp.|
0040 00 13 00 62 01 61 00 77-00 2b 01 20 00 52 00 69 |...b.a.w .+. .R.i|
0050 00 66 00 01 01 18      -                        |.f....|
Neil McGuigan

4 Answers


Update: This issue has apparently been fixed in FreeTDS v1.00.16, released 2016-11-04.


I can reproduce your issue using FreeTDS v1.00.15. It definitely looks like a bug in freebcp that causes it to fail when the last character of a text field has a Unicode code point of the form U+20xx. (Thanks to @srutzky for correcting my conclusion as to the cause.) As you noted, this works ...

291054  Ţawī Rifā

... and this fails ...

291054  Ţawī Rifā‘

... but I found that this also works:

291054  Ţawī Rifā‘x

So, an ugly workaround would be to run a script against your input file that would append a low-order non-space Unicode character to each text field (e.g., x which is U+0078, as in the last example above), use freebcp to upload the data, and then run an UPDATE statement against the imported rows to strip off the extra character.
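A minimal sketch of such a pre-processing step in Java (assuming a tab-delimited file with the text field in column 2, as in the sample data; the file names are placeholders):

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class PadTextFields {
    public static void main(String[] args) throws IOException {
        Path in = Paths.get("bad.txt");
        Path out = Paths.get("bad-padded.txt");
        try (BufferedWriter w = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            for (String line : Files.readAllLines(in, StandardCharsets.UTF_8)) {
                String[] cols = line.split("\t", -1);
                // Append x (U+0078) so a U+20xx code point is never the
                // last character of the text field
                cols[1] = cols[1] + "x";
                w.write(String.join("\t", cols));
                w.newLine();
            }
        }
    }
}

After the upload, something like UPDATE dbo.freebcptest SET col2 = LEFT(col2, LEN(col2) - 1); (table and column names are placeholders) would strip the padding off again.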

Personally, I would be inclined to switch from FreeTDS to Microsoft's SQL Server ODBC Driver for Linux, which includes the bcp and sqlcmd utilities when installed using the instructions described here:

https://gallery.technet.microsoft.com/scriptcenter/SQLCMD-and-BCP-for-Ubuntu-c88a28cc

I just tested it under Xubuntu 16.04, and although I had to tweak the procedure a bit to use libssl.so.1.0.0 instead of libssl.so.0.9.8 (and the same for libcrypto), once I got it installed the bcp utility from Microsoft succeeded where freebcp failed.

If the SQL Server ODBC Driver for Linux will not work on a Mac then another alternative would be to use the Microsoft JDBC Driver 6.0 for SQL Server and a little bit of Java code, like this:

import java.sql.Connection;
import java.sql.DriverManager;

import com.microsoft.sqlserver.jdbc.SQLServerBulkCSVFileRecord;
import com.microsoft.sqlserver.jdbc.SQLServerBulkCopy;

String connectionUrl = "jdbc:sqlserver://servername:49242"
        + ";databaseName=myDb"
        + ";integratedSecurity=false";
String myUserid = "sa", myPassword = "whatever";

String dataFileSpec = "C:/Users/Gord/Desktop/bad.txt";
try (
        Connection conn = DriverManager.getConnection(connectionUrl, myUserid, myPassword);
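        // SQLServerBulkCSVFileRecord args: (source file, file encoding, field delimiter, first line contains column names?)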
        SQLServerBulkCSVFileRecord fileRecord = new SQLServerBulkCSVFileRecord(dataFileSpec, "UTF-8", "\t", false);
        SQLServerBulkCopy bulkCopy = new SQLServerBulkCopy(conn)) {
    fileRecord.addColumnMetadata(1, "col1", java.sql.Types.NVARCHAR, 50, 0);
    fileRecord.addColumnMetadata(2, "col2", java.sql.Types.NVARCHAR, 50, 0);
    bulkCopy.setDestinationTableName("dbo.freebcptest");
    bulkCopy.writeToServer(fileRecord);
} catch (Exception e) {
    e.printStackTrace(System.err);
}
Gord Thompson
  • Switching to Microsoft's SQL Server ODBC Driver for Linux *is* a client-side workaround, in case that was unclear from the answer. – Gord Thompson Nov 02 '16 at 23:52
  • I'm on a Mac, and I don't think that driver works on Mac/BSD – Neil McGuigan Nov 03 '16 at 00:18
  • I have added a JDBC option that should work on a Mac with Java. – Gord Thompson Nov 03 '16 at 15:52
  • I'd like to give you a partial bounty for your helpful response. Know how to do that? Thx – Neil McGuigan Nov 03 '16 at 20:58
  • @NeilMcGuigan - I'm not aware of a way to do that but thanks, I appreciate the thought. – Gord Thompson Nov 03 '16 at 22:15
  • @NeilMcGuigan I believe you can start as many bounties as you have points for. For this question, they will always be 500 pts since that is what the last one was and they are supposed to double each time but 500 is the max. You can then award it all to one answer or split any way you like among the answers. But at the very least, +1 from me, for both Gord's time in helping out with the testing (thanks again :), as well as the suggestion of using "Microsoft's SQL Server ODBC Driver for Linux" which will likely help others. – Solomon Rutzky Nov 04 '16 at 05:05

This issue has nothing to do with UTF-8, given that the data being transmitted, as shown in the transmission packet (bottom of the question), is UTF-16 Little Endian (just as SQL Server would be expecting). And it is perfectly good UTF-16LE, all except for the missing final byte, just like the error message implies.

The problem is most likely a minor bug in freetds that incorrectly applies logic meant to strip trailing spaces from variable-length string fields. There are no trailing spaces, you say? Well, if the final byte hadn't gotten chopped off it would be a little clearer (but if it hadn't gotten chopped off there wouldn't be this error). So, let's look at the packet to see if we can reconstruct it.

The error in the data is probably being overlooked because the packet contains an even number of bytes. But not all fields are double-byte, so the field data itself doesn't need to be an even number. If we know what the good data is (prior to the error), then we can find a starting point in the data and move forward. It is best to start with Ţ as it will hopefully be above the 255 / FF value and hence take 2 bytes. Anything below that will have a 00 byte, and many of the characters have a 00 on both sides. While we should be able to assume Little Endian encoding, it is best to know for certain. To that end, we need at least one character whose two bytes are both non-00 and different from each other (one of the characters, ā, is 01 for both bytes, and that does not help determine ordering). The first character of this string field, Ţ, confirms this as it is Code Point U+0162 yet shows up as 62 01 in the packet.

Below are the characters, in the same order as the packet, along with their UTF-16 LE byte sequences and Code Points. The first character's byte sequence of 62 01 gives us our starting point, and so we can ignore the initial 00 13 00 of line 0040 (they have been removed in the copy below for readability). Please note that the "translation" shown to the right does not interpret Unicode, so the 2-byte sequence of 62 01 is displayed as 62 by itself (i.e. lower-case Latin "b") and 01 by itself (i.e. a non-printable character, displayed as ".").

Ţ -- 62 01 -- U+0162
a -- 61 00 -- U+0061
w -- 77 00 -- U+0077
ī -- 2B 01 -- U+012B
(space) -- 20 00 -- U+0020
R -- 52 00 -- U+0052
i -- 69 00 -- U+0069
f -- 66 00 -- U+0066
ā -- 01 01 -- U+0101
‘ -- 18 20 -- U+2018 (the final 20 byte is missing from the packet)

0040 xx xx xx 62 01 61 00 77-00 2b 01 20 00 52 00 69 |...b.a.w .+. .R.i|
0050 00 66 00 01 01 18 ??   -                        |.f....|

As you can see, the last character is really 18 20 (i.e. a byte-swapped 20 18 due to the Little Endian encoding), not 01 18 as it might appear if reading the packet starting at the end. Somehow, the final byte -- hex 20 -- is missing, hence the Unicode data is odd byte size error.
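If you want to verify the byte-order reconstruction yourself, here is a minimal Java snippet (just an illustration; the class name is arbitrary):

import java.nio.charset.StandardCharsets;

public class ByteOrderCheck {
    public static void main(String[] args) {
        // Encode the first and last characters of the field as UTF-16LE
        for (char ch : new char[] { 'Ţ', '‘' }) {
            byte[] b = String.valueOf(ch).getBytes(StandardCharsets.UTF_16LE);
            // Prints "0162 -> 62 01" and "2018 -> 18 20": low-order byte first
            System.out.printf("%04x -> %02x %02x%n", (int) ch, b[0], b[1]);
        }
    }
}

The 18 in the packet is therefore the low-order byte of U+2018, and the missing byte is the high-order 20.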

Now, 20 by itself, or followed by 00, is a space. This would explain why @GordThompson was able to get it working by adding an additional character to the end (the final character was no longer trimmable). This could be further proven by ending with another character that is a U+20xx Code Point. For example, if I am correct about this, then ending with ⁄ -- Fraction Slash, U+2044 -- would have the same error, while ending with ⅄ -- Turned Sans-Serif Capital Y, U+2144 -- even with the ‘ just before it, should work just fine (@GordThompson was kind enough to prove that ending with ⅄ did work, and that ending with ⁄ resulted in the same error).

If the input file is null (i.e. 00) terminated, then it could simply be the 20 00 ending sequence that does it, in which case ending with a newline might fix it. This can also be proven by testing a file with two lines: line 1 is the existing row from bad.txt, and line 2 is a line that should work. For example:

291054  Ţawī Rifā‘
999999  test row, yo!

If the two-line file shown directly above works, that proves that it is the combination of a U+20xx Code Point and that Code Point being the last character (of the transmission more than of the file) that exposes the bug. BUT, if this two-line file also gets the error, then it proves that having a U+20xx Code Point as the last character of a string field is the issue (and it would be reasonable to assume that this error would happen even if the string field were not the final field of the row, since the null terminator for the transmission has already been ruled out in this case).

It seems that either this is a bug in freetds / freebcp, or there is a configuration option to keep it from trying to trim trailing spaces, or perhaps a way to get it to treat this field as NCHAR instead of NVARCHAR.

UPDATE

Both @GordThompson and the O.P. (@NeilMcGuigan) have tested and confirmed that this issue exists regardless of where the string field is in the file: in the middle of a row, at the end of the row, on the last row, and not on the last row. Hence it is a general issue.

And in fact, I found the source code and it makes sense that the issue would happen since there is no consideration for multi-byte character sets. I will file an Issue on the GitHub repository. The source for the rtrim function is here:

https://github.com/FreeTDS/freetds/blob/master/src/dblib/bcp.c#L2267
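The relevant logic (sketched below in Java purely for illustration; FreeTDS itself is C, and this is not their actual code) trims individual bytes rather than 2-byte UTF-16 code units, so any trailing 0x20 byte -- including the high-order byte of U+2018 -- looks like a trailing space:

public class RtrimSketch {
    // Naive byte-wise trim (the buggy behavior): strips ANY trailing 0x20
    // byte, including the high-order byte of U+2018 (18 20 in UTF-16LE),
    // which leaves an odd number of bytes.
    static int rtrimBytes(byte[] buf, int len) {
        while (len > 0 && buf[len - 1] == 0x20) {
            len--;
        }
        return len;
    }

    // A UTF-16LE-aware trim would step in 2-byte units and strip only the
    // pair 20 00 (an actual U+0020 space).
    static int rtrimUtf16le(byte[] buf, int len) {
        while (len >= 2 && buf[len - 2] == 0x20 && buf[len - 1] == 0x00) {
            len -= 2;
        }
        return len;
    }
}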


Regarding this statement:

The SQL Server uses UTF-16LE (though TDS starts with UCS-2LE and switches over I believe)

From an encoding stand-point, there is really no difference between UCS-2 and UTF-16. The byte sequences are identical. The only difference is in the interpretation of Surrogate Pairs (i.e. Code Points above U+FFFF / 65535). UCS-2 has the Code Points used to construct Surrogate Pairs reserved, but there was no implementation at that time of any Surrogate Pairs. UTF-16 simply added the implementation of the Surrogate Pairs in order to create Supplementary Characters. Hence, SQL Server stores and retrieves UTF-16 LE data without a problem. The only issue is that the built-in functions don't know how to interpret Surrogate Pairs unless the Collation ends with _SC (for Supplementary Characters), and those Collations were introduced in SQL Server 2012.
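A quick illustration of the "identical byte sequences" point (a minimal Java sketch; U+1F600 is just an arbitrary Supplementary Character):

public class SurrogatePairDemo {
    public static void main(String[] args) {
        // U+1F600 is above U+FFFF, so UTF-16 encodes it as a Surrogate Pair
        String s = new String(Character.toChars(0x1F600));
        System.out.println(s.length());                      // 2 code units
        System.out.println(s.codePointCount(0, s.length())); // 1 code point
        System.out.printf("%04x %04x%n",
                (int) s.charAt(0), (int) s.charAt(1));       // d83d de00
    }
}

The same two code units (D83D DE00) are stored either way; only the interpretation as one character vs. two differs.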

Solomon Rutzky
  • I hadn't considered the possible special significance of the missing byte being `20` - good point! – Gord Thompson Nov 03 '16 at 15:54
  • Confirmed. No error from freebcp when the last character is `⅄` instead of `‘`. – Gord Thompson Nov 03 '16 at 16:15
  • @GordThompson Excellent! Thank you for testing that :). Did you by any chance test ending with the Fraction Slash -- `⁄` -- to confirm that it also causes the error? – Solomon Rutzky Nov 03 '16 at 16:25
  • Yes, `⁄` as the last character of the string causes the error, too. You seem to have nailed it! – Gord Thompson Nov 03 '16 at 17:06
  • adding carriage return works too (for bad.txt but not ok.txt) – Neil McGuigan Nov 03 '16 at 19:29
  • @NeilMcGuigan Are you saying that adding a carriage return broke **ok.txt**? Also, what if there is a second line (in **bad.txt**), one that does not end in a character that is a _U+20xx_ Code Point? I am trying to determine if the issue is with it being just hex `20`, or if it is an issue with `20 00` where the `00` is the file terminator. If a second line in **bad.txt** allows it to work, then that proves it is the `00` at the end of the file that is the "perfect storm".. – Solomon Rutzky Nov 03 '16 at 19:31
  • @NeilMcGuigan Please see my update to my answer. I added some details regarding an additional test case to determine if the issue is isolated to the character being the last in the transmission or simply being the last character in a string field. I kinda want to believe that it is isolated to the end of the file as that is a far less frequent situation to have, else I would think this error would affect lots of people. – Solomon Rutzky Nov 03 '16 at 19:50
  • @GordThompson Thank you again for doing that additional testing. I have updated my answer with those results, as well as a new test case to narrow down if the bug is isolated to the character being at the end of the transmission vs the end of any string field. – Solomon Rutzky Nov 03 '16 at 19:51
  • correction, adding a CR works with both files! used `unix2dos` on them. originally ok-dos.txt was empty because I screwed something up so i thought it didn't work. – Neil McGuigan Nov 03 '16 at 20:13
  • @NeilMcGuigan Interesting that the error occurs even when it is _not_ the last line of the file. And thanks for re-testing the CR on **ok.txt** as that was not making much sense that it would break. But, now I don't get why adding a CR worked with **bad.txt**, but adding another line did not. Can you re-confirm that the 2nd line added to **bad.txt** and ending with a good character gets the same error? Also, what version of FreeTDS are you using? – Solomon Rutzky Nov 03 '16 at 20:17
  • adding "999 yo" to bad.txt does not help. unix2dos seems to fix everything. freetds docs say \n is the default though which is weird – Neil McGuigan Nov 03 '16 at 20:20
  • FWIW my tests with FreeTDS v1.00.15 are under Ubuntu, and are quite consistent. The U+20xx character will cause an error if it is the last character of a text field, even if that field is not the last field in the row or the last field in the file. I still get the same error with a second (innocuous) line in the data file. – Gord Thompson Nov 03 '16 at 20:21
  • @NeilMcGuigan Are you saying that with a **bad.txt** file with 2 lines, the 2nd line being "good", it errors _until_ you run it through **unix2dos**? Also, based on what Gord mentioned regarding the error happening even when it was not the final field, can you test that as well? Or does that apply to your situation? I ask because if you might run into these characters being at the end of fields that are not the final field of the row, then I am not sure that **unix2dos** will help (when the field is in the middle of the row). I am _trying_ to look through the source code, but am short on time. – Solomon Rutzky Nov 03 '16 at 20:25
  • fudge. actually if i add another column it goes back being an error... `999 Ţawī Rifā‘ stuff` errors out even after unix2dos. I think I can live w appending a space char to every column that might have this issue though – Neil McGuigan Nov 03 '16 at 20:30
  • @NeilMcGuigan Yeah, I was afraid of that. So, the source code is on [GitHub](https://github.com/FreeTDS/freetds). I think the problem is in their "rtrim" function, which is used in their [reference doc](http://www.freetds.org/reference/a00375.html#ga4d5be57b25a0654e2170fd656dc35489), but I need to actually download it and do a search for that function. If I can find the problem, it is possible that one of us (or Gord perhaps?) can fix it. Or at least I can post an Issue to their GitHub repository. – Solomon Rutzky Nov 03 '16 at 20:44
  • @GordThompson I found the problem in their source code. Please see my update. – Solomon Rutzky Nov 03 '16 at 20:55
  • @NeilMcGuigan I need to do a little more research tomorrow before I post the issue to the FreeTDS GitHub. So far I am leaning towards suggesting that instead of fixing the function to handle UTF-16, they should actually remove it entirely because why are they doing rtrim in the first place? Seems like it could easily lead to data corruption if the trailing spaces were supposed to be there. Just seems a bit presumptuous to do the rtrim as that should be handled on a case-by-case basis, once the data has been loaded. – Solomon Rutzky Nov 04 '16 at 05:10
  • i already posted it, and they already fixed it. amazing! great work. many thanks – Neil McGuigan Nov 04 '16 at 06:23
  • But ya you're right, seems unnecessary in first place – Neil McGuigan Nov 04 '16 at 06:27
  • @NeilMcGuigan Wow, that was fast! I just posted my related issue: https://github.com/FreeTDS/freetds/issues/81 . But at least it is currently working (in the sense of not erroring or silently removing Code Point U+2020 ;-). – Solomon Rutzky Nov 04 '16 at 17:32

This might be an encoding issue with the source file.

Since you are using non-ASCII characters, the source file itself should probably be Unicode. Other encodings use a varying number of bytes (one to three) to encode a single character. E.g., your Unicode 2018 is 0xE2 0x80 0x98 in UTF-8.
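Those bytes are easy to confirm, e.g. with a short Java check (just an illustration):

import java.nio.charset.StandardCharsets;

public class Utf8Check {
    public static void main(String[] args) {
        // U+2018 (left single quotation mark) encoded as UTF-8
        for (byte b : "\u2018".getBytes(StandardCharsets.UTF_8)) {
            System.out.printf("%02x ", b & 0xFF);  // prints: e2 80 98
        }
        System.out.println();
    }
}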

Your packet ends with .R.i.f....| whereas it should contain your ā‘. And the error shows Server '...', Line 1.

Try to find out the encoding of your source file (check for big vs. little endian too) and convert the file to a known Unicode format.

Shnugo
  • Thanks Shnugo. Added some more info to my question. The source is UTF-8. MS SQL Server is UCS2LE/UTF16LE . Freetds is supposed to use iconv internally to convert for you... – Neil McGuigan Aug 16 '16 at 21:32
  • @NeilMcGuigan I don't know your tools... Maybe someone else is jumping in to help. I'd try to convert the source file to another format just to find out, if this is the reason. Good luck! – Shnugo Aug 16 '16 at 21:37
  • @NeilMcGuigan, You did not come back to this, so this might be solved already... If not: You might want to read this: http://stackoverflow.com/q/36402353/5089204 Many tools which **can read and de-code** a *normal* string properly run into problems with characters that need **3 bytes** for their UTF-8 encoding. – Shnugo Aug 22 '16 at 09:58
  • `00 52` is _not_ utf8; it is probably UCS2 (Unicode) for `R`. However, `00 01 01 18` does not agree with `ā‘`. It is possible, even somewhat straightforward, to convert between UCS2 and utf8, but it is not possible to mix them in the same string. – Rick James Sep 09 '16 at 22:52

This might solve it:

In your /etc/freetds/freetds.conf

add:

client charset = UTF-8

Also found this about the use utf-16 flag:

use utf-16: Instead of using UCS-2 for database-wide character encoding, use UTF-16. Newer Windows versions use this encoding instead of UCS-2. This could result in some issues if clients assume that a character is always 2 bytes.
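Putting the two settings together, a minimal sketch of the relevant freetds.conf section (both option names come from this answer; whether use utf-16 is honored depends on your FreeTDS version):

[global]
        client charset = UTF-8
        use utf-16 = yes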

O_Z