24

I am doing a BULK INSERT into SQL Server and it is not inserting UTF-8 characters into the database properly. The data file contains these characters, but the database rows contain garbage characters after the bulk insert executes.

My first suspect was the last line of the format file:

10.0
3
1 SQLCHAR  0  0  "{|}"  1 INSTANCEID ""
2 SQLCHAR  0  0  "{|}"  2 PROPERTYID ""
3 SQLCHAR  0  0  "[|]"  3 CONTENTTEXT "SQL_Latin1_General_CP1_CI_AS"

But, after reading this official page, it seems to me that this is actually a bug in how the insert operation reads the data file in SQL Server 2008. We are using version 2008 R2.

What is the solution to this problem or at least a workaround?

Goran Jovic

13 Answers

63

I came here before, looking for a solution for bulk inserting special characters. I didn't like the workaround with UTF-16 (which would double the size of the CSV file). I found out that you definitely CAN do it, and it's very easy: you don't need a format file. This answer is for other people who are looking for the same thing, since it doesn't seem to be documented well anywhere, and I believe this is a very common issue for non-English-speaking people. The solution is: just add CODEPAGE='65001' inside the WITH statement of the BULK INSERT (65001 is the codepage number for UTF-8). It might not work for all Unicode characters, as suggested by Michael O, but it works perfectly at least for Latin Extended, Greek and Cyrillic, and probably many others too.

Note: The MSDN documentation says UTF-8 is not supported; don't believe it. For me this works perfectly in SQL Server 2008 (I didn't try other versions, however).

e.g.:

BULK INSERT #myTempTable 
FROM  'D:\somefolder\myCSV.txt'
WITH 
    ( 
        CODEPAGE = '65001',
        FIELDTERMINATOR = '|',
        ROWTERMINATOR ='\n'
    );

If all your special characters are in 160-255 (ISO-8859-1 or Windows-1252), you could also use:

BULK INSERT #myTempTable 
FROM  'D:\somefolder\myCSV.txt'
WITH 
    ( 
        CODEPAGE = 'ACP',
        FIELDTERMINATOR = '|',
        ROWTERMINATOR ='\n'
    );
Tom-K
  • With SQL Server 2014, I get an error when trying to use bulk insert with codepage 65001: `The code page 65001 is not supported by the server.`. Either way the data I need to import is in Arabic, so I wasn't expecting this to work anyway. – easuter Mar 26 '15 at 16:18
  • code page 65001 (UTF-8 encoding) is supported in SQL2016, and SQL2014 SP2 – Kristen May 14 '17 at 13:45
  • There's a related hotfix for SQL 2014 here: https://support.microsoft.com/en-gb/help/3136780/utf-8-encoding-support-for-the-bcp-utility-and-bulk-insert-transact-sq - i.e. this is the fix within SP2 mentioned by Kristen above. – JohnLBevan May 21 '18 at 10:36
33

You can't. You should first use an N type data field, convert your file to UTF-16 and then import it. The database does not support UTF-8.
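
For illustration, here is a minimal sketch of that approach, assuming a throwaway staging table and a data file that has already been re-saved as UTF-16 (the table, columns and file name are placeholders borrowed from the question's format file, not the answerer's own code):

CREATE TABLE #myTempTable (
    INSTANCEID  nvarchar(50),   -- N types (nchar/nvarchar) are needed to hold Unicode data
    PROPERTYID  nvarchar(50),
    CONTENTTEXT nvarchar(max)
);

BULK INSERT #myTempTable
FROM 'D:\somefolder\myCSV_utf16.txt'    -- hypothetical copy of the file re-encoded as UTF-16
WITH
(
    DATAFILETYPE = 'widechar',          -- tells BULK INSERT the data file is UTF-16
    FIELDTERMINATOR = '|',
    ROWTERMINATOR = '\n'
);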

Michael-O
  • I was afraid of that. Thanks for confirmation – Goran Jovic Jul 27 '11 at 17:02
  • I had some serious trouble while setting up a data warehouse with SQL Server 2008 and Analysis Services last year. I wanted to insert a huge CSV file into the database with bulk insert and after hours of trying, I realized that the database knows only [Unicode BMP](http://msdn.microsoft.com/en-us/library/ms186939%28v=SQL.100%29.aspx) which is a subset of UTF-16. So my entire file had to be recoded with `iconv` in Unix first, then the import went smoothly. See [here](http://msdn.microsoft.com/en-us/library/ms188365.aspx) arguments => CODEPAGE, code_page and DATAFILETYPE, widenative – Michael-O Jul 27 '11 at 17:19
  • That's pretty much my requirement. Thanks! You saved me a lot of time – Goran Jovic Jul 28 '11 at 09:00
  • code page 65001 (UTF-8 encoding) is supported in SQL2016, and SQL2014 SP2 (also mentioned in other answers, but they are well down the page and might be missed) – Kristen May 14 '17 at 13:45
  • I have SQL 2016. I open the file in Notepad++ and in the Encoding tab I see which encoding was recognized, e.g. Windows-1255, so the code is `WITH ( CODEPAGE = '1255', FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW=2 )`, and it works perfectly! – AJ AJ Mar 17 '21 at 10:55
11
  1. In Excel, save the file as CSV (comma delimited)
  2. Open the saved CSV file in Notepad++
  3. Encoding -> Convert to UCS-2 Big Endian
  4. Save
BULK INSERT #tmpData
FROM 'C:\Book2.csv'
WITH
(
    FIRSTROW = 2,
    FIELDTERMINATOR = ';',  --CSV field delimiter
    ROWTERMINATOR = '\n',   --Use to shift the control to next row
    TABLOCK
);

Done.

Janis Rudovskis
  • In this case you should have all strings as **nvarchar** – beloblotskiy May 20 '15 at 21:47
  • I don't think it has to be big endian; it can be little endian instead, so either of the UTF-16 encodings is supported. Side points, as you probably know (though for the sake of any readers): you don't need TABLOCK, and CSVs normally use "," as the field separator. – barlop Mar 29 '18 at 10:51
7

Microsoft just added UTF-8 support to SQL Server 2014 SP2:

https://support.microsoft.com/en-us/kb/3136780

Jon Morisi
4

You can re-encode the data file with UTF-16. That's what I did anyway.

sayap
  • Are you saying that it handles UTF-16 encoding properly but not UTF-8? If that's the only problem, it's cool. – Goran Jovic Apr 19 '11 at 11:55
2

Use these options: DATAFILETYPE = 'char' and CODEPAGE = '1252'.
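
For example, a quick sketch of how those two options might be combined (the temp table and path are just the placeholders used in the earlier answers, not this answerer's actual code):

BULK INSERT #myTempTable
FROM 'D:\somefolder\myCSV.txt'
WITH
(
    DATAFILETYPE = 'char',   -- single-byte character data
    CODEPAGE = '1252',       -- Windows-1252 (Western European)
    FIELDTERMINATOR = '|',
    ROWTERMINATOR = '\n'
);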

Heber
2

Note that as of Microsoft SQL Server 2016, UTF-8 is supported by bcp, BULK_INSERT (as was part of the original question), and OPENROWSET.
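
For instance, on SQL Server 2016 (or 2014 SP2, as mentioned above) something along these lines should read a UTF-8 file via OPENROWSET; the data file and format file paths are placeholders, and CODEPAGE = '65001' selects UTF-8:

SELECT *
FROM OPENROWSET(
    BULK 'D:\somefolder\myCSV.txt',
    FORMATFILE = 'D:\somefolder\myCSV.fmt',  -- hypothetical non-XML format file for the CSV
    CODEPAGE = '65001'                       -- UTF-8
) AS src;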

Charles Burns
1

Thought I would add my thoughts to this. We were trying to load data into SQL Server using bcp and had lots of trouble.

bcp does not, in most versions, support any type of UTF-8 files. We discovered that UTF-16 would work, but it is more complex than is shown in these posts.

Using Java we wrote the file using this code:

PrintStream fileStream = new PrintStream(NEW_TABLE_DATA_FOLDER + fileName, "x-UTF-16LE-BOM");

This gave us the correct data to insert: UTF-16 little-endian, with a byte order mark (BOM).

We tried using plain UTF-16 and kept getting EOF errors. This was because we were missing the BOM at the start of the file. From Wikipedia:

In UTF-16, a BOM (U+FEFF) may be placed as the first character of a file or character stream to indicate the endianness (byte order) of all the 16-bit code units of the file or stream.

If these bytes are not present, the file won't work. So we have the file, but there is one more secret that needs to be addressed. When constructing your command line you must include -w to tell bcp what type of data it is. When using just English data, you can use -c (character). So that will look something like this:

bcp dbo.blah in C:\Users\blah\Desktop\events\blah.txt -S tcp:databaseurl,someport -d thedatabase -U username -P password -w

When this is all done you get some sweet looking data!

Good little endian!

markthegrea
  • Your language codes are wrong. As per the ISO standard, language codes are always lowercase, but country codes are uppercase. – Michael-O Jun 12 '17 at 08:40
1

Shouldn't you be using SQLNCHAR instead of SQLCHAR for the Unicode data?
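
For reference, a rough sketch of what the question's format file might look like with SQLNCHAR instead of SQLCHAR (this is only an illustration; SQLNCHAR describes Unicode character data in the host file, so it assumes the data file itself is saved as UTF-16/widechar rather than UTF-8):

10.0
3
1 SQLNCHAR  0  0  "{|}"  1 INSTANCEID ""
2 SQLNCHAR  0  0  "{|}"  2 PROPERTYID ""
3 SQLNCHAR  0  0  "[|]"  3 CONTENTTEXT "SQL_Latin1_General_CP1_CI_AS"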

AdaTheDev
1

Just to share: I had a similar problem. I had Portuguese accents in a file and bcp imported garbage chars (e.g. À became ┴). I tried -C with almost all codepages without success. After hours I found a hint on the bcp MS help page:

Format file codepages take priority over the -C attribute.

This means that in the format file I had to use "" (as in the LastName row below); once I changed that codepage entry, the -C 65001 attribute imported the UTF-8 file without any problem.

13.0
4
1       SQLCHAR             0       7       ","      1     PersonID               ""
2       SQLCHAR             0       25      ","      2     FirstName              SQL_Latin1_General_CP1_CI_AS
3       SQLCHAR             0       30      ","      3     LastName               ""
4       SQLCHAR             0       11      "\r\n"   4     BirthDate              ""
Martin Lietz
0

I managed to do this using SSIS and an ADO NET destination instead of OLEDB.

JYatesDBA
0

My exported data is in TSV format, from a DB which has Latin-1 encoding.

This is easy to check: SELECT DATABASEPROPERTYEX('DB', 'Collation') SQLCollation;

The extract file is in UTF-8 format.

BULK INSERT doesn't work with UTF-8, so I convert UTF-8 to ISO-8859-1 (aka Latin-1) with a simple Clojure script:

;; read the UTF-8 dump and write it back out re-encoded as ISO-8859-1 (Latin-1)
(spit ".\\dump\\file1.txt" (slurp ".\\dump\\file1_utf8.txt" :encoding "UTF-8") :encoding "ISO-8859-1")

To execute: correct the paths and run java.exe -cp clojure-1.6.0.jar clojure.main utf8_to_Latin1.clj

beloblotskiy
0

I have tested the bulk insertion with UTF-8 format. It works fine in SQL Server 2012.

string bulkInsertQuery = @"DECLARE @BulkInsertQuery NVARCHAR(max) = 'bulk insert [dbo].[temp_Lz_Post_Obj_Lvl_0]
                                      FROM ''C:\\Users\\suryan\\Desktop\\SIFT JOB\\New folder\\POSTdata_OBJ5.dat''
                                      WITH ( FIELDTERMINATOR =  '''+ CHAR(28) + ''', ROWTERMINATOR = ''' +CHAR(10) + ''')'
                                      EXEC SP_EXECUTESQL @BulkInsertQuery";

I was using a *.DAT file with FS (CHAR(28), the file separator character) as the column separator.

Irshad
  • Where did you specify the UTF-8 encoding in your code (e.g. codepage 65001)? Or did you just upload a UTF-8 file without specifying the codepage? – R Yoda Feb 03 '16 at 07:29