24

I am doing a BULK INSERT into SQL Server and it is not inserting UTF-8 characters into the database properly. The data file contains these characters, but the database rows contain garbage characters after the bulk insert executes.

My first suspect was the last line of the format file:

10.0
3
1 SQLCHAR  0  0  "{|}"  1 INSTANCEID ""
2 SQLCHAR  0  0  "{|}"  2 PROPERTYID ""
3 SQLCHAR  0  0  "[|]"  3 CONTENTTEXT "SQL_Latin1_General_CP1_CI_AS"

But, after reading this official page, it seems to me that this is actually a bug in how the insert operation reads the data file in SQL Server 2008. We are using version 2008 R2.

What is the solution to this problem or at least a workaround?

Goran Jovic

13 Answers

63

I came here before, looking for a solution for bulk inserting special characters. I didn't like the workaround with UTF-16 (which would double the size of the CSV file). I found out that you definitely CAN do it, and it's very easy: you don't need a format file. This answer is for other people who are looking for the same thing, since it doesn't seem to be documented well anywhere, and I believe this is a very common issue for non-English-speaking people. The solution is: just add CODEPAGE='65001' inside the WITH statement of the BULK INSERT (65001 is the codepage number for UTF-8). It might not work for all Unicode characters, as suggested by Michael O, but it works perfectly at least for Latin Extended, Greek and Cyrillic, and probably many others too.

Note: The MSDN documentation says UTF-8 is not supported; don't believe it. For me this works perfectly in SQL Server 2008 (I didn't try other versions, however).

e.g.:

BULK INSERT #myTempTable 
FROM  'D:\somefolder\myCSV.txt'
WITH 
    ( 
        CODEPAGE = '65001',
        FIELDTERMINATOR = '|',
        ROWTERMINATOR ='\n'
    );

If all your special characters are in 160-255 (ISO-8859-1 or Windows-1252), you could also use:

BULK INSERT #myTempTable 
FROM  'D:\somefolder\myCSV.txt'
WITH 
    ( 
        CODEPAGE = 'ACP',
        FIELDTERMINATOR = '|',
        ROWTERMINATOR ='\n'
    );
Tom-K
  • With SQL Server 2014, I get an error when trying to use bulk insert with codepage 65001: `The code page 65001 is not supported by the server.`. Either way the data I need to import is in Arabic, so I wasn't expecting this to work anyway. – easuter Mar 26 '15 at 16:18
  • code page 65001 (UTF-8 encoding) is supported in SQL2016, and SQL2014 SP2 – Kristen May 14 '17 at 13:45
  • There's a related hotfix for SQL 2014 here: https://support.microsoft.com/en-gb/help/3136780/utf-8-encoding-support-for-the-bcp-utility-and-bulk-insert-transact-sq - i.e. this is the fix within SP2 mentioned by Kristen above. – JohnLBevan May 21 '18 at 10:36
33

You can't. You should first use an N type data field, convert your file to UTF-16 and then import it. The database does not support UTF-8.
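
For illustration, here is a minimal sketch of that approach, assuming a throwaway staging table and a data file that has already been re-saved as UTF-16 (the table, columns and file name are placeholders borrowed from the question's format file, not the answerer's own code):

CREATE TABLE #myTempTable (
    INSTANCEID  nvarchar(50),   -- N types (nchar/nvarchar) are needed to hold Unicode data
    PROPERTYID  nvarchar(50),
    CONTENTTEXT nvarchar(max)
);

BULK INSERT #myTempTable
FROM 'D:\somefolder\myCSV_utf16.txt'    -- hypothetical copy of the file re-encoded as UTF-16
WITH
(
    DATAFILETYPE = 'widechar',          -- tells BULK INSERT the data file is UTF-16
    FIELDTERMINATOR = '|',
    ROWTERMINATOR = '\n'
);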

Michael-O
  • I was afraid of that. Thanks for confirmation – Goran Jovic Jul 27 '11 at 17:02
  • I had some serious trouble while setting up a data warehouse with SQL Server 2008 and Analysis Services last year. I wanted to insert a huge CSV file into the database with bulk insert and after hours of trying, I realized that the database knows only [Unicode BMP](http://msdn.microsoft.com/en-us/library/ms186939%28v=SQL.100%29.aspx) which is a subset of UTF-16. So my entire file had to be recoded with `iconv` in Unix first, then the import went smoothly. See [here](http://msdn.microsoft.com/en-us/library/ms188365.aspx) arguments => CODEPAGE, code_page and DATAFILETYPE, widenative – Michael-O Jul 27 '11 at 17:19
  • That's pretty much my requirement. Thanks! You saved me a lot of time – Goran Jovic Jul 28 '11 at 09:00
  • code page 65001 (UTF-8 encoding) is supported in SQL2016, and SQL2014 SP2 (also mentioned in other answers, but they are well down the page and might be missed) – Kristen May 14 '17 at 13:45
  • I have SQL 2016. I open the file in Notepad++ and in the Encoding tab I see which encoding was recognized, e.g. Windows-1255, so the code is `WITH ( CODEPAGE = '1255', FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW=2 )`, and it works perfectly! – AJ AJ Mar 17 '21 at 10:55
11
  1. In Excel, save the file as CSV (comma delimited)
  2. Open the saved CSV file in Notepad++
  3. Encoding -> Convert to UCS-2 Big Endian
  4. Save
BULK INSERT #tmpData
FROM 'C:\Book2.csv'
WITH
(
    FIRSTROW = 2,
    FIELDTERMINATOR = ';',  --CSV field delimiter
    ROWTERMINATOR = '\n',   --Use to shift the control to next row
    TABLOCK
);

Done.

Janis Rudovskis
  • In this case you should have all strings as **nvarchar** – beloblotskiy May 20 '15 at 21:47
  • I don't think it has to be big endian; it can be little endian instead, so either of the UTF-16 encodings is supported. Side points, as you probably know (though for the sake of any readers): you don't need TABLOCK, and CSVs normally use "," as the field separator. – barlop Mar 29 '18 at 10:51
7

Microsoft just added UTF-8 support to SQL Server 2014 SP2:

https://support.microsoft.com/en-us/kb/3136780

Jon Morisi
4

You can re-encode the data file with UTF-16. That's what I did anyway.

sayap
  • Are you saying that it handles UTF-16 encoding properly but not UTF-8? If that's the only problem, it's cool. – Goran Jovic Apr 19 '11 at 11:55
2

Use these options: DATAFILETYPE = 'char' and CODEPAGE = '1252'.
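
For example, a quick sketch of how those two options might be combined (the temp table and path are just the placeholders used in the earlier answers, not this answerer's actual code):

BULK INSERT #myTempTable
FROM 'D:\somefolder\myCSV.txt'
WITH
(
    DATAFILETYPE = 'char',   -- single-byte character data
    CODEPAGE = '1252',       -- Windows-1252 (Western European)
    FIELDTERMINATOR = '|',
    ROWTERMINATOR = '\n'
);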

Heber
2

Note that as of Microsoft SQL Server 2016, UTF-8 is supported by bcp, BULK_INSERT (as was part of the original question), and OPENROWSET.
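
For instance, on SQL Server 2016 (or 2014 SP2, as mentioned above) something along these lines should read a UTF-8 file via OPENROWSET; the data file and format file paths are placeholders, and CODEPAGE = '65001' selects UTF-8:

SELECT *
FROM OPENROWSET(
    BULK 'D:\somefolder\myCSV.txt',
    FORMATFILE = 'D:\somefolder\myCSV.fmt',  -- hypothetical non-XML format file for the CSV
    CODEPAGE = '65001'                       -- UTF-8
) AS src;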

Charles Burns
1

Thought I would add my thoughts to this. We were trying to load data into SQL Server using bcp and had lots of trouble.

bcp does not, in most versions, support any type of UTF-8 files. We discovered that UTF-16 would work, but it is more complex than is shown in these posts.

Using Java we wrote the file using this code:

PrintStream fileStream = new PrintStream(NEW_TABLE_DATA_FOLDER + fileName, "x-UTF-16LE-BOM");

This gave us the correct data to insert: UTF-16 little-endian, with a byte order mark (BOM).

We tried using plain UTF-16 and kept getting EOF errors. This was because we were missing the BOM at the start of the file. From Wikipedia:

In UTF-16, a BOM (U+FEFF) may be placed as the first character of a file or character stream to indicate the endianness (byte order) of all the 16-bit code units of the file or stream.

If these bytes are not present, the file won't work. So we have the file, but there is one more secret that needs to be addressed. When constructing your command line you must include -w to tell bcp what type of data it is. When using just English data, you can use -c (character). So that will look something like this:

bcp dbo.blah in C:\Users\blah\Desktop\events\blah.txt -S tcp:databaseurl,someport -d thedatabase -U username -P password -w

When this is all done you get some sweet looking data!

Good little endian!

markthegrea
  • Your language codes are wrong. As per the ISO standard, language codes are always lowercase, but country codes are uppercase. – Michael-O Jun 12 '17 at 08:40
1

Shouldn't you be using SQLNCHAR instead of SQLCHAR for the Unicode data?
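
For reference, a rough sketch of what the question's format file might look like with SQLNCHAR instead of SQLCHAR (this is only an illustration; SQLNCHAR describes Unicode character data in the host file, so it assumes the data file itself is saved as UTF-16/widechar rather than UTF-8):

10.0
3
1 SQLNCHAR  0  0  "{|}"  1 INSTANCEID ""
2 SQLNCHAR  0  0  "{|}"  2 PROPERTYID ""
3 SQLNCHAR  0  0  "[|]"  3 CONTENTTEXT "SQL_Latin1_General_CP1_CI_AS"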

AdaTheDev
1

Just to share: I had a similar problem. I had Portuguese accents in a file and bcp imported garbage chars (e.g. À became ┴). I tried -C with almost all codepages without success. After hours I found a hint on the bcp MS help page:

Format file codepages take priority over the -C attribute.

This means that in the format file I had to use "" (as in the LastName row below); once I changed that codepage entry, the -C 65001 attribute imported the UTF-8 file without any problem.

13.0
4
1       SQLCHAR             0       7       ","      1     PersonID               ""
2       SQLCHAR             0       25      ","      2     FirstName              SQL_Latin1_General_CP1_CI_AS
3       SQLCHAR             0       30      ","      3     LastName               ""
4       SQLCHAR             0       11      "\r\n"   4     BirthDate              ""
Martin Lietz
0

I managed to do this using SSIS and an ADO NET destination instead of OLEDB.

JYatesDBA
0

My exported data is in TSV format, from a DB which has Latin-1 encoding.

This is easy to check: SELECT DATABASEPROPERTYEX('DB', 'Collation') SQLCollation;

The extract file is in UTF-8 format.

BULK INSERT doesn't work with UTF-8, so I convert UTF-8 to ISO-8859-1 (aka Latin-1) with a simple Clojure script:

;; read the UTF-8 dump and write it back out re-encoded as ISO-8859-1 (Latin-1)
(spit ".\\dump\\file1.txt" (slurp ".\\dump\\file1_utf8.txt" :encoding "UTF-8") :encoding "ISO-8859-1")

To execute: correct the paths and run java.exe -cp clojure-1.6.0.jar clojure.main utf8_to_Latin1.clj

beloblotskiy
0

I have tested the bulk insertion with UTF-8 format. It works fine in SQL Server 2012.

string bulkInsertQuery = @"DECLARE @BulkInsertQuery NVARCHAR(max) = 'bulk insert [dbo].[temp_Lz_Post_Obj_Lvl_0]
                                      FROM ''C:\\Users\\suryan\\Desktop\\SIFT JOB\\New folder\\POSTdata_OBJ5.dat''
                                      WITH ( FIELDTERMINATOR =  '''+ CHAR(28) + ''', ROWTERMINATOR = ''' +CHAR(10) + ''')'
                                      EXEC SP_EXECUTESQL @BulkInsertQuery";

I was using a *.DAT file with FS (CHAR(28), the file separator character) as the column separator.

Irshad
  • Where did you specify the UTF-8 encoding in your code (e.g. codepage 65001)? Or did you just upload a UTF-8 file without specifying the codepage? – R Yoda Feb 03 '16 at 07:29