2

I'm trying to import a .CSV file into a SQL Server table using a BULK INSERT with a format file. I can get it to import, but any Latin characters are imported as strange characters. I take a lot of pride in completing this personal project on my own, but I've hit a point where I just need help. I can change the characters by doing some messy UPDATE and REPLACE statements after importing the data, but I really want to be able to import the Latin characters as they appear in the .CSV file in one step. Here's the database and table that I created:

CREATE DATABASE Test;

CREATE TABLE dbo.rawData
    ([Position] nvarchar(500) NULL,
    [Const] nvarchar(500) NULL,
    [Created] nvarchar(500) NULL,
    [Modified] nvarchar(500) NULL,
    [Description] nvarchar(500) NULL,
    [Title] nvarchar(500) NOT NULL,
    [TitleType] nvarchar(500) NULL,
    [Directors] nvarchar(500) NULL,
    [YouRated] nvarchar(500) NULL,
    [IMDbRating] nvarchar(500) NULL,
    [Runtime] nvarchar(500) NULL,
    [Year] nvarchar(500) NULL,
    [Genres] nvarchar(500) NULL,
    [NumVotes] nvarchar(500) NULL,
    [ReleaseDate] nvarchar(500) NULL, 
    [URL] nvarchar(500) NULL,
    )
GO

And here is some of the data that I'm working with taken from a .CSV file (saved as ratings.csv). I use Notepad++ and it is encoded in UTF-8. Notice how the last row for "Dallas Buyers Club" has a director with a Latin character in his name:

"position","const","created","modified","description","Title","Title type","Directors","You rated","IMDb Rating","Runtime (mins)","Year","Genres","Num. Votes","Release Date (month/day/year)","URL"
"1","tt0437863","Tue Feb 16 00:00:00 2016","","","The Benchwarmers","Feature Film","Dennis Dugan","5","5.6","80","2006","comedy, romance, sport","39413","2006-04-07","http://www.imdb.com/title/tt0437863/"
"2","tt0085334","Tue Feb 16 00:00:00 2016","","","A Christmas Story","Feature Film","Bob Clark","6","8.1","94","1983","comedy, family","103770","1983-11-18","http://www.imdb.com/title/tt0085334/"
"3","tt2403029","Tue Feb 16 00:00:00 2016","","","The Starving Games","Feature Film","Jason Friedberg, Aaron Seltzer","2","3.3","83","2013","comedy","13719","2013-10-31","http://www.imdb.com/title/tt2403029/"
"4","tt0316465","Tue Feb 16 00:00:00 2016","","","Radio","Feature Film","Michael Tollin","6","6.9","109","2003","biography, drama, sport","31692","2003-10-24","http://www.imdb.com/title/tt0316465/"
"5","tt0141369","Tue Feb 16 00:00:00 2016","","","Inspector Gadget","Feature Film","David Kellogg","4","4.1","78","1999","action, adventure, comedy, family, sci_fi","35340","1999-07-18","http://www.imdb.com/title/tt0141369/"
"6","tt0033563","Tue Feb 16 00:00:00 2016","","","Dumbo","Feature Film","Sam Armstrong, Norman Ferguson","6","7.3","64","1941","animation, family, musical","80737","1941-10-23","http://www.imdb.com/title/tt0033563/"
"7","tt0384642","Tue Feb 16 00:00:00 2016","","","Kicking & Screaming","Feature Film","Jesse Dylan","5","5.5","95","2005","comedy, family, romance, sport","29539","2005-05-01","http://www.imdb.com/title/tt0384642/"
"8","tt0116705","Tue Feb 16 00:00:00 2016","","","Jingle All the Way","Feature Film","Brian Levant","7","5.4","89","1996","comedy, family","66879","1996-11-16","http://www.imdb.com/title/tt0116705/"
"9","tt1981677","Tue Feb 16 00:00:00 2016","","","Pitch Perfect","Feature Film","Jason Moore","7","7.2","112","2012","comedy, music, romance","203205","2012-09-28","http://www.imdb.com/title/tt1981677/"
"10","tt0409459","Tue Feb 16 00:00:00 2016","","","Watchmen","Feature Film","Zack Snyder","7","7.6","162","2009","action, mystery, sci_fi","368137","2009-02-23","http://www.imdb.com/title/tt0409459/"
"11","tt1343092","Tue Feb 16 00:00:00 2016","","","The Great Gatsby","Feature Film","Baz Luhrmann","5","7.3","143","2013","drama, romance","345664","2013-05-01","http://www.imdb.com/title/tt1343092/"
"12","tt0332379","Tue Feb 16 00:00:00 2016","","","School of Rock","Feature Film","Richard Linklater","5","7.1","108","2003","comedy, music","202083","2003-09-09","http://www.imdb.com/title/tt0332379/"
"13","tt0120783","Tue Feb 16 00:00:00 2016","","","The Parent Trap","Feature Film","Nancy Meyers","6","6.4","128","1998","adventure, comedy, drama, family, romance","82087","1998-07-20","http://www.imdb.com/title/tt0120783/"
"14","tt0790636","Tue Feb 16 00:00:00 2016","","","Dallas Buyers Club","Feature Film","Jean-Marc Vallée","7","8.0","117","2013","biography, drama","308118","2013-09-07","http://www.imdb.com/title/tt0790636/"

I have a format file (saved as format.fmt), that when opened in Notepad++ looks like this:

11.0
16
1       SQLCHAR             0       1000    "\",\""    1     Position                   SQL_Latin1_General_CP1_CI_AS
2       SQLCHAR             0       1000    "\",\""    2     Const                      SQL_Latin1_General_CP1_CI_AS
3       SQLCHAR             0       1000    "\",\""    3     Created                    SQL_Latin1_General_CP1_CI_AS
4       SQLCHAR             0       1000    "\",\""    4     Modified                   SQL_Latin1_General_CP1_CI_AS
5       SQLCHAR             0       1000    "\",\""    5     Description                SQL_Latin1_General_CP1_CI_AS
6       SQLCHAR             0       1000    "\",\""    6     Title                      SQL_Latin1_General_CP1_CI_AS
7       SQLCHAR             0       1000    "\",\""    7     TitleType                  SQL_Latin1_General_CP1_CI_AS
8       SQLCHAR             0       1000    "\",\""    8     Directors                  SQL_Latin1_General_CP1_CI_AS
9       SQLCHAR             0       1000    "\",\""    9     YouRated                   SQL_Latin1_General_CP1_CI_AS
10      SQLCHAR             0       1000    "\",\""    10    IMDbRating                 SQL_Latin1_General_CP1_CI_AS
11      SQLCHAR             0       1000    "\",\""    11    Runtime                    SQL_Latin1_General_CP1_CI_AS
12      SQLCHAR             0       1000    "\",\""    12    Year                       SQL_Latin1_General_CP1_CI_AS
13      SQLCHAR             0       1000    "\",\""    13    Genres                     SQL_Latin1_General_CP1_CI_AS
14      SQLCHAR             0       1000    "\",\""    14    NumVotes                   SQL_Latin1_General_CP1_CI_AS
15      SQLCHAR             0       1000    "\",\""    15    ReleaseDate                SQL_Latin1_General_CP1_CI_AS
16      SQLCHAR             0       1000    "\""     16    URL                        SQL_Latin1_General_CP1_CI_AS

When I run the following code, everything imports, however the Latin characters are replaced with a series of strange characters. Here is the code that I'm running:

BULK INSERT [Test].[dbo].[rawData]
FROM 'C:\IMDbRatings\Files\ratings.csv' WITH (FIRSTROW = 2, FORMATFILE= 'C:\IMDbRatings\format.fmt');

A few things I tried were changing the .CSV file to UCS-2 BE, adding different conditions in the WITH clause of the BULK INSERT, and changing the variable type in the format file to SQLNCHAR instead of SQLCHAR, but nothing worked. Often what happens in these cases is "0 rows are affected", rather than an error. Any help would be so appreciated.

Walker
  • 153
  • 2
  • 9
  • 1
    And what collation is your database, table and column with that "Jean-Marc Vallée" name? – Whencesoever Jun 21 '16 at 20:56
  • All three are using SQL_Latin1_General_CP1_CI_AS – Walker Jun 21 '16 at 21:03
  • 1
    didn't you post this exact same question a few days ago? does it have to be bulkinsert for the method of importing? do you have SSIS available to you or??? – Matt Jun 21 '16 at 22:09
  • 1
    did you try saving the file in notepad++ in Code page 1252 and importing? – Matt Jun 21 '16 at 22:13
  • 1
    http://stackoverflow.com/questions/13996967/special-characters-displaying-incorrectly-after-bulk-insert this suggests using CODEPAGE = 'ACP' in your Bulkinsert statement – Matt Jun 21 '16 at 22:14
  • @Matt Thanks for the response. I'm not familiar with SSIS, so I didn't go that route. I will try CODEPAGE = 'ACP'. – Walker Jun 21 '16 at 22:19
  • @Matt I tried adding CODEPAGE = 'ACP', but it still leaves me with non-Latin characters. This problem is more complex than any of the questions I've seen pertaining to BULK INSERTs and collation, due to the complexity of adding a format file. – Walker Jun 21 '16 at 22:25

2 Answers2

4

I'm answering this old question in the hopes that it will save someone the headaches I recently went through.

Simply put: you should use the "" collation in your format file when inserting from a UTF-8 encoded file using code page 65001. You must have SQL Server 2016 for code page 65001 to be available.


Do the following:

  1. Specify that your bulk insert table is encoded in UTF-8 with CODEPAGE = 65001 in the bulk insert statement
  2. In your format file, specify character column types as SQLCHAR
  3. In your format file, use the "" collation for all columns

Bulk insert statement:

BULK INSERT [Test].[dbo].[rawData]
FROM 'C:\IMDbRatings\Files\ratings.csv'
WITH (CODEPAGE = 65001, FIRSTROW = 2, FORMATFILE= 'C:\IMDbRatings\format.fmt');

Format file:

13.0
16
1       SQLCHAR             0       1000    "\",\""    1     Position                   ""
2       SQLCHAR             0       1000    "\",\""    2     Const                      ""
3       SQLCHAR             0       1000    "\",\""    3     Created                    ""
4       SQLCHAR             0       1000    "\",\""    4     Modified                   ""
5       SQLCHAR             0       1000    "\",\""    5     Description                ""
6       SQLCHAR             0       1000    "\",\""    6     Title                      ""
7       SQLCHAR             0       1000    "\",\""    7     TitleType                  ""
8       SQLCHAR             0       1000    "\",\""    8     Directors                  ""
9       SQLCHAR             0       1000    "\",\""    9     YouRated                   ""
10      SQLCHAR             0       1000    "\",\""    10    IMDbRating                 ""
11      SQLCHAR             0       1000    "\",\""    11    Runtime                    ""
12      SQLCHAR             0       1000    "\",\""    12    Year                       ""
13      SQLCHAR             0       1000    "\",\""    13    Genres                     ""
14      SQLCHAR             0       1000    "\",\""    14    NumVotes                   ""
15      SQLCHAR             0       1000    "\",\""    15    ReleaseDate                ""
16      SQLCHAR             0       1000    "\""     16    URL                        ""

On "" or RAW collation from https://technet.microsoft.com/en-us/library/ms190657(v=sql.105).aspx:

Specifies that the data is stored in the code page that is specified in a code-page option in the command or the bcp_control BCPFILECP hint. If none of these is specified, the collation of the data file is that of the OEM code page of the client computer.

1

@Walker I admit I never use bulk insert but tried to setup your test case and just keep getting incomplete or cannot be read format file which I have and have saved. Anyway, try changing the encoding to 1252 In Notepad++ that's Encoding --> Character Sets--> Western Eurpoean --> Windows-1252 save the file and try the import

also I just saw this article How to write UTF-8 characters using bulk insert in SQL Server? which is interesting and suggests UTF-8 is problem until SQL 2016. But one answer that caught my eye is SQLNCHAR vs SQLCHAR because I think you are storing Unicode data your which would mean you need to change your data types in your format file and table you have crated.

Community
  • 1
  • 1
Matt
  • 13,833
  • 2
  • 16
  • 28
  • For the format file or the rating.csv? I really appreciate your help here. – Walker Jun 21 '16 at 23:18
  • 1
    rating.csv I don't think the format file will make a difference – Matt Jun 21 '16 at 23:19
  • 1
    I wrote a little more too I think you need to modify data types to nchar so you can handle Unicode characters as char won't store them. – Matt Jun 22 '16 at 00:13
  • When I change it to SQLNCHAR and run under the same conditions, I see just Chinese characters for everything. http://imgur.com/a/BxYyN – Walker Jun 22 '16 at 16:00
  • 1
    That's weird for sure but definitely something going on in regards to encoding. Did you ever try setting the encoding to windows-1252 in notepad and importing? – Matt Jun 22 '16 at 16:02
  • I did. I noticed that when I selected windows-1252 it did not show an indicator that it is actually selected and it did not allow me to save the file after changing it. – Walker Jun 22 '16 at 16:11
  • 1
    Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/115329/discussion-between-matt-and-walker). – Matt Jun 22 '16 at 16:16