2

I have about 20 .csv files which are around 100-200 MB each.

They each have about 100 columns.

90% of the columns are the same across the files; however, some files have more columns and some files have fewer columns.

I need to import all of these files into one table in a SQL Server 2008 database.

If the field does not exist, I need it to be created.

Question: What should the process for this import be? How do I import all of these files into one table quickly and efficiently, making sure that any field that does not exist is created? Please also keep in mind that the same field might be in a different location: for example, CAR can be in field AB in one CSV, whereas the same field name (CAR) can be in field AC in another CSV file. The solution can be SQL or C# or both.

Alex Gordon

8 Answers

3

I would recommend looking at the BCP program, which comes with SQL Server and is intended to help with jobs just like this:

http://msdn.microsoft.com/en-us/library/aa337544.aspx

There are "format files" which allow you to specify which CSV columns go to which SQL columns.

If you are more inclined to use C#, have a look at the SqlBulkCopy class:

http://msdn.microsoft.com/en-us/library/system.data.sqlclient.sqlbulkcopy.aspx
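
For illustration, here is a minimal C# sketch of SqlBulkCopy with name-based column mappings; mapping by name rather than by position means it does not matter whether CAR is column AB in one file and column AC in another. The target table dbo.AllData and the step of parsing each CSV into a DataTable first are assumptions, not part of the original answer:

    using System.Data;
    using System.Data.SqlClient;

    static class CsvBulkLoader
    {
        // Loads one CSV file's rows (already parsed into a DataTable whose
        // column names come from the CSV header) into the target table.
        public static void BulkCopyCsv(DataTable csvData, string connectionString)
        {
            using (var bulk = new SqlBulkCopy(connectionString))
            {
                bulk.DestinationTableName = "dbo.AllData"; // hypothetical target table
                bulk.BatchSize = 5000;                     // tune for 100-200 MB files

                // Map by column NAME, not position, so CAR can be the 28th
                // column in one file and the 29th in another.
                foreach (DataColumn col in csvData.Columns)
                    bulk.ColumnMappings.Add(col.ColumnName, col.ColumnName);

                bulk.WriteToServer(csvData);
            }
        }
    }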

Also take a look at this SO thread about importing from CSV files into SQL Server:

SQL Bulk import from CSV

JohnD
3

You may choose from a number of options:

  1. Use the DTS package.
  2. Try to produce one uniform CSV file, get the db table in sync with its columns, and bulk insert it.
  3. Bulk insert every file to its own table, and after that merge the tables into the target table (a runnable version of this merge step follows the comments below).

Oleg Dok
  • I love your suggestion #3. Can you tell me how I would combine the three tables? – Alex Gordon Dec 20 '11 at 19:28
  • @OlegDok Thank you. Can you please show me how I would combine the tables? What would the syntax be? – Alex Gordon Dec 20 '11 at 19:37
  • Insert into target(colA, colB, colC) select null, colB, colC from importedA union all select colA, colB, colC from importedB union all select colA, null, null from importedC – Oleg Dok Dec 20 '11 at 19:44
  • If you need to skip duplicates, use union instead of union all – Oleg Dok Dec 20 '11 at 19:45
  • @I__: if you go the way of unions, use "union all", as "union" runs a distinct. This is what I would do once you've inserted: #1) add any extra columns to the destination table; #2) programmatically generate a "begin tran insert dest_table([columns1]....[columnN]) select [columns1]....[columnN] from src_table commit tran" – Bengie Dec 20 '11 at 19:48
  • @Bengie - I already noticed it in a separate comment, but thanks. I use separate comments just because I'm writing here from a mobile phone. It is painful :-) – Oleg Dok Dec 20 '11 at 19:50
  • @Oleg Dok: I think I mixed up the names and corrected it. Sorry about any confusion. I guess my confusion is contagious. – Bengie Dec 20 '11 at 19:59
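
Spelled out as runnable code, the consolidation from these comments might look like the sketch below; the staging table names (importedA, importedB, importedC), the column names, and the C# wrapper are illustrative assumptions:

    using System.Data.SqlClient;

    static class StagingMerger
    {
        // Consolidates three staging tables into the target, padding the
        // columns a given file lacked with NULL, as in the comments above.
        // UNION ALL keeps duplicate rows; plain UNION would de-duplicate.
        public static void MergeStagingTables(string connectionString)
        {
            const string mergeSql = @"
                INSERT INTO target (colA, colB, colC)
                SELECT NULL, colB, colC FROM importedA
                UNION ALL
                SELECT colA, colB, colC FROM importedB
                UNION ALL
                SELECT colA, NULL, NULL FROM importedC;";

            using (var conn = new SqlConnection(connectionString))
            using (var cmd = new SqlCommand(mergeSql, conn))
            {
                conn.Open();
                cmd.CommandTimeout = 0; // large loads can exceed the 30-second default
                cmd.ExecuteNonQuery();
            }
        }
    }
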
1

I recommend writing a small C# application that reads each CSV file's header, stores a dictionary of the columns needed, and either outputs a 'create table' statement or runs a create table operation directly against the database. Then you can use SQL Server Management Studio to load the 20 files individually using the import routine.
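
A hedged sketch of that approach follows; it reads only the header line of each file and types every column as NVARCHAR(255) for simplicity. The directory scan, the table name dbo.AllData, and the naive comma split (no quoted-field handling) are assumptions:

    using System;
    using System.Collections.Generic;
    using System.IO;

    static class SchemaBuilder
    {
        // Scans only the header line of each CSV and returns a CREATE TABLE
        // statement covering the union of all column names seen.
        public static string BuildCreateTable(string csvDirectory)
        {
            var columns = new HashSet<string>(StringComparer.OrdinalIgnoreCase);

            foreach (var file in Directory.GetFiles(csvDirectory, "*.csv"))
            {
                using (var reader = new StreamReader(file))
                {
                    var header = reader.ReadLine(); // header row only
                    if (header == null) continue;
                    foreach (var name in header.Split(','))
                        columns.Add(name.Trim());
                }
            }

            // Everything is NVARCHAR(255) for simplicity; tighten types later.
            var defs = new List<string>();
            foreach (var name in columns)
                defs.Add("[" + name + "] NVARCHAR(255) NULL");

            return "CREATE TABLE dbo.AllData (" + string.Join(", ", defs) + ");";
        }
    }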

Andrew Hanlon
1

Use the SqlBulkCopy class in System.Data.SqlClient.

It facilitates bulk data transfer. The only catch is that it won't work with DateTime DB columns.

Darshan
1

There are several possibilities that you have here.

  • You can use SSIS if it is available to you.
  • In SQL Server you can use SqlBulkCopy to bulk insert into a staging table that receives the whole .csv file, and then use a stored procedure, possibly with a MERGE statement in it, to place each row where it belongs or create a new one if it doesn't exist (a sketch of this step follows the list).
  • You can use C# code to read the files and write them using SqlBulkInsert or EntityDataReader.
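
To make the staging-table bullet concrete, here is a hedged sketch of the merge step; the table names, the Id key, and the single CAR column are placeholders rather than anything specified in the answer:

    using System.Data.SqlClient;

    static class StagingUpsert
    {
        // Runs a MERGE (available from SQL Server 2008 on) to upsert the
        // staging rows into the target. Table names, the Id key and the
        // CAR column are hypothetical.
        public static void MergeStagingIntoTarget(string connectionString)
        {
            const string mergeSql = @"
                MERGE dbo.AllData AS target
                USING dbo.Staging AS source
                    ON target.Id = source.Id
                WHEN MATCHED THEN
                    UPDATE SET target.CAR = source.CAR
                WHEN NOT MATCHED BY TARGET THEN
                    INSERT (Id, CAR) VALUES (source.Id, source.CAR);";

            using (var conn = new SqlConnection(connectionString))
            using (var cmd = new SqlCommand(mergeSql, conn))
            {
                conn.Open();
                cmd.ExecuteNonQuery();
            }
        }
    }
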
TheBoyan
1

Less of an answer and more of a direction, but here I go. The way I would do it is first enumerate the column names from both the CSV files and the DB, then make sure the ones from your CSV all exist in the destination.

Once you have validated and/or created all the columns, then you can do your bulk insert. Assuming you don't have multiple imports happening at the same time, you could cache the column names from the DB when you start the import, as they shouldn't be changing.
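
A possible sketch of that validation step, assuming a target table dbo.AllData and a policy of adding missing columns as NVARCHAR(255):

    using System;
    using System.Collections.Generic;
    using System.Data.SqlClient;

    static class ColumnSync
    {
        // Compares the CSV header names against the destination table's
        // current columns and issues ALTER TABLE ... ADD for any that are
        // missing.
        public static void EnsureColumnsExist(IEnumerable<string> csvColumns,
                                              string connectionString)
        {
            var existing = new HashSet<string>(StringComparer.OrdinalIgnoreCase);

            using (var conn = new SqlConnection(connectionString))
            {
                conn.Open();

                // Read the destination table's current columns once, up front.
                using (var cmd = new SqlCommand(
                    @"SELECT COLUMN_NAME FROM INFORMATION_SCHEMA.COLUMNS
                      WHERE TABLE_SCHEMA = 'dbo' AND TABLE_NAME = 'AllData'", conn))
                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                        existing.Add(reader.GetString(0));
                }

                // Add whatever the CSV has that the table lacks.
                foreach (var name in csvColumns)
                {
                    if (existing.Contains(name)) continue;
                    using (var alter = new SqlCommand(
                        "ALTER TABLE dbo.AllData ADD [" + name + "] NVARCHAR(255) NULL",
                        conn))
                    {
                        alter.ExecuteNonQuery();
                    }
                    existing.Add(name);
                }
            }
        }
    }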

If you will have multiple imports running at the same time, then you will need to make sure you have a full table lock during the import, as race conditions could show up.

I do a lot of automated imports for SQL DBs, and I haven't ever seen what you asked, as it's an assumed requirement that one knows the data that is coming into the DB. Not knowing columns ahead of time is typically a very bad thing, but it sounds like you have an exception to the rule.

Bengie
1

Roll your own.

Keep (or create) a runtime representation of the target table's columns in the database. Before importing each file, check to see if the column exists already. If it doesn't, run the appropriate ALTER statement. Then import the file.

The actual import process can and probably should be done by BCP or whatever Bulk protocol you have available. You will have to do some fancy kajiggering since the source data and destination align only logically, not physically. So you will need BCP format files.
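
As an illustration of those format files, the sketch below writes a minimal non-XML bcp format file from C#; the three field names and the fixed 255-character width are placeholders:

    using System.IO;

    static class BcpFormatFile
    {
        // Writes a minimal non-XML bcp format file. Each data line is:
        // field#, type, prefix length, max length, terminator,
        // destination column#, destination column name, collation.
        public static void Write(string path)
        {
            const string formatFile =
                "10.0\n" +  // format file version for SQL Server 2008
                "3\n" +     // number of fields in the CSV
                "1 SQLCHAR 0 255 \",\"     1 Id   SQL_Latin1_General_CP1_CI_AS\n" +
                "2 SQLCHAR 0 255 \",\"     2 CAR  SQL_Latin1_General_CP1_CI_AS\n" +
                "3 SQLCHAR 0 255 \"\\r\\n\" 3 Year SQL_Latin1_General_CP1_CI_AS\n";

            File.WriteAllText(path, formatFile);
        }
    }

The generated file is then passed to bcp with the -f switch, for example: bcp dbo.AllData in data.csv -f data.fmt -S myserver -T (server name hypothetical).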

Mark Canlas
0

For those data volumes, you should use an ETL. See this tutorial.

ETL tools are designed for manipulating large amounts of data.

Romias