
I am gathering data from multiple feeds, including APIs, Excel files, text files, and Word files. I am using a relational database to store all relationships. There are up to 10 one-to-many or many-to-many relationships.

The approach I am using is writing each entry into a .csv file and then calling a stored procedure to bulk insert all of the entries. In this case I can end up with 10 separate files, one for each table in my database.
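Roughly, the flow looks like this in C# (the table, column, and procedure names below are made up just to show the shape of it):

using System;
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
using System.IO;

class CsvBulkLoader
{
    // Write one table's rows to a CSV that matches the bulk-insert format.
    public static void WritePersonCsv(string path, IEnumerable<(Guid Id, string Name)> people)
    {
        using (var writer = new StreamWriter(path))
        {
            foreach (var p in people)
                writer.WriteLine($"{p.Id},{p.Name}"); // comma field terminator, newline row terminator
        }
    }

    // Call a stored procedure on the server that runs BULK INSERT on that file.
    // The path must be one the SQL Server itself can read (hence the file transfer problem).
    public static void BulkInsertViaProc(string connectionString, string serverVisiblePath)
    {
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand("dbo.usp_BulkInsertPerson", conn)) // hypothetical proc name
        {
            cmd.CommandType = CommandType.StoredProcedure;
            cmd.Parameters.AddWithValue("@FilePath", serverVisiblePath);
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}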

There are two problems I ran into:

  • Transferring the files over to the database server (same network)
  • Primary keys: I need to use GUIDs instead of auto-increment

What is the best approach for performance?

MIKE
  • possible duplicate of [Insert 2 million rows into SQL Server quickly](http://stackoverflow.com/questions/13722014/insert-2-million-rows-into-sql-server-quickly) – Charles Mager Jun 23 '15 at 16:02
  • He already has a text file. I don't. – MIKE Jun 23 '15 at 16:13
  • I'll admit I didn't read the question in too much detail - I was looking for an example question that used `SqlBulkCopy`. If you're trying to do this from .NET, that's what you should be looking at. There will be other questions / tutorials that will explain how it can be used. – Charles Mager Jun 23 '15 at 16:16
  • What in .NET should I look at? I already know how to do this in 4 different ways using .NET. No one seems to have addressed my approach. Is there a flaw with my approach? – MIKE Jun 23 '15 at 16:24

2 Answers


2 words: BULK INSERT

If you already have a CSV file, this is simply a case of writing some SQL or C# (whichever you prefer) to execute a bulk insert.

Here are the SQL docs: https://msdn.microsoft.com/en-gb/library/ms188365.aspx

BULK INSERT MySchema.MyTable
FROM 'c:\myfile.csv'
WITH 
  (
     FIELDTERMINATOR =',',
     ROWTERMINATOR ='\n'
  );

And the C# docs: https://msdn.microsoft.com/en-us/library/system.data.sqlclient.sqlbulkcopy%28v=vs.110%29.aspx
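A minimal SqlBulkCopy sketch along the lines of those docs; the destination table name and batch size are placeholders you would adjust:

using System.Data;
using System.Data.SqlClient;

class BulkCopyExample
{
    public static void Load(string connectionString, DataTable rows)
    {
        // Stream the in-memory rows straight into the destination table.
        using (var bulk = new SqlBulkCopy(connectionString))
        {
            bulk.DestinationTableName = "MySchema.MyTable"; // matches the SQL example above
            bulk.BatchSize = 5000;                          // tune to your data volume
            bulk.WriteToServer(rows);                       // also accepts an IDataReader
        }
    }
}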

Jamiec
  • No, I don't already have a csv file. The data is scraped from APIs, Excel files, and multiple feeds, and put into my relational models. Is it quicker to use Entity Framework's context.AddRange(list), save changes, clear memory, and continue the loop, or to write to a text file, without storing in a list, and bulk insert? Are there better solutions? – MIKE Jun 23 '15 at 16:12
  • You'll never get insert performance like BULK INSERT from EF. Ever. The fastest way to load data is almost certainly BULK INSERT. – Jamiec Jun 23 '15 at 16:13
  • So you agree with my approach? If I have 10 relationships (Person, Person.School, School.Teachers, Person.Teacher... etc.), is writing each relationship to a text file and then bulk inserting each (command pattern style) the best approach? Even with transferring each file across a network to reach the db server and using GUIDs instead of auto-increment primary keys? This is the best solution I've come up with; I just wanted to know what experts think. – MIKE Jun 23 '15 at 16:16
  • Using GUIDs as primary keys is an option; just make sure you do not make them clustered, otherwise you will soon have a lot of fragmentation inside your tables (you can try to use sequential GUIDs instead). You will also get fragmentation if you create indexes on those GUIDs, which is usually done for foreign and primary keys (even if they are not clustered). However, if bulk insert is your only input to those tables, you can still use INTs as primary keys by specifying identity insert when bulk copying and filling in the INTs in the text files as well (a sketch follows these comments). Just make sure they match between entities. – Jure Jun 23 '15 at 16:33
  • I will not cluster on the GUID for the table that uses it as a primary key; however, tables that have a relationship to this table and hold a foreign key to that GUID will be clustered on the GUID. What do you mean by fragmentation? Also, what are your thoughts on the chances of a duplicate GUID? – MIKE Jun 23 '15 at 16:45
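A sketch of the identity-insert option mentioned in the comments above, using SqlBulkCopyOptions.KeepIdentity so int identity values already assigned client-side (and written into the files or lists) are preserved rather than regenerated; the table name is a placeholder:

using System.Data;
using System.Data.SqlClient;

class KeepIdentityExample
{
    public static void Load(string connectionString, DataTable peopleWithIds)
    {
        // KeepIdentity: take the identity values from the source data
        // instead of letting SQL Server generate new ones.
        using (var bulk = new SqlBulkCopy(connectionString, SqlBulkCopyOptions.KeepIdentity, null))
        {
            bulk.DestinationTableName = "dbo.Person"; // placeholder table
            bulk.WriteToServer(peopleWithIds);        // Id column already populated client-side
        }
    }
}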

I've built a small tool for that: https://github.com/MikaelEliasson/EntityFramework.Utilities#batch-insert-entities, or the NuGet link: https://www.nuget.org/packages/EFUtilities/

It will use SqlBulkCopy from your in-memory lists. It uses EF metadata, so you don't have to configure that yourself. The code looks like this:

using (var ctx = new Context())
{
    EFBatchOperation.For(ctx, ctx.Locations).InsertAll(locations);
}

This is from a small demo I made https://github.com/MikaelEliasson/EFUtilitiesDemos/blob/master/BulkInsertAndUpdate/Program.cs#L46

The speed depends a lot on how big your entities are. My tests show I can insert ~100,000 objects/s for medium-sized entities.

If you have GUIDs, the relational inserts should be fairly easy to do the way you already described.

Because you have multiple inserts, I suggest you use a transaction scope. See https://github.com/MikaelEliasson/EntityFramework.Utilities/issues/26
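A rough sketch of what that could look like, reusing the EFBatchOperation API shown above; the entity sets (Persons, Schools, Teachers) and the default timeout are assumptions you would adjust:

using System.Transactions;

// Wrap the related bulk inserts so either every table loads or none does.
// Note: the default TransactionScope timeout (1 minute) may be too short for large loads.
using (var scope = new TransactionScope())
using (var ctx = new Context())
{
    EFBatchOperation.For(ctx, ctx.Persons).InsertAll(persons);   // parents first
    EFBatchOperation.For(ctx, ctx.Schools).InsertAll(schools);
    EFBatchOperation.For(ctx, ctx.Teachers).InsertAll(teachers); // then children referencing their GUIDs
    scope.Complete(); // commit only if every insert succeeded
}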

EDIT

If you prefer to use ints or longs, that will be included in the next release. It will take a bit longer, but you can enable ID return for store-generated IDs.

See: https://github.com/MikaelEliasson/EntityFramework.Utilities/blob/release20/EntityFramework.Utilities/Tests/InsertTests.cs#L125

That code is working now, but the release is not ready. You could, however, download and build the release20 branch yourself if you want to try it now.

Mikael Eliasson