1

I have a fully working production site based on Entity Framework, and now I need to import a large amount of data into the database weekly. The data comes in the form of text files which I go through line by line, checking against the database to see if each record exists; if it does, I update anything that has changed, otherwise I just insert it.

The problem I'm having is that it takes around 32 hours to run the full import process, and some of the files have to be manually split into smaller chunks to avoid memory issues, seemingly caused by Entity Framework. I have managed to slow down the memory increase, but the last time I ran a file without splitting it, it ran for about 12 hours before running out of memory at somewhere over 1.5 GB.

So can someone suggest the best way of importing this data? I have heard of SqlBulkCopy but wasn't sure if it was the correct thing to use. Can anyone provide examples, or suggest anything more appropriate? For instance, should I create a duplicate of the entity using standard .NET SQL commands, and possibly use a stored procedure?

Tom Hawkin
  • 461
  • 6
  • 17
  • I suggest you use BCP to import the data and then use raw SQL to merge/update the data. This is not something you should consider doing with EF. – qujck Sep 05 '13 at 12:46
  • So line by line you parse the text file. Then you use EF line by line to either update or insert? Bulkcopy only does insert so need to know more about the process. Could batch the inserts for bulkcopy but then the potential problem is a record would not know it is an update if that record is in an insert batch not yet processed. – paparazzo Sep 05 '13 at 13:17
  • 2
    how many rows ? Take how long ? – phil soady Sep 05 '13 at 15:14
  • Here are some simple steps to improve the performance for such bulk inserts/updates with EF: http://stackoverflow.com/a/5942176/270591 – Slauma Sep 05 '13 at 18:03
  • The code reads the file line by line (it is a tab-separated txt file), then converts each line into an entity object. It then uses an ID in that object (which comes from the file) to look up the corresponding record in the database. If it exists, it checks what is different and updates it; if it doesn't, it creates a new entity in the database. If anyone could provide an example of this scenario with SQL bulk copy or standard SQL .NET methods, either using a stored procedure or just SQL statements, I would appreciate it. – Tom Hawkin Sep 06 '13 at 11:45

2 Answers

2

Although SqlBulkCopy is handy from managed code, I reckon the fastest way is to do it in "pure" SQL -- given that SqlBulkCopy doesn't easily do upserts, you would need to execute the MERGE part below anyway.

Assuming that your text file is in CSV format, that it exists on the SQL Server as "C:\Data\TheFile.txt", and that line endings are normalised as CR-LF (\r\n).

And let's assume that the data is ID,Value1,Value2.

This SQL command will insert the data into a staging table TheFile_Staging, which has ID,Value1,Value2 columns with compatible data types, and then update the "real" table TheFile_Table (note: code below not tested!):

    TRUNCATE TABLE TheFile_Staging;

    BULK INSERT TheFile_Staging FROM 'C:\Data\TheFile.txt'
    WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\r\n', FIRSTROW = 2);
    -- FIRSTROW = 2 means skip row #1 - use this when the first row is a header.

    MERGE TheFile_Table AS target
    USING (SELECT ID, Value1, Value2 FROM TheFile_Staging) AS source
    ON target.ID = source.ID
    WHEN MATCHED THEN
        UPDATE SET target.Value1 = source.Value1, target.Value2 = source.Value2
    WHEN NOT MATCHED THEN
        INSERT (ID, Value1, Value2) VALUES (source.ID, source.Value1, source.Value2);

You can create a stored procedure and schedule it to run, or invoke it from code, etc. The only problem with this approach is that error handling for BULK INSERT is a bit of a mess - but as long as the data coming in is OK, it's quite fast.
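For the "invoke from code" part, a minimal sketch using plain ADO.NET might look like this; dbo.usp_ImportTheFile is a hypothetical name for a procedure that wraps the TRUNCATE/BULK INSERT/MERGE above, and the connection string is whatever your site already uses:

    using System.Data;
    using System.Data.SqlClient;

    class ImportRunner
    {
        public static void RunImport(string connectionString)
        {
            using (var connection = new SqlConnection(connectionString))
            using (var command = new SqlCommand("dbo.usp_ImportTheFile", connection))
            {
                // Hypothetical procedure wrapping the TRUNCATE / BULK INSERT / MERGE above.
                command.CommandType = CommandType.StoredProcedure;
                command.CommandTimeout = 0; // bulk loads can take a while; 0 = no timeout

                connection.Open();
                command.ExecuteNonQuery();
            }
        }
    }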

Normally I'd add some kind of validation check in the WHERE clause of the USING() select of the MERGE to only take the rows that are valid in terms of data.

It's probably also worth pointing out that the definition of the staging table should omit any non-null, primary key and identity constraints, so that the data can be read in without error, especially if there are empty fields here and there in your source data. I also normally prefer to pull in date/time data as plain nvarchar - this way you avoid incorrectly formatted dates causing import errors, and your MERGE statement can perform a CAST or CONVERT as needed, while ignoring and/or logging any invalid data it comes across to an error table.

Stephen Byrne
  • 7,400
  • 1
  • 31
  • 51
  • I tried this but I could not get it working with my file. Here is a sample... "Customer_Id" "CustomerName" "CustomerAddress" "CustomerGender" "CustomerBirthDate" 1 "Jordan Reyes Toribio" "100 a test address" "FE" 1962-02-09 00:00:00 – Tom Hawkin Sep 06 '13 at 15:59
  • @TomHawkin, sorry to hear it's not working for you; can you give some details of that error you get when you try it? – Stephen Byrne Sep 07 '13 at 21:08
  • Currently the error is 'Bulk load data conversion error (type mismatch or invalid character for the specified codepage) for row 1, column 4 (CustomerGender).' The field is an nvarchar(10). – Tom Hawkin Sep 09 '13 at 08:19
  • @TomHawkin - is it perhaps the case that some of the data contains null for this field but the constraint on the table def is non-null? I should have mentioned that the staging table definition should normally have most constraints removed - see updated answer – Stephen Byrne Sep 09 '13 at 08:52
  • I realised the problem: the file has a header row. Is there a way to always skip the first line? – Tom Hawkin Sep 09 '13 at 09:13
  • The bulk import is now working, but as you can see from my test data a few comments up, the string fields have speech marks around them with tab delimiters. The speech marks are getting saved to the database and I would like to strip these out. Also, I can't get the MERGE to work at all; there are red lines everywhere even after I replaced the placeholders. – Tom Hawkin Sep 09 '13 at 14:31
  • This is what I currently have... truncate table Customers_temp BULK INSERT Customers_temp FROM'C:\Customers_update.txt' WITH (fieldterminator='\t', rowTerminator='\n', FirstRow=2) MERGE Customers AS target USING (SELECT * from Customers_temp) as source on (Destination.CustomerId = source.Customer_Id) WHEN MATCHED THEN UPDATE SET Destination.Name=source.CustomerName, [Destination.Address]=source.CustomerAddress WHEN NOT MATCHED THEN INSERT INTO Destination(Name,[Address]) VALUES (source.CustomerName,source.CustomerAddress); – Tom Hawkin Sep 09 '13 at 14:33
  • @TomHawkin - what version of SQL Server are you using? – Stephen Byrne Sep 09 '13 at 15:53
  • @TomHawkin; I have modified the syntax of the MERGE statement as I noticed I had made an error. It should work now, let us know if not. As to the leading and trailing " characters and so forth, well cleaning these is all part of the fun of bulk data import I'm afraid; you can do it on the fly during your insert statement I am sure, or make sure your source data doesn't have these artifacts in the first place... – Stephen Byrne Sep 09 '13 at 22:20
0

Sadly you need to move away from Entity Framework in this kind of scenario; out of the box EF only does line-by-line inserts. You can do interesting things like this, or you can completely disregard EF and manually code the class that will do the bulk inserts using ADO.Net (SqlBulkCopy).
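If you do go the SqlBulkCopy route, a rough sketch might look like the following. It only loads the staging table; the MERGE from the other answer (or something equivalent) still has to do the upsert. The table and column names (Customers_temp, Customer_Id, etc.) are taken from the comments above and are assumptions, and the parsing is deliberately naive:

    using System.Data;
    using System.Data.SqlClient;
    using System.IO;
    using System.Linq;

    class BulkLoader
    {
        public static void Load(string connectionString, string filePath)
        {
            // Build a DataTable matching the staging table layout (names are assumptions).
            var table = new DataTable();
            table.Columns.Add("Customer_Id", typeof(int));
            table.Columns.Add("CustomerName", typeof(string));
            table.Columns.Add("CustomerAddress", typeof(string));

            // Parse the tab-separated file, skipping the header row and
            // trimming the surrounding speech marks.
            foreach (var line in File.ReadLines(filePath).Skip(1))
            {
                var fields = line.Split('\t');
                table.Rows.Add(
                    int.Parse(fields[0].Trim('"')),
                    fields[1].Trim('"'),
                    fields[2].Trim('"'));
            }

            // Push the whole batch into the staging table in one go.
            using (var bulkCopy = new SqlBulkCopy(connectionString))
            {
                bulkCopy.DestinationTableName = "dbo.Customers_temp";
                bulkCopy.BulkCopyTimeout = 0; // no timeout
                bulkCopy.ColumnMappings.Add("Customer_Id", "Customer_Id");
                bulkCopy.ColumnMappings.Add("CustomerName", "CustomerName");
                bulkCopy.ColumnMappings.Add("CustomerAddress", "CustomerAddress");
                bulkCopy.WriteToServer(table);
            }
        }
    }

For very large files you may want to stream in chunks (or implement IDataReader) instead of building one big DataTable, but even this naive version avoids the per-row round trips that EF makes.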

Edit: you can also stick with the current approach if the performance is acceptable, but you will need to recreate the context periodically rather than using the same context for all records. I suspect that's the reason for the outrageous memory consumption.
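For example, a sketch of processing the file in batches with a fresh context per batch might look like this; MyDbContext, Customer and the parsing are placeholders for your own context, entity and existing line-to-entity code:

    using System.Collections.Generic;
    using System.ComponentModel.DataAnnotations;
    using System.Data.Entity;
    using System.IO;
    using System.Linq;

    public class Customer                     // placeholder entity
    {
        [Key]
        public int CustomerId { get; set; }
        public string Name { get; set; }
        public string Address { get; set; }
    }

    public class MyDbContext : DbContext      // placeholder for your real context
    {
        public DbSet<Customer> Customers { get; set; }
    }

    public static class BatchedImporter
    {
        const int BatchSize = 1000;

        public static void Run(string filePath)
        {
            var batch = new List<string>(BatchSize);
            foreach (var line in File.ReadLines(filePath).Skip(1)) // skip the header row
            {
                batch.Add(line);
                if (batch.Count == BatchSize)
                {
                    ProcessBatch(batch);
                    batch.Clear();
                }
            }
            if (batch.Count > 0)
                ProcessBatch(batch);
        }

        static void ProcessBatch(List<string> lines)
        {
            // A fresh, short-lived context per batch keeps the change tracker small.
            using (var context = new MyDbContext())
            {
                foreach (var line in lines)
                {
                    var incoming = ParseLine(line);
                    var existing = context.Customers.Find(incoming.CustomerId);
                    if (existing == null)
                        context.Customers.Add(incoming);
                    else
                        context.Entry(existing).CurrentValues.SetValues(incoming);
                }
                context.SaveChanges();
            } // disposing here releases everything the change tracker was holding
        }

        static Customer ParseLine(string line)
        {
            // Placeholder parsing: tab-separated fields wrapped in speech marks.
            var fields = line.Split('\t');
            return new Customer
            {
                CustomerId = int.Parse(fields[0].Trim('"')),
                Name = fields[1].Trim('"'),
                Address = fields[2].Trim('"')
            };
        }
    }

Even with batching this is still one lookup round trip per row, so it won't match the bulk-load approach in the other answer; its main benefit is keeping memory flat.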

Alex Paven
  • 5,539
  • 2
  • 21
  • 35