
Please see the following situation:

I have a CSV file from which I import a couple of fields (not all of them) into SQL Server, using Entity Framework with the Unit of Work and Repository design patterns.

var newGenericArticle = new GenericArticle
{
    GlnCode = data[2],
    Description = data[5],
    VendorId = data[4],
    ItemNumber = data[1],
    ItemUOM = data[3],
    VendorName = data[12]
};

var unitOfWork = new UnitOfWork(new AppServerContext());
unitOfWork.GenericArticlesRepository.Insert(newGenericArticle);

unitOfWork.Commit();

Now, the only way to uniquely identify a record is to check four fields: GlnCode, Description, VendorId and ItemNumber.

So, before I can insert a record, I need to check whether or not it exists:

 var unitOfWork = new UnitOfWork(new AppServerContext());

 // If the article is already existing, update the vendor name.
 if (unitOfWork.GenericArticlesRepository.GetAllByFilter(
         x => x.GlnCode.Equals(newGenericArticle.GlnCode) &&
              x.Description.Equals(newGenericArticle.Description) &&
              x.VendorId.Equals(newGenericArticle.VendorId) &&
              x.ItemNumber.Equals(newGenericArticle.ItemNumber)).Any())
 {
     var foundArticle = unitOfWork.GenericArticlesRepository.GetByFilter(
         x => x.GlnCode.Equals(newGenericArticle.GlnCode) &&
              x.Description.Equals(newGenericArticle.Description) &&
              x.VendorId.Equals(newGenericArticle.VendorId) &&
              x.ItemNumber.Equals(newGenericArticle.ItemNumber));

     foundArticle.VendorName = newGenericArticle.VendorName;

     unitOfWork.GenericArticlesRepository.Update(foundArticle);
 }

If it exists, I need to update it, as you can see in the code above.

Now, you need to know that I'm importing around 1,500,000 records, so quite a lot. And it's the filter that causes the CPU to reach almost 100%.

The `GetAllByFilter` method is quite simple and does the following:

return !Entities.Any() ? null : !Entities.Where(predicate).Any() ? null : Entities.Where(predicate).AsQueryable();

where `predicate` is an `Expression<Func<TEntity, bool>>`.

Is there anything that I can do to make sure that the server's CPU doesn't reach 100%?

Note: I'm using SQL Server 2012

Kind regards

Complexity
  • I would suggest [using a stored procedure](http://stackoverflow.com/a/17274826/806975), or trying a [bulk insert extension](https://efbulkinsert.codeplex.com/) – Guillermo Gutiérrez Apr 14 '15 at 14:04
  • Why are you so `.Any()` happy? You are literally querying the database 10 times for every insert. Now, granted, `Any` tends to use an EXISTS query, but it's still a query. In particular, you call `Entities.Any()`, then `Any` on the predicate, then return an `IQueryable` and then call `Any` on that again! Sheesh. – Erik Funkenbusch Apr 15 '15 at 17:40
  • But beyond that, EF is just not designed for this.. It's not a batch or bulk job processor... Use SqlBulkCopy class instead. – Erik Funkenbusch Apr 15 '15 at 17:43
  • @ErikFunkenbusch Any suggestion on how to get rid of the `Any()` implementation to make it more performant? – Complexity Apr 16 '15 at 06:34
  • Yes, just do a single query with a where clause and your four conditions with a SingleOrDefault (assuming it can only return a single record), and if it's null it means it doesn't exist, so skip the update. – Erik Funkenbusch Apr 16 '15 at 06:42
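
To illustrate the single-query approach Erik Funkenbusch describes in the comment above, here is a minimal sketch against the question's repository, assuming `GetByFilter` returns `null` when nothing matches (an assumption; if it throws instead, a `FirstOrDefault`-style repository method would be needed):

// One round trip per record instead of several: fetch the match once,
// then decide between insert and update.
var unitOfWork = new UnitOfWork(new AppServerContext());

var existingArticle = unitOfWork.GenericArticlesRepository.GetByFilter(
    x => x.GlnCode.Equals(newGenericArticle.GlnCode) &&
         x.Description.Equals(newGenericArticle.Description) &&
         x.VendorId.Equals(newGenericArticle.VendorId) &&
         x.ItemNumber.Equals(newGenericArticle.ItemNumber));

if (existingArticle == null)
{
    // No match: this is a new article.
    unitOfWork.GenericArticlesRepository.Insert(newGenericArticle);
}
else
{
    // Match found: only the vendor name needs refreshing.
    existingArticle.VendorName = newGenericArticle.VendorName;
    unitOfWork.GenericArticlesRepository.Update(existingArticle);
}

unitOfWork.Commit();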

3 Answers


Wrong tool for the task. You should never process a million+ records one at a time. Insert the records into a staging table using a bulk insert, clean them if need be, and then use a stored proc to do the processing in a set-based way, or use the tool designed for this: SSIS.
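
As a rough sketch of the bulk-insert step with `SqlBulkCopy` (the staging table `dbo.GenericArticles_Staging`, its column layout, and the `csvRows`/`connectionString` variables are illustrative assumptions, not part of this answer):

using System.Data;
using System.Data.SqlClient;

// Load the parsed CSV rows into a DataTable that mirrors the staging table.
var table = new DataTable();
table.Columns.Add("GlnCode", typeof(string));
table.Columns.Add("Description", typeof(string));
table.Columns.Add("VendorId", typeof(string));
table.Columns.Add("ItemNumber", typeof(string));
table.Columns.Add("ItemUOM", typeof(string));
table.Columns.Add("VendorName", typeof(string));

foreach (var data in csvRows) // csvRows: each parsed CSV line as a string[]
{
    table.Rows.Add(data[2], data[5], data[4], data[1], data[3], data[12]);
}

using (var connection = new SqlConnection(connectionString))
{
    connection.Open();
    using (var bulkCopy = new SqlBulkCopy(connection))
    {
        bulkCopy.DestinationTableName = "dbo.GenericArticles_Staging";
        bulkCopy.BatchSize = 10000; // stream in batches rather than row by row
        bulkCopy.WriteToServer(table);
    }
}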

HLGEM

I've found another solution which wasn't proposed here, so I'll be answering my own question.

I will have a temp table into which I import all the data, and after the import I'll execute a stored procedure that runs a MERGE command to populate the destination table. I believe this is the most performant approach.
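
For illustration, a hedged sketch of what such a MERGE could look like, shown here executed through the EF context for brevity; in the actual solution it would live inside the stored procedure. The staging and destination table names are assumptions:

// Set-based upsert: one statement handles all 1,500,000 rows on the server,
// instead of one query and one write per record from the client.
using (var context = new AppServerContext())
{
    context.Database.ExecuteSqlCommand(@"
        MERGE dbo.GenericArticles AS target
        USING dbo.GenericArticles_Staging AS source
            ON  target.GlnCode     = source.GlnCode
            AND target.Description = source.Description
            AND target.VendorId    = source.VendorId
            AND target.ItemNumber  = source.ItemNumber
        WHEN MATCHED THEN
            UPDATE SET target.VendorName = source.VendorName
        WHEN NOT MATCHED BY TARGET THEN
            INSERT (GlnCode, Description, VendorId, ItemNumber, ItemUOM, VendorName)
            VALUES (source.GlnCode, source.Description, source.VendorId,
                    source.ItemNumber, source.ItemUOM, source.VendorName);");
}

MERGE also supports a `WHEN NOT MATCHED BY SOURCE THEN DELETE` clause, which covers the deletion scenario raised in the comments below.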

Complexity

Have you indexed on those four fields in your database? That is the first thing that I would do.

Ok, I would recommend trying the following: [Improving bulk insert performance in Entity Framework](http://stackoverflow.com/questions/6107206/improving-bulk-insert-performance-in-entity-framework)

To summarize: do not call SaveChanges() after every insert or update. Instead, call it every 1-2k records so that the inserts/updates are sent to the database in batches.

Also, optionally change the following parameters on your context:

yourContext.Configuration.AutoDetectChangesEnabled = false;
yourContext.Configuration.ValidateOnSaveEnabled = false;
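
A minimal sketch of the batching idea, assuming the context exposes a `GenericArticles` set and that `articlesToImport` holds the records parsed from the CSV (both assumptions; the question only shows the repository layer):

var context = new AppServerContext();
context.Configuration.AutoDetectChangesEnabled = false;
context.Configuration.ValidateOnSaveEnabled = false;

int count = 0;
foreach (var article in articlesToImport)
{
    context.GenericArticles.Add(article);

    // Flush every 1,000 records instead of once per record, and recreate the
    // context so the change tracker does not grow unbounded.
    if (++count % 1000 == 0)
    {
        context.SaveChanges();
        context.Dispose();
        context = new AppServerContext();
        context.Configuration.AutoDetectChangesEnabled = false;
        context.Configuration.ValidateOnSaveEnabled = false;
    }
}

context.SaveChanges(); // flush the remaining records
context.Dispose();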
jle
  • Not done it yet; will do right away. Don't understand how I missed that. Any other ideas that you have? – Complexity Apr 14 '15 at 13:57
  • The fields are indexed right now, but the problem remains the same. I do have a table that has 4 keys (the 4 columns that make a record unique). After that, I've created an index on those 4 keys, but the problem still remains. – Complexity Apr 15 '15 at 12:42
  • In that case you might want to use a stored procedure with a MERGE statement. You can call stored procedures from entity framework – jle Apr 15 '15 at 13:06
  • I will give that a try, but is that such a performance boost, as I still need to execute the SP for every record in the file to import? Meaning 1,500,000 calls to the stored procedure? – Complexity Apr 15 '15 at 13:11
  • No, you can use a table-valued parameter and send it in batches. https://mikesdotnet.wordpress.com/2013/03/17/inserting-data-into-a-sql-server-database-using-a-table-valued-parameter/ – jle Apr 15 '15 at 13:49
  • Better link for EF: http://stackoverflow.com/questions/8157345/entity-framework-stored-procedure-table-value-parameter – jle Apr 15 '15 at 13:51
  • Ok, I'm a bit lost here. What's the first approach that I should try? Using a stored procedure with the MERGE command and executing it record by record, or using a table-valued parameter and sending the data in batches? Or do those two go together, so that the first can't be used without the second? – Complexity Apr 15 '15 at 13:51
  • You can send your data in a table parameter and use the merge command on that data to either insert or update. – jle Apr 15 '15 at 13:52
  • Actually, it looks like you can use this method as well: http://stackoverflow.com/questions/6107206/improving-bulk-insert-performance-in-entity-framework?lq=1 – jle Apr 15 '15 at 13:53
  • Now I see it, and it never does a delete, so records will never be removed. Any idea on how to accomplish the same if some records might be removed? – Complexity Apr 15 '15 at 13:54
  • I've accepted your answer already, but I do hope that you can point me out on how to work with table parameters and the merge command in combination with deletion of data. – Complexity Apr 15 '15 at 13:58
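
For reference, a hedged sketch of the table-valued parameter approach described in the comments above. The table type `dbo.GenericArticleType` and the procedure `dbo.usp_MergeGenericArticles` are illustrative assumptions; the procedure would run a MERGE like the one sketched earlier against the parameter:

using System.Data;
using System.Data.SqlClient;

// Assumes server-side objects along these lines (assumptions, not shown in the thread):
//   CREATE TYPE dbo.GenericArticleType AS TABLE (GlnCode ..., Description ..., ...);
//   CREATE PROCEDURE dbo.usp_MergeGenericArticles
//       @Articles dbo.GenericArticleType READONLY ...
var articlesParameter = new SqlParameter("@Articles", SqlDbType.Structured)
{
    TypeName = "dbo.GenericArticleType", // the user-defined table type
    Value = table                        // e.g. the DataTable built for SqlBulkCopy above
};

using (var connection = new SqlConnection(connectionString))
using (var command = new SqlCommand("dbo.usp_MergeGenericArticles", connection))
{
    command.CommandType = CommandType.StoredProcedure;
    command.Parameters.Add(articlesParameter);
    connection.Open();
    command.ExecuteNonQuery(); // a single round trip carries the whole batch
}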