
I'm inserting a lot of data wrapped in a transaction (2 million+ rows at a time) using EF 4.1. Now I'd like to add UPDATE logic. Keep in mind that change tracking is disabled given the volume of data. Off the top of my head, I'd do something like this:

// Obviously simplified code...
public void AddOrUpdate(Foo foo)
{
    if(!db.Foos.Any(x => someEqualityTest(x, foo)))
    {
        db.Foos.Add(foo);
    }

    else
    {
        var f = db.Foos.First(x => someEqualityTest(x, foo));
        f = foo;
    }

    db.SaveChanges();
}

Any ideas on how possibly to improve on this?

Didaxis
  • Can you shed a little more light on how you check if two instances of Foo are equal? Is it a simple ID comparison? – Justin Rusbatch Mar 28 '12 at 17:43
  • Could be, I was just trying to generalize the specifics, and focus on the larger logical aspects of the solution – Didaxis Mar 28 '12 at 18:33

2 Answers


I would keep the inserts separate from the updates.

For the inserts, I would recommend using SqlBulkCopy to insert the records that don't already exist; it will be far faster than inserting them one by one through EF.

First, the Bulk Insert method in your DbContext:

public class YourDbContext : DbContext
{
    // Requires: using System.Data; using System.Data.SqlClient;
    //           using System.ComponentModel;
    public void BulkInsert<T>(string tableName, IList<T> list)
    {
        // SqlBulkCopy needs an open SqlConnection; EF's Database.Connection
        // is a DbConnection, so cast it and open it if necessary.
        var connection = (SqlConnection)base.Database.Connection;
        if (connection.State != ConnectionState.Open)
        {
            connection.Open();
        }

        using (var bulkCopy = new SqlBulkCopy(connection))
        {
            bulkCopy.BatchSize = list.Count;
            bulkCopy.DestinationTableName = tableName;

            var table = new DataTable();
            var props = TypeDescriptor.GetProperties(typeof(T))
                          // Dirty hack to make sure we only have system 
                          // data types (i.e. filter out the 
                          // relationships/collections)
                          .Cast<PropertyDescriptor>()
                          .Where(p => "System" == p.PropertyType.Namespace)
                          .ToArray();

            foreach (var prop in props)
            {
                bulkCopy.ColumnMappings.Add(prop.Name, prop.Name);

                var type = Nullable.GetUnderlyingType(prop.PropertyType) 
                           ?? prop.PropertyType;

                table.Columns.Add(prop.Name, type);
            }

            var values = new object[props.Length];
            foreach (var item in list)
            {
                for (var i = 0; i < values.Length; i++)
                {
                    values[i] = props[i].GetValue(item);
                }

                table.Rows.Add(values);
            }

            bulkCopy.WriteToServer(table);
        }
    }
}

Then, for your insert/update:

public void AddOrUpdate(IList<Foo> foos)
{
    var foosToUpdate = db.Foos.Where(x => foos.Contains(x)).ToList();

    var foosToInsert = foos.Except(foosToUpdate).ToList();

    foreach (var foo in foosToUpdate)
    {
        var f = db.Foos.First(x => someEqualityTest(x, foo));

        // update the existing foo `f` with values from `foo`
    }

    // Insert the new Foos to the table named "Foos"
    db.BulkInsert("Foos", foosToInsert);

    db.SaveChanges();
}
Justin Rusbatch
  • Thank you, I've created a provider-based solution, and indeed the current provider employs SqlBulkCopy (inserts only), and yes, it is way faster! I'm revisiting the EF provider because now I'd like insert/update logic as well. But you do provide a very clever solution, I like it! – Didaxis Mar 28 '12 at 18:11
  • Switched answer. I think your solution is the right approach here – Didaxis Mar 28 '12 at 18:15
  • @ErOx and Justin: I really disagree with that solution. The BulkInsert is good! BUT: You asked about UPDATES and for UPDATES this code is a bad recommendation. First: The `f = foo` trick does not work (see my answer). Second: `x => foos.Contains(x)` will throw an exception with EF. It must be `x => foos.Select(f => f.Id).Contains(x.Id)`. Third and most important: Using `Contains` for many objects is a total performance killer: http://stackoverflow.com/a/8108643/270591 (+ the link in that answer). – Slauma Mar 28 '12 at 18:24
  • @Slauma Thanks for the feedback. The `Contains` call was something I was concerned about, which is why I commented on the question asking for clarification about how the `someEqualityTest` method works. – Justin Rusbatch Mar 28 '12 at 18:27
  • Thank you Slauma. I'm sure my actual solution will be a combination of things here, and definitely thanks for the link to your "Contains" analysis, I'll obviously use the SqlQuery method now! – Didaxis Mar 28 '12 at 18:32

Your update...

var f = db.Foos.First(x => someEqualityTest(x, foo));
f = foo;

...won't work because you are not changing the loaded and attached object `f` at all; you just overwrite the variable `f` with the detached object `foo`. The attached object is still in the context, but it has not been changed since loading, and you no longer have a variable pointing to it. `SaveChanges` will do nothing in this case.

The "standard options" you have are:

var f = db.Foos.First(x => someEqualityTest(x, foo));
db.Entry(f).State = EntityState.Modified;

or just

db.Entry(foo).State = EntityState.Modified;
// attaches as Modified, no need to load f

This marks ALL properties as modified - no matter if they really changed or not - and will send an UPDATE for each column to the database.

The second option only marks the properties that really changed as modified and only sends an UPDATE for the changed columns:

var f = db.Foos.First(x => someEqualityTest(x, foo));
db.Entry(f).CurrentValues.SetValues(foo);

Now, with 2 million objects to update you don't have a "standard" situation, and it is possible that both options - especially the second, which likely uses reflection internally to match the property names of the source and target objects - are too slow.

The best option when it comes to update performance is change tracking proxies. This means that you need to mark EVERY property in your entity class as virtual (not only the navigation properties, but also the scalar properties) and that you don't disable the creation of change tracking proxies (it is enabled by default).
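For illustration, a proxy-compatible entity could look like this (a minimal sketch; the property names are placeholders, since your real Foo is not shown):

public class Foo
{
    // The class must be public and not sealed, and EVERY property must
    // be virtual with a public getter and setter, otherwise EF cannot
    // derive a change tracking proxy from it.
    public virtual int Id { get; set; }
    public virtual string Name { get; set; }
    public virtual DateTime ModifiedAt { get; set; }
}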

When you load your object f from the database, EF will then create a dynamic proxy object (derived from your entity), similar to lazy loading proxies, which has code injected into every property setter to maintain a flag indicating whether the property has been changed.

The change tracking provided by proxies is much faster than the snapshot-based change tracking (which happens in SaveChanges or DetectChanges).
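To illustrate the difference (a sketch; the Name property and the Id lookup are placeholders):

// Snapshot tracking: SaveChanges/DetectChanges compares every property
// of every tracked entity against a snapshot taken at load time.
// Proxy tracking: the generated setter records the change the moment
// it happens, so there is nothing left to scan at SaveChanges time.
var f = db.Foos.First(x => x.Id == foo.Id); // f is a dynamic proxy
f.Name = foo.Name;  // the proxy setter marks Name as modified immediately
db.SaveChanges();   // sends an UPDATE for the changed columns only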

I am not sure, though, whether the two options mentioned above are faster when you use change tracking proxies. It is possible that you need manual property assignments to get the best performance:

var f = db.Foos.First(x => someEqualityTest(x, foo));
f.Property1 = foo.Property1;
f.Property2 = foo.Property2;
// ...
f.PropertyN = foo.PropertyN;

In my experience, in a similar update situation with a few thousand objects there is no real alternative to change tracking proxies regarding performance.
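And given your data volume, one more point (see also the comments below): a single context holding millions of tracked proxies will run out of memory, so you will likely have to process the updates in batches and recycle the context. A sketch, assuming an Id key and a hypothetical YourDbContext:

public void UpdateInBatches(IList<Foo> foos, int batchSize = 1000)
{
    for (var i = 0; i < foos.Count; i += batchSize)
    {
        // A fresh context per batch keeps the tracked proxy graph small.
        using (var db = new YourDbContext())
        {
            foreach (var foo in foos.Skip(i).Take(batchSize))
            {
                var f = db.Foos.First(x => x.Id == foo.Id);
                f.Property1 = foo.Property1;
                // ... copy the remaining properties
            }

            db.SaveChanges(); // one UPDATE per changed entity
        }
    }
}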

Slauma
  • Excellent, thoughtful answer! Just the kind of thing I was looking for! – Didaxis Mar 28 '12 at 17:35
  • @ErOx: BTW: Did you really insert 2 million objects successfully with a single `SaveChanges` call at the end? I've just grabbed an old measurement for just half a million objects, and I had an out-of-memory exception when I called `SaveChanges` only once: http://stackoverflow.com/a/5942176/270591 – Slauma Mar 28 '12 at 17:53
  • Good catch - no, I'm recycling the context every so many additions, very similar to the other question you linked to. So yes, the code above is quite simplified – Didaxis Mar 28 '12 at 18:05