36

I'm using ASP.NET MVC4 with Entity Framework Code First. I have a table called "users", with primary key "UserId". This table may have 200,000+ entries.

I need to insert another 50 users. I might do this like:

foreach (User user in NewUsers)
{
    dbcontext.Users.Add(user);
}
dbcontext.SaveChanges();

The problem is, one or more of those new users might already exist in the DB. If I add them and then try to save, it throws an error and none of the valid ones get added. I could modify the code to do this:

foreach (User user in NewUsers)
{
    if (dbcontext.Users.FirstOrDefault(u => u.UserId == user.UserId) == null)
    {
        dbcontext.Users.Add(user);
    }
}
dbcontext.SaveChanges();

which would work. The problem is that it then has to run a query 50 times against a 200,000+ row table. So my question is: what is the most performance-efficient method of inserting these users, ignoring any duplicates?

Jordan
  • perhaps `context.AddOrUpdate(user);` is what you want – Johan Larsson Aug 07 '13 at 20:46
  • @O.O It's not actually users, it's different data that's being pulled from an API. The API may or may not provide the same data in multiple sequential calls. I just used "Users" because it was the first example that I thought of. – Jordan Aug 07 '13 at 20:56
  • I had the same problem and didn't find an adequate solution. There are valid scenarios for this, such as when doing a bulk import from CSV where the email address needs to be unique in an existing database. Reading all existing keys into memory doesn't seem good for performance, neither is adding each entry separately. It seems that what is needed is something like INSERT IGNORE. – acarlon Oct 10 '13 at 11:28

6 Answers

17

You can do this:

var newUserIDs = NewUsers.Select(u => u.UserId).Distinct().ToArray();
var usersInDb = dbcontext.Users.Where(u => newUserIDs.Contains(u.UserId))
                               .Select(u => u.UserId).ToArray();
var usersNotInDb = NewUsers.Where(u => !usersInDb.Contains(u.UserId));
foreach (User user in usersNotInDb)
{
    dbcontext.Users.Add(user);
}

dbcontext.SaveChanges();

This will execute a single query against your database to find the users that already exist, then filter them out of your NewUsers set in memory.
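
If NewUsers ever grows large, the in-memory lookup can use a HashSet and the adds can be batched with AddRange (available in EF6+). A minimal sketch of that variation, assuming UserId is an int:

// Single round trip to fetch only the IDs that already exist.
var newUserIds = NewUsers.Select(u => u.UserId).Distinct().ToArray();
var existingIds = new HashSet<int>(
    dbcontext.Users
             .Where(u => newUserIds.Contains(u.UserId))
             .Select(u => u.UserId));

// O(1) lookups while filtering, then one batched Add and one save.
dbcontext.Users.AddRange(NewUsers.Where(u => !existingIds.Contains(u.UserId)));
dbcontext.SaveChanges();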

p.s.w.g
  • For large collections, would it be better to use HashSets instead of .ToArray()? – tbmsu Apr 27 '16 at 15:12
  • @tbmsu For entity framework calls, it probably wouldn't have any performance effect because the `Contains` actually gets translated to a SQL `IN` clause so it's really dependent on DB performance more than anything else. Also note that for *large* data sets, this probably wouldn't work at all because there's a limit to the number of values you can use in an `IN` clause ([ref](http://stackoverflow.com/questions/1069415/limit-on-the-where-col-in-condition)). – p.s.w.g Apr 27 '16 at 16:23
  • Could you please clarify what will execute in a single query? Would the `SaveChanges()` call cause a single query, or would the entire snippet? – Nexaspx Apr 02 '19 at 08:47
  • @Nexaspx `context.Add(user)` marks each `user` entity as pending insertion, but nothing is actually executed against your database until you run `SaveChanges()`. – p.s.w.g Apr 02 '19 at 15:58
  • that really sucks bro! you are literally loading the whole DB into memory!!! NOT an optimal solution! – Mohammad Khodabandeh Nov 01 '22 at 13:47
  • @MohammadKhodabandeh Not really. It only selects the subset of `UserId`'s that are in the list of `NewUsers`. Sure if you're doing something like trying to merge millions of records, there are better solutions, but OP suggested `NewUsers` had ~50 items, so the maximum size of the result set in `usersInDb` would also be ~50. – p.s.w.g Dec 21 '22 at 19:16
3

Since this is your primary key, your options are limited. If this weren't your primary key but just a unique index then, assuming SQL Server, you could create the unique index with IGNORE_DUP_KEY so that the server silently discards duplicate rows.

What I might suggest is to simply wrap a try/catch around each save and swallow the exception if it is a duplicate-key error.

You might also see if your set supports the AddOrUpdate() method. I know this is supported in Code First implementations. In this case it will insert the row if it is new or update it if it already exists. However, this still involves a trip to the DB per entity to see whether the user already exists, in order to know whether to add or update. And in some cases you might not want to actually perform an update.
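
For reference, a rough sketch of that route; the AddOrUpdate extension lives in System.Data.Entity.Migrations and matches on the key you pass it, but unlike an "insert and ignore" it will overwrite rows that already exist:

using System.Data.Entity.Migrations;  // home of the AddOrUpdate extension

// Matches on UserId; inserts rows that are new, updates rows that exist.
dbcontext.Users.AddOrUpdate(u => u.UserId, NewUsers.ToArray());
dbcontext.SaveChanges();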

I think if it were me, I'd go the Try/Catch route.
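
A minimal sketch of the try/catch route, saving each user individually so one duplicate doesn't roll back the rest; it assumes the duplicate surfaces as a DbUpdateException:

using System.Data.Entity;                 // EntityState
using System.Data.Entity.Infrastructure;  // DbUpdateException

foreach (User user in NewUsers)
{
    dbcontext.Users.Add(user);
    try
    {
        dbcontext.SaveChanges();
    }
    catch (DbUpdateException)
    {
        // In production, inspect the inner SqlException for the duplicate-key
        // error numbers (2601/2627) before swallowing it. Detach the entity
        // so it doesn't fail again on the next SaveChanges call.
        dbcontext.Entry(user).State = EntityState.Detached;
    }
}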

Randy Minder
  • So you're saying put the SaveChanges() with a try/catch around it in the for-loop. Would this not add a lot of overhead from having 50 different SaveChanges() calls and their corresponding transactions and write operations? – Jordan Aug 07 '13 at 20:51
  • @Jordan - Regarding the transactions, there should only be one transaction. You shouldn't have multiple transactions occurring here. Regarding the overhead, yes, there will be some. But will it be greater than making 50 (or whatever) separate calls to see if each user exists first? The best way to find out is to test it and see. Personally, I think I'd rather take the in-memory overhead of using Try/Catch than making repeated DB calls. But it's going to depend on how many users you typically enter at one time. – Randy Minder Aug 07 '13 at 20:58
3

You can filter out the existing users with one query

foreach (User user in NewUsers.Where(us => !dbcontext.Users.Any(u => u.UserId == us.UserId)))
{
    dbcontext.Users.Add(user);
}
dbcontext.SaveChanges();

EDIT:

As pointed out in the comments, the proposal above results in a SQL call for each element in the NewUsers collection. I confirmed that with SQL Server Profiler.

One interesting result of the profiling is the somewhat weird SQL generated by EF for each item (model names are different than in the OP, but the query is the same):

exec sp_executesql N'SELECT 
CASE WHEN ( EXISTS (SELECT 
    1 AS [C1]
    FROM [dbo].[EventGroup] AS [Extent1]
    WHERE [Extent1].[EventGroupID] = @p__linq__0
)) THEN cast(1 as bit) WHEN ( NOT EXISTS (SELECT 
    1 AS [C1]
    FROM [dbo].[EventGroup] AS [Extent2]
    WHERE [Extent2].[EventGroupID] = @p__linq__0
)) THEN cast(0 as bit) END AS [C1]
FROM  ( SELECT 1 AS X ) AS [SingleRowTable1]',N'@p__linq__0 int',@p__linq__0=10

Quite a nice piece of code to do the job of a simple one-liner.

My point of view is that writing nice, readable declarative code and letting the compiler and optimizer do the dirty work is a great approach. This is one of the cases where the result of that style is surprising and you have to get your hands dirty.
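
If you don't have Profiler at hand, EF6 can also log the SQL it generates via the Database.Log hook, which is an easy way to count how many statements a given LINQ shape really produces:

// Writes every SQL statement (and its parameters) EF sends to the console.
dbcontext.Database.Log = Console.Write;

// Now run the code under test and count the SELECTs in the output.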

Hari
  • Works perfectly, and only needs to run once (the Where query only runs one time on the entire table, then the foreach iterates through the results). I like it. – Jordan Aug 07 '13 at 21:01
  • @p.s.w.g Nice catch, I edited the answer to fix the syntax errors I noticed. – Hari Aug 07 '13 at 21:01
  • @Jordan No it doesn't if `NewUsers` is an in-memory collection. In that case it has to loop through every item and evaluate the `Any` on each one. This doesn't really improve on your original code, it just rearranges it in a more aesthetically pleasing manner. – p.s.w.g Aug 07 '13 at 21:06
  • Oh I see what you're saying. Hmm. – Jordan Aug 07 '13 at 21:09
  • Well I cannot defend my answer right now, I will execute it tomorrow and either confirm the objections of @p.s.w.g and delete the answer or leave it alone – Hari Aug 07 '13 at 21:10
  • @Hari Knowing when and how Linq queries get executed is very important, especially when it comes to performance. I'd say its fine to leave the answer up, as a valid way to make a bit cleaner, as long as you remove the incorrect statement about it requiring only one query. – p.s.w.g Aug 07 '13 at 21:18
  • Why use a foreach instead of AddRange? – IRONicMAN Nov 19 '18 at 14:01
2

You can just filter out the failures and keep trying until it succeeds or you get another sort of exception:

public partial class YourEntities : DbContext
{
    public override int SaveChanges()
    {
        while (true)
        {
            try
            {
                // Returns as soon as a save succeeds.
                return base.SaveChanges();
            }
            catch (DbUpdateException ex)
            {
                foreach (var entry in ex.Entries)
                {
                    // Detach the offending entity so it is no longer tracked,
                    // then retry the save with the remaining entities.
                    entry.State = EntityState.Detached;
                }
            }
        }
    }
}

You might want to put in some extra logic to avoid an infinite loop.
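
For example, a simple retry cap (the limit of 3 is arbitrary, and the exception filter needs C# 6 or later):

public override int SaveChanges()
{
    const int maxAttempts = 3;  // arbitrary upper bound on retries

    for (var attempt = 1; ; attempt++)
    {
        try
        {
            return base.SaveChanges();
        }
        catch (DbUpdateException ex) when (attempt < maxAttempts)
        {
            foreach (var entry in ex.Entries)
            {
                // Drop the failing entities and try the save again;
                // on the last attempt the exception propagates instead.
                entry.State = EntityState.Detached;
            }
        }
    }
}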

Rob Sedgwick
0

The following extension method will allow you to insert records of any type while ignoring duplicates (it relies on a `_dataService` helper that can create extra context instances for the same model):

public static void AddRangeIgnore(this DbSet dbSet, IEnumerable<object> entities)
{
    var entitiesList = entities.ToList();
    var firstEntity = entitiesList.FirstOrDefault();

    if (firstEntity == null || !firstEntity.HasKey() || firstEntity.HasIdentityKey())
    {
        dbSet.AddRange(entitiesList);
        return;
    }

    var uniqueEntities = new List<object>();

    // Use a separate context to look up existing rows by primary key.
    using (var dbContext = _dataService.CreateDbContext())
    {
        var uniqueDbSet = dbContext.Set(firstEntity.GetType());

        foreach (object entity in entitiesList)
        {
            var keyValues = entity.GetKeyValues();
            var existingEntity = uniqueDbSet.Find(keyValues);

            if (existingEntity == null)
            {
                uniqueEntities.Add(entity);
                uniqueDbSet.Attach(entity);
            }
        }
    }

    dbSet.AddRange(uniqueEntities);
}

public static object[] GetKeyValues(this object entity)
{
    using (var dbContext = _dataService.CreateDbContext())
    {
        var entityType = entity.GetType();
        dbContext.Set(entityType).Attach(entity);
        var objectStateEntry = ((IObjectContextAdapter)dbContext).ObjectContext.ObjectStateManager.GetObjectStateEntry(entity);
        return objectStateEntry.EntityKey
                               .EntityKeyValues
                               .Select(kv => kv.Value)
                               .ToArray();
    }
}

public static bool HasKey(this object entity)
{
    using (var dbContext = _dataService.CreateDbContext())
    {
        var entityType = entity.GetType();
        dbContext.Set(entityType).Attach(entity);
        var objectStateEntry = ((IObjectContextAdapter)dbContext).ObjectContext.ObjectStateManager.GetObjectStateEntry(entity);
        return objectStateEntry.EntityKey != null;
    }
}

public static bool HasIdentityKey(this object entity)
{
    using (var dbContext = _dataService.CreateDbContext())
    {
        var entityType = entity.GetType();
        dbContext.Set(entityType).Attach(entity);
        var objectStateEntry = ((IObjectContextAdapter)dbContext).ObjectContext.ObjectStateManager.GetObjectStateEntry(entity);
        var keyPropertyName = objectStateEntry.EntityKey
                                              .EntityKeyValues
                                              .Select(kv => kv.Key)
                                              .FirstOrDefault();

        if (keyPropertyName == null)
        {
            return false;
        }

        var keyProperty = entityType.GetProperty(keyPropertyName);
        var attribute = (DatabaseGeneratedAttribute)Attribute.GetCustomAttribute(keyProperty, typeof(DatabaseGeneratedAttribute));
        return attribute != null && attribute.DatabaseGeneratedOption == DatabaseGeneratedOption.Identity;
    }
}
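
Hypothetical usage, assuming the `_dataService` field referenced above is wired up to create context instances for your model; `DbContext.Set(Type)` returns the non-generic DbSet the extension targets:

// Stage only the rows whose primary key doesn't already exist, then save.
var usersSet = dbcontext.Set(typeof(User));   // non-generic DbSet
usersSet.AddRangeIgnore(NewUsers);
dbcontext.SaveChanges();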
Matt Sinclair
0

I found the package https://github.com/artiomchi/FlexLabs.Upsert in the comments of the EF Core feature request for this: https://github.com/dotnet/efcore/issues/16949#issuecomment-911120274

Using its NoUpdate() method, which maps to INSERT IGNORE syntax:

db.Set<User>().UpsertRange(NewUsers).NoUpdate().Run();

you can insert the entities without handling duplicate-key errors and without first doing a SELECT to filter out the entities that already exist in the DB.
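
A slightly fuller sketch, issuing one batched upsert for the whole collection instead of one call per item; the `.On(...)` match selector and the async `RunAsync()` shown here are optional variations of the same builder, as I understand the library:

// One round trip for the whole batch; NoUpdate() turns the upsert into an
// insert-that-ignores-duplicates (e.g. INSERT IGNORE on MySQL).
await db.Set<User>()
        .UpsertRange(NewUsers)
        .On(u => u.UserId)   // column(s) used to detect an existing row
        .NoUpdate()
        .RunAsync();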

n0099
  • So does this prevent one query per user? I don't think so. Either way, you have to address the issue in the question. Offering an alternative upsert solution isn't asked for and isn't enough. – Gert Arnold Jan 13 '23 at 19:22
  • From the question: `So my question is, what is the most performance efficient method of inserting these users, ignoring any duplicates?` I think `INSERT IGNORE` will be the most convenient (just add `IGNORE` in the raw SQL; for this library it will be like `db.Set().UpsertRange(NewUsers).NoUpdate().Run()`) and fastest (no need to `SELECT` all existing records from the DB) approach. – n0099 Jan 13 '23 at 19:23
  • Please don't add new info in comments. If this library deals with individual upserts it's highly unlikely that it is faster, whichever setting you use. You have to test if it doesn't execute one query per item to check if it exists. – Gert Arnold Jan 13 '23 at 19:35
  • It doesn't have to. The `INSERT IGNORE` syntax is handled server-side by the database; its design goal is to let the client easily insert many records that might duplicate existing ones into a table without manually distinguishing which record exists and which doesn't. – n0099 Jan 13 '23 at 19:47
  • And I'm currently using this library, within MySQL general log I see statements like `INSERT IGNORE INTO table (fields) VALUES (entity1,entity2,entity3)`. – n0099 Jan 13 '23 at 19:52
  • However, if `.Run()` gets invoked inside each loop iteration (like the asker is doing with `DbSet<>.Add()` in a `foreach`), then it will produce as many queries as there are iterations, since `.Run()` eventually invokes [`DbContext.Database.ExecuteSqlRaw`](https://learn.microsoft.com/en-us/dotnet/api/microsoft.entityframeworkcore.relationaldatabasefacadeextensions.executesqlraw), [source](https://github.com/artiomchi/FlexLabs.Upsert/blob/732ea9d2c1ad080aecbd2ee1aac2aba96c3df0f7/src/FlexLabs.EntityFrameworkCore.Upsert/Runners/RelationalUpsertCommandRunner.cs#L379). – n0099 Jan 13 '23 at 19:52