0

I have records in two databases. That is the entity in the first database:

public class PersonInDatabaseOne
{
    public string Name { get; set; }
    public string Surname { get; set; }
}

That is the entity in the second database:

public class PersonInDatabaseTwo
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
}

How can I get records from the second database which don't exist in the first database (the first name and the last name must be different than in the first database). Now I have something like that but that is VERY SLOW, too slow:

List<PersonInDatabaseOne> peopleInDatabaseOne = new List<PersonInDatabaseOne>();
// Hear I generate objects but in real I take it from database:
for (int i = 0; i < 100000; i++)
{
    peopleInDatabaseOne.Add(new PersonInDatabaseOne { Name = "aaa" + i, Surname = "aaa" + i });
}


List<PersonInDatabaseTwo> peopleInDatabaseTwo = new List<PersonInDatabaseTwo>();
// Hear I generate objects but in real I take it from database:
for (int i = 0; i < 10000; i++)
{
    peopleInDatabaseTwo.Add(new PersonInDatabaseTwo { FirstName = "aaa" + i, LastName = "aaa" + i });
}
for (int i = 0; i < 10000; i++)
{
    peopleInDatabaseTwo.Add(new PersonInDatabaseTwo { FirstName = "bbb" + i, LastName = "bbb" + i });
}


List<PersonInDatabaseTwo> peopleInDatabaseTwoWhichNotExistInDatabaseOne = new List<PersonInDatabaseTwo>();

// BELOW CODE IS VERY SLOW:
foreach (PersonInDatabaseTwo personInDatabaseTwo in peopleInDatabaseTwo)
{
    if (!peopleInDatabaseOne.Any(x => x.Name == personInDatabaseTwo.FirstName && x.Surname == personInDatabaseTwo.LastName))
    {
        peopleInDatabaseTwoWhichNotExistInDatabaseOne.Add(personInDatabaseTwo);
    }
};
Alexander
  • 31
  • 5

2 Answers2

4

The fastest way is dependent on the number of entities, and what indexes you already have.

  • If there's a few entities, what you already have performs better because multiple scans of a small set takes less than creating HashSet objects.

  • If all of your entities fit in the memory, the best way is to build HashSet out of them, and use Except which is detailed nicely by @alex.feigin.

  • If you can't afford loading all entities in the memory, you need to divide them into bulks based on the comparison key and load them into memory and apply the HashSet method repeatedly. Note that bulks can't be based on the number of records, but on the comparison key. For example, load all entities with names starting with 'A', then 'B', and so on.

  • If you already have an index on the database on the comparison key (like, in your case, FirstName and LastName) in one of the databases, you can retrieve a sorted list from the database. This will help you do binary search (http://en.wikipedia.org/wiki/Binary_search_algorithm) on the sorted list for comparison. See https://msdn.microsoft.com/en-us/library/w4e7fxsh(v=vs.110).aspx

  • If you already have an index on the database on the comparison key on both databases, you can get to do this in O(n), and in a scalable way (any number of records). You need to loop through both lists and find the differences only once. See https://stackoverflow.com/a/161535/187996 for more details.

Community
  • 1
  • 1
Iravanchi
  • 5,139
  • 9
  • 40
  • 56
1

Edit: with respect to the comments - using a real model and a dictionary instead of a simple set:

Try hashing your list into a Dictionary to hold your people objects, as the key - try a Tuple instead of a name1==name2 && lname1==lname2.

This will potentially then look like this:

// Some people1 and people2 lists of models already exist:
var sw = Stopwatch.StartNew();
var removeThese = people1.Select(x=>Tuple.Create(x.FirstName,x.LastName));
var dic2 = people2.ToDictionary(x=>Tuple.Create(x.Name,x.Surname),x=>x);
var result =  dic2.Keys.Except(removeThese).Select(x=>dic2[x]).ToList();
Console.WriteLine(sw.Elapsed);

I hope this helps.

alex.feigin
  • 386
  • 1
  • 11
  • I think you need to start the stopwatch before building the HashSet objects for benchmarking purpose, because their construction is a part of the algorithm used to diff between the lists. – Iravanchi Apr 25 '15 at 13:26
  • I can't use Except because I have entities differnt types in two databases and they have much more properties than only firstname and lastname – Alexander Apr 25 '15 at 13:28
  • 1
    You would do better to hash the results, and use Except anyway.. by for example using a Dictionary - with the key as above - and the value a real model. Admittedly this will take some additional space, but I see no better solution. – alex.feigin Apr 25 '15 at 13:31