4

I have two entities, Class and Student, linked in a many-to-many relationship.

When data is imported from an external application, unfortunately some classes are created in duplicate. The 'duplicate' classes have different names, but the same subject and the same students.

For example:

{ Id = 341, Title = '10rs/PE1a', SubjectId = 60, Students = { Jack, Bill, Sarah } }

{ Id = 429, Title = '10rs/PE1b', SubjectId = 60, Students = { Jack, Bill, Sarah } }

There is no general rule for matching the names of these duplicate classes, so the only way to identify that two classes are duplicates is that they have the same SubjectId and Students.

I'd like to use LINQ to detect all duplicates (and ultimately merge them). So far I have tried:

var sb = new StringBuilder();
using (var ctx = new Ctx()) {
  ctx.CommandTimeout = 10000; // Because the next line takes so long!
  var allClasses = ctx.Classes.Include("Students").OrderBy(o => o.Id);
  foreach (var c in allClasses) {
    var duplicates = allClasses.Where(o => o.SubjectId == c.SubjectId && o.Id != c.Id && o.Students.Equals(c.Students));
    foreach (var d in duplicates)
      sb.Append(d.LongName).Append(" is a duplicate of ").Append(c.LongName).Append("<br />");
  }
}
lblResult.Text = sb.ToString();

This is no good because I get the error:

NotSupportedException: Unable to create a constant value of type 'TeachEDM.Student'. Only primitive types ('such as Int32, String, and Guid') are supported in this context.

Evidently it doesn't like me trying to match o.SubjectId == c.SubjectId in LINQ.

Also, this seems a horrible method in general and is very slow. The call to the database takes more than 5 minutes.

I'd really appreciate some advice.

Community
  • 1
  • 1
James
  • 7,343
  • 9
  • 46
  • 82
  • If anyone's interested, you can follow this question on Code Review http://codereview.stackexchange.com/q/5396/5944 – James Oct 16 '11 at 09:29

1 Answers1

4

The comparison of the SubjectId is not the problem because c.SubjectId is a value of a primitive type (int, I guess). The exception complains about Equals(c.Students). c.Students is a constant (with respect to the query duplicates) but not a primitive type.

I would also try to do the comparison in memory and not in the database. You are loading the whole data into memory anyway when you start your first foreach loop: It executes the query allClasses. Then inside of the loop you extend the IQueryable allClasses to the IQueryable duplicates which gets executed then in the inner foreach loop. This is one database query per element of your outer loop! This could explain the poor performance of the code.

So I would try to perform the content of the first foreach in memory. For the comparison of the Students list it is necessary to compare element by element, not the references to the Students collections because they are for sure different.

var sb = new StringBuilder();
using (var ctx = new Ctx())
{
    ctx.CommandTimeout = 10000; // Perhaps not necessary anymore
    var allClasses = ctx.Classes.Include("Students").OrderBy(o => o.Id)
        .ToList(); // executes query, allClasses is now a List, not an IQueryable

    // everything from here runs in memory
    foreach (var c in allClasses)
    {
        var duplicates = allClasses.Where(
           o => o.SubjectId == c.SubjectId &&
           o.Id != c.Id &&
           o.Students.OrderBy(s => s.Name).Select(s => s.Name)
            .SequenceEqual(c.Students.OrderBy(s => s.Name).Select(s => s.Name)));

        // duplicates is an IEnumerable, not an IQueryable
        foreach (var d in duplicates)
            sb.Append(d.LongName)
              .Append(" is a duplicate of ")
              .Append(c.LongName)
              .Append("<br />");
    }
}
lblResult.Text = sb.ToString();

Ordering the sequences by name is necessary because, I believe, SequenceEqual compares length of the sequence and then element 0 with element 0, then element 1 with element 1 and so on.


Edit To your comment that the first query is still slow.

If you have 1300 classes with 30 students each the performance of eager loading (Include) could suffer from the multiplication of data which are transfered between database and client. This is explained here: How many Include I can use on ObjectSet in EntityFramework to retain performance? . The query is complex because it needs a JOIN between classes and students and object materialization is complex as well because EF must filter out the duplicated data when the objects are created.

An alternative approach is to load only the classes without the students in the first query and then load the students one by one inside of a loop explicitely. It would look like this:

var sb = new StringBuilder();
using (var ctx = new Ctx())
{
    ctx.CommandTimeout = 10000; // Perhaps not necessary anymore
    var allClasses = ctx.Classes.OrderBy(o => o.Id).ToList(); // <- No Include!
    foreach (var c in allClasses)
    {
        // "Explicite loading": This is a new roundtrip to the DB
        ctx.LoadProperty(c, "Students");
    }

    foreach (var c in allClasses)
    {
        // ... same code as above
    }
}
lblResult.Text = sb.ToString();

You would have 1 + 1300 database queries in this example instead of only one, but you won't have the data multiplication which occurs with eager loading and the queries are simpler (no JOIN between classes and students).

Explicite loading is explained here:

If you work with Lazy Loading the first foreach with LoadProperty would not be necessary as the Students collections will be loaded the first time you access it. It should result in the same 1300 additional queries like explicite loading.

Community
  • 1
  • 1
Slauma
  • 175,098
  • 59
  • 401
  • 420
  • Thank you! Your answer works. The first statement is still very slow, but I guess there's just a lot of data to retrieve (1300 classes * 30 students). Though I am running this through MySql on localhost, so I'm puzzled that it takes so long. I'm going to post my new code (with your corrections) on Code Review for tweaking, I hope that's okay. Thanks again. – James Oct 16 '11 at 07:35
  • @James: I've added an Edit section to my answer about the poor performance. If you should try the alternative approach ("explicite loading") let me know the result, I'm very very interested in it because after several other questions I've seen here on SO I have the suspicion that eager loading is much more often a bad choice than one might think, especially in situations like yours with relatively long child collections. – Slauma Oct 16 '11 at 13:00
  • 1
    Fantastic! The original code took 167 sec, and your improved version takes 3.4 sec. I'm very impressed. It looks like there's not much of a performance hit for multiple database queries, which is good because I now have to add a whole load more to update references (I want to merge the duplicates by updating all foreign keys to point to the version we're keeping). – James Oct 16 '11 at 13:16
  • @James: WOW! That's an extreme improvement! I must bookmark your question for future reference. People are often scared of multiple DB queries and try to avoid it at all costs. But this shows that one must be actually more scared of eager loading. – Slauma Oct 16 '11 at 13:40
  • @James: By the way: Do you have an index on the foreign key column which refers from the Students table to Classes table (`ClassId` perhaps) in your `Students` table? For those 1300 queries it might have an additional benefit as these queries basically look up by this FK column. SQL Server does not create indices automatically on FK columns ( http://stackoverflow.com/questions/507179/does-foreign-key-improve-query-performance/507197#507197 ). I don't know if MySQL does it. – Slauma Oct 16 '11 at 13:54
  • Yes, MySql has already created indices for `StudentId` and `ClassId` in the `StudentInClasses` table. Thanks again for your help. – James Oct 16 '11 at 15:13