
Got a kind of edge case issue here. I've been tasked with pulling all data from one database to another, where the destination database has a different schema.

I've chosen to write a WinForms utility to do the data mapping and transfer with Entity Framework/ADO.NET when necessary.

This has worked great so far, except for one specific table that has 2.5 million records. When I disregard all foreign keys, the transfer takes about 10 minutes total. However, when I start mapping foreign keys with `FirstOrDefault()` calls against in-memory lists of data that have already been moved to the destination database, it adds quite literally 4 days to the overall run time.

I'm going to need to run this tool a lot over the coming days so this isn't really acceptable for me.

Here's my current approach (not my first approach; this is the result of much trial and error for efficiency's sake):

private OldModelContext _oldModelContext { get; } //instantiated in controller

using (var newModelContext = new NewModelContext())
    {
        //Takes no time at all to load these into memory, collections are small, 3 - 20 records each
        var alreadyMigratedTable1 = newModelContext.alreadyMigratedTable1.ToList();
        var alreadyMigratedTable2 = newModelContext.alreadyMigratedTable2.ToList();
        var alreadyMigratedTable3 = newModelContext.alreadyMigratedTable3.ToList();
        var alreadyMigratedTable4 = newModelContext.alreadyMigratedTable4.ToList();
        var alreadyMigratedTable5 = newModelContext.alreadyMigratedTable5.ToList();

        var oldDatasetInMemory = _oldModelContext.MasterData.AsNoTracking().ToList();//2.5 Million records, takes about 6 minutes 

        var table = new DataTable("MasterData");
        table.Columns.Add("Column1");
        table.Columns.Add("Column2");
        table.Columns.Add("Column3");
        table.Columns.Add("ForeignKeyColumn1");
        table.Columns.Add("ForeignKeyColumn2");
        table.Columns.Add("ForeignKeyColumn3");
        table.Columns.Add("ForeignKeyColumn4");
        table.Columns.Add("ForeignKeyColumn5");

        foreach (var masterData in oldDatasetInMemory)
        {
            DataRow row = table.NewRow();

            //With just these properties mapped, this takes about 2 minutes for all 2.5 Million
            row["Column1"] = masterData.Property1;
            row["Column2"] = masterData.Property2;
            row["Column3"] = masterData.Property3;

            //With this mapping, we add about 4 days to the overall process.
            row["ForeignKeyColumn1"] = alreadyMigratedTable1.FirstOrDefault(s => s.uniquePropertyOnNewDataset == masterData.uniquePropertyOnOldDataset);
            row["ForeignKeyColumn2"] = alreadyMigratedTable2.FirstOrDefault(s => s.uniquePropertyOnNewDataset == masterData.uniquePropertyOnOldDataset);
            row["ForeignKeyColumn3"] = alreadyMigratedTable3.FirstOrDefault(s => s.uniquePropertyOnNewDataset == masterData.uniquePropertyOnOldDataset);
            row["ForeignKeyColumn4"] = alreadyMigratedTable4.FirstOrDefault(s => s.uniquePropertyOnNewDataset == masterData.uniquePropertyOnOldDataset);
            row["ForeignKeyColumn5"] = alreadyMigratedTable5.FirstOrDefault(s => s.uniquePropertyOnNewDataset == masterData.uniquePropertyOnOldDataset);

            table.Rows.Add(row);
        }   

        //Saving the table with SqlBulkCopy is very fast, about a minute and a half.
    }

Note: uniquePropertyOn(New/Old)Dataset is most often a unique description string shared among the datasets; I can't match on IDs as they won't be the same across databases.

I have tried:

  1. Replacing the foreach with a LINQ `Select` projection; not much improvement was had.
  2. Using `.Where(predicate).FirstOrDefault()`; didn't see any considerable improvement.
  3. Running `FirstOrDefault()` against an `IQueryable` instead of the lists of migrated data; didn't see any improvement.
  4. Mapping to a `List` instead of a `DataTable`; that makes no difference to the mapping speed, and also makes the bulk save slower.

I've been messing around with the idea of turning the foreach into a `Parallel.ForEach` loop and locking the calls to the `DataTable`, but I keep running into Entity Framework "connection closed" issues when querying the in-memory lists from inside the parallel loop. Not really sure what that's about, but initially the speed results were promising.

I'd be happy to post that code/errors if anyone thinks it's the right road to go down, but I'm not sure anymore.
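For reference, the shape of that parallel attempt is roughly the following (a sketch with stand-in types, not my actual entities; the point is that the lookup data must be plain in-memory collections, since any lazy-loaded navigation property would make EF hit the context from worker threads):

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

static class ParallelSketch
{
    // source and lookup are fully materialized in memory before the loop runs,
    // so no EF context is touched from the worker threads.
    public static List<string> Map(List<string> source, Dictionary<string, int> lookup)
    {
        // ConcurrentBag is safe for concurrent adds; results come back unordered.
        var results = new ConcurrentBag<string>();

        Parallel.ForEach(source, item =>
        {
            lookup.TryGetValue(item, out var id); // purely in-memory lookup
            results.Add($"{item}:{id}");
        });

        return new List<string>(results);
    }
}
```

Note that `DataTable` itself is not thread-safe for writes, which is why this sketch collects into a `ConcurrentBag` rather than locking the table on every row.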

affanBajwa
Halter
  • If it's really a "unique" property, why not use a Dictionary, where access by key is basically `O(1)` instead of a List, where access is `O(n)`? So instead of `ToList()` use `ToDictionary(item => item.UniqueProperty)`. – Corak Dec 06 '18 at 09:30
  • @Corak that... may actually work, shit, I never thought about that – Halter Dec 06 '18 at 09:33

2 Answers


The first thing I'd try is a dictionary, and pre-fetching the columns:

var fk1 = table.Columns["ForeignKeyColumn1"];

// ...

var alreadyMigratedTable1 = newModelContext.alreadyMigratedTable1.ToDictionary(
    x => x.uniquePropertyOnNewDataset);

// ...

if (alreadyMigratedTable1.TryGetValue(masterData.uniquePropertyOnOldDataset, out var val))
    row[fk1] = val;

However, in reality: I'd also try to avoid the entire DataTable piece unless it is really, really necessary.
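Put together, a minimal self-contained sketch of that approach (the `Migrated` and `Master` types here are stand-ins for the question's entities):

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;

class Migrated { public int Id; public string UniqueProperty; }
class Master { public string Property1; public string UniqueProperty; }

static class DictionaryMapping
{
    public static DataTable Map(List<Master> oldData, List<Migrated> migrated)
    {
        var table = new DataTable("MasterData");
        table.Columns.Add("Column1");
        table.Columns.Add("ForeignKeyColumn1");

        // O(1) lookups per row instead of an O(n) list scan per row.
        var byUnique = migrated.ToDictionary(m => m.UniqueProperty);

        // Pre-fetch the DataColumn objects once; indexing by string
        // re-resolves the column name on every assignment.
        var col1 = table.Columns["Column1"];
        var fk1 = table.Columns["ForeignKeyColumn1"];

        foreach (var master in oldData)
        {
            var row = table.NewRow();
            row[col1] = master.Property1;
            if (byUnique.TryGetValue(master.UniqueProperty, out var match))
                row[fk1] = match.Id; // unmatched rows keep DBNull
            table.Rows.Add(row);
        }
        return table;
    }
}
```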

Marc Gravell
  • So, dictionaries *are* faster than the lists, but still entirely too slow to iterate the 2.5 Million result set. I'm starting to wonder if something is wrong with my project itself... – Halter Dec 06 '18 at 10:24
  • @Halter perhaps... for reference: I work on 16M+ row in-memory chunks (I maintain the "tag engine" that we use here at Stack Overflow), and most of my large rebuild operations take in the order of hundreds of milliseconds (so: <1s). I do not, however, use `DataTable` at all. It is virtually impossible to guess where it will be snagging – Marc Gravell Dec 06 '18 at 10:33
  • Understood. I'm using the DataTable for SqlBulkCopy as I couldn't find a good way to speed up bulk inserts in EF without throwing money at a 3rd-party library. I'll try the dictionaries again with mapping to `newModelContext.MasterData` instead of the `DataTable` and see how that goes – Halter Dec 06 '18 at 10:40
  • @Halter tip: FastMember has an `ObjectReader` that works with `SqlBulkCopy`; see the example at the bottom here: https://github.com/mgravell/fast-member (and it is free) – Marc Gravell Dec 06 '18 at 10:42
  • Just tried the change to EF with dictionaries; it clocked in at exactly the same amount of time as Rango's solution: `Time elapsed: 00:02:02.6280037`. Really just at a loss for words here. – Halter Dec 06 '18 at 11:06

If there is really no other way to migrate this data than to load everything into memory, you can make it more efficient by avoiding the nested loops and linking the lists with a LINQ `Join`.

Read: Why is LINQ JOIN so much faster than linking with WHERE?

var newData =
    from master in oldDatasetInMemory
    join t1 in alreadyMigratedTable1
        on master.uniquePropertyOnOldDataset equals t1.uniquePropertyOnNewDataset into t1Group
    from join1 in t1Group.Take(1).DefaultIfEmpty()
    join t2 in alreadyMigratedTable2
        on master.uniquePropertyOnOldDataset equals t2.uniquePropertyOnNewDataset into t2Group
    from join2 in t2Group.Take(1).DefaultIfEmpty()
    join t3 in alreadyMigratedTable3
        on master.uniquePropertyOnOldDataset equals t3.uniquePropertyOnNewDataset into t3Group
    from join3 in t3Group.Take(1).DefaultIfEmpty()
    join t4 in alreadyMigratedTable4
        on master.uniquePropertyOnOldDataset equals t4.uniquePropertyOnNewDataset into t4Group
    from join4 in t4Group.Take(1).DefaultIfEmpty()
    join t5 in alreadyMigratedTable5
        on master.uniquePropertyOnOldDataset equals t5.uniquePropertyOnNewDataset into t5Group
    from join5 in t5Group.Take(1).DefaultIfEmpty()
    select new { master, join1, join2, join3, join4, join5};

foreach (var x in newData)
{
    DataRow row = table.Rows.Add();
    row["Column1"] = x.master.Property1;
    row["Column2"] = x.master.Property2;
    row["Column3"] = x.master.Property3;
    row["ForeignKeyColumn1"] = x.join1;
    row["ForeignKeyColumn2"] = x.join2;
    row["ForeignKeyColumn3"] = x.join3;
    row["ForeignKeyColumn4"] = x.join4;
    row["ForeignKeyColumn5"] = x.join5;
}

This is a LINQ Left-Outer-Join which takes only one row from the right side.
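The join shape can be seen in isolation on toy data (the strings here are stand-ins for the unique description values):

```csharp
using System;
using System.Linq;

static class LeftJoinDemo
{
    public static string[] Run()
    {
        var masters = new[] { "apple", "pear" };
        var lookup = new[] { "apple" }; // "pear" has no match

        var joined =
            from m in masters
            join l in lookup on m equals l into grp
            // left outer join, at most one row from the right side
            from match in grp.Take(1).DefaultIfEmpty()
            select $"{m}:{match ?? "<null>"}";

        return joined.ToArray();
    }
}
```

`GroupJoin` hashes the right side once, so each outer row costs a single lookup instead of a scan, which is where the speedup over per-row `FirstOrDefault()` comes from.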

Tim Schmelter
  • I think this may be the best solution; the only issue I'm having with it is that accessing `newData.master.AnyProperty` is *slow*, just as slow as the other methods. It seems to trade the speed lost in the queries for speed lost accessing that nested object? – Halter Dec 06 '18 at 10:23
  • @Halter: `newData.master.AnyProperty` should not compile because `newData` is a query and not a single object. Have you used the `foreach` loop that I have posted? – Tim Schmelter Dec 06 '18 at 10:34
  • For a little more perspective: the foreach loop (just mapping properties, not foreign keys) off of `masterData` (not `newData`) takes `Time elapsed: 00:00:04.0134174`. The foreach loop with `newData` and the properties mapped off of `x.master` takes `Time elapsed: 00:02:02.6280037`. Measured with `Stopwatch`. – Halter Dec 06 '18 at 10:36
  • Yes sorry, I was using "`AnyProperty`" generically. – Halter Dec 06 '18 at 10:38