
I have 2 databases. From my DB I'm taking a List&lt;Item&gt; (I'm getting it by Date; it can be up to 300 000 elements):

public class Item
{
    public string A { get; set; }
    public string B { get; set; }
    public string C { get; set; }
    public DateTime Date { get; set; }
}

In the other database (I don't control that DB; I can only read from it and can't change anything in it) I need to select a List&lt;OtherDbItem&gt;:

public class OtherDbItem
{
    public string X { get; set; }
    public string Y { get; set; }
    public string Z { get; set; }
    public string FewOtherProperties { get; set; }
}

X, Y and Z are the (composite) primary key. I need to select all OtherDbItems where Item.A = OtherDbItem.X and Item.B = OtherDbItem.Y and Item.C = OtherDbItem.Z (then map the OtherDbItems to my model and save them in my database).

I am using 2 different EF Core DbContexts to connect to the databases.

I tried:

        var otherDbItems = new List<OtherDbItem>();

        foreach (var item in Items)
        {
            var otherDbItem = await this.context.OtherDbItems.FindAsync(item.A, item.B, item.C);

            if (otherDbItem != null)
            {
                otherDbItems.Add(otherDbItem);
            }
        }

        return otherDbItems;

But there can be 300 000 Items, so that's 300 000 requests to the database; obviously it's not optimal and not acceptable.

I also tried:

        var ids = items.Select(item => item.A + item.B + item.C).ToList();
        var otherDbItems = await this.context.OtherDbItems.Where(otherDbItem => ids.Contains(otherDbItem.X + otherDbItem.Y + otherDbItem.Z)).ToListAsync();

But this results in a huge SQL query; it's slow and causes a connection timeout.

Is it possible to get the OtherDbItems fast and reliably? Do I have to fetch the items in parts, for example 1000 items per call with Skip/Take? If yes, how big should these parts be?

garret
  • It would make a great difference if you can add a computed, persisted, indexed field for the concatenated keys, and then use a chunked `Contains` method, like [this one](https://stackoverflow.com/a/24542812/861716). – Gert Arnold Jun 14 '21 at 18:07
  • Why do you think item.A + item.B + item.C is unique? '1'+'55'+'9' = '15'+'5'+'9' = '1'+'5'+'59' = '1559' – Antonín Lejsek Jun 14 '21 at 23:30
  • @AntonínLejsek they are strings, not integers. – Lawrence Johnson Jun 15 '21 at 01:55
  • On a different thought: what if you pull the data from the other DB and insert it into a new flat table (without any indexes) using bulk insert in the main DB, and then run the query over the new flat table and the required table directly with EF Core (or plain SQL)? You can use `Join` directly on both tables; a rough sketch of this idea follows these comments. – user1672994 Jun 15 '21 at 08:17
  • @LawrenceJohnson where exactly am I speaking about integers? – Antonín Lejsek Jun 15 '21 at 09:07
  • @AntonínLejsek They are unique; it's a real-life example, but I changed the names and omitted the validation rules for clarity (uniqueness is not a problem in my case, so I didn't mention it). – garret Jun 15 '21 at 09:24
  • @AntonínLejsek wrapping integers in quotes to make them a string would be a poor assumption as to why someone would use strings for keys. – Lawrence Johnson Jun 15 '21 at 16:17
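
For reference, a minimal sketch of the flat-table idea from the comment above, assuming the main database is SQL Server, that a staging table dbo.OtherDbItemsFlat already exists there, and that SqlBulkCopy from Microsoft.Data.SqlClient is available; the table and method names are placeholders, not part of the question:

using System.Data;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;
using Microsoft.EntityFrameworkCore;

// Copy the rows from the other DB into a flat staging table in the main DB,
// so the matching can be done with a single server-side join instead of
// sending 300 000 keys in one query. Table/method names are placeholders.
public static async Task CopyOtherDbItemsToFlatTableAsync(
    IQueryable<OtherDbItem> otherDbItems, string mainDbConnectionString)
{
    // Build an in-memory table matching the staging table's layout.
    var table = new DataTable();
    table.Columns.Add("X", typeof(string));
    table.Columns.Add("Y", typeof(string));
    table.Columns.Add("Z", typeof(string));
    table.Columns.Add("FewOtherProperties", typeof(string));

    // Stream the source rows without change tracking (this still buffers them
    // all in the DataTable, so it only pays off if the source table fits in memory).
    await foreach (var o in otherDbItems.AsNoTracking().AsAsyncEnumerable())
    {
        table.Rows.Add(o.X, o.Y, o.Z, o.FewOtherProperties);
    }

    await using var connection = new SqlConnection(mainDbConnectionString);
    await connection.OpenAsync();

    using var bulkCopy = new SqlBulkCopy(connection)
    {
        DestinationTableName = "dbo.OtherDbItemsFlat", // assumed staging table
        BatchSize = 10_000
    };
    await bulkCopy.WriteToServerAsync(table);

    // Afterwards the matches can be selected server-side, e.g.:
    // SELECT f.* FROM dbo.OtherDbItemsFlat f
    // JOIN dbo.Items i ON i.A = f.X AND i.B = f.Y AND i.C = f.Z
}

Whether this is worth it depends on how large the other table is, since it copies those rows across before joining.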

2 Answers


I can't say for sure that this is the best approach because I'm not an EF expert, but I had a similar scenario recently where I was dealing with a sync from an external JSON export into an EF Core database. Part of that operation was validating that existing EF Core entries, which grew based on the imported data, were still valid when the export changed. As the database grew towards a million or so records that had to be validated, we ran into timeouts and expensive queries.

The approach we ended up with, which actually improved the speed of even our original process, was to batch the operations. The one thing we did differently from a plain Take()/Skip() approach was that we batched on the input side: we took a collection of 1000 ids at a time and used that for the query before moving on to the next. With your code/data that might look something like this:

int chunkIndex = 0;
int batch = 1000;
var ids = items.Select(item => item.A + item.B + item.C).ToList();
var otherDbItems = new List<OtherDbItem>();
while (chunkIndex < ids.Count)
{
    // Take the next chunk of (at most) `batch` concatenated keys.
    var chunkIDs = ids.GetRange(chunkIndex,
        chunkIndex + batch >= ids.Count ? ids.Count - chunkIndex : batch);

    // Query against this chunk only and collect the matches.
    otherDbItems.AddRange(await this.context.OtherDbItems
        .Where(otherDbItem => chunkIDs.Contains(otherDbItem.X + otherDbItem.Y + otherDbItem.Z))
        .ToListAsync());

    chunkIndex += batch;
}

I think this makes your query less expensive, since it isn't running the entire thing and then limiting the result. Where your situation is slightly different is that your source is also a database, whereas ours was JSON content. You could probably further optimize this by using Skip()/Take() on the query of ids from your source Items table. The syntax on this might not be 100% right, but perhaps this gives the idea:

int chunkIndex = 0;
int batch = 1000;
// Update dbItemsContext.Items to your source context and table
int totalRecords = dbItemsContext.Items.Count();
var otherDbItems = new List<OtherDbItem>();
while (chunkIndex < totalRecords)
{
    // Page the concatenated keys out of the source table; Skip/Take needs an
    // OrderBy to page deterministically.
    var chunkIDs = dbItemsContext.Items
        .OrderBy(item => item.A).ThenBy(item => item.B).ThenBy(item => item.C)
        .Select(item => item.A + item.B + item.C)
        .Skip(chunkIndex)
        .Take(batch)
        .ToList();

    otherDbItems.AddRange(await this.context.OtherDbItems
        .Where(otherDbItem => chunkIDs.Contains(otherDbItem.X + otherDbItem.Y + otherDbItem.Z))
        .ToListAsync());

    chunkIndex += batch;
}

I hope that helps demonstrate our approach. I think that with this route you'd need to lock the tables to avoid changes until your operations are complete. I welcome any feedback, since it could improve our process as well. I'll also note that our application/context is not set up to run async, so you might need some additional modifications, or you could possibly even run these batches asynchronously for your use case.
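
As a rough illustration of that last point (not the answer's tested code): the batches can run concurrently, but a DbContext is not thread-safe, so each concurrent query needs its own context. "OtherDbContext" below stands for whatever context type exposes OtherDbItems (the question's this.context), and contextFactory is anything that creates a fresh instance per task:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;

// Sketch only. One DbContext per task; contextFactory could be an
// IDbContextFactory<OtherDbContext> registration in EF Core 5+, for example.
public static async Task<List<OtherDbItem>> QueryChunksConcurrentlyAsync(
    Func<OtherDbContext> contextFactory, List<string> ids, int batch = 1000)
{
    // Split the concatenated keys into chunks of (at most) `batch` elements.
    var chunks = new List<List<string>>();
    for (int i = 0; i < ids.Count; i += batch)
    {
        chunks.Add(ids.GetRange(i, Math.Min(batch, ids.Count - i)));
    }

    // Start one query per chunk. Note: this fires them all at once; in practice
    // you would probably throttle (e.g. with a SemaphoreSlim) so you don't open
    // hundreds of connections simultaneously.
    var tasks = chunks.Select(async chunkIDs =>
    {
        await using var context = contextFactory();
        return await context.OtherDbItems
            .Where(o => chunkIDs.Contains(o.X + o.Y + o.Z))
            .ToListAsync();
    });

    var results = await Task.WhenAll(tasks);
    return results.SelectMany(r => r).ToList();
}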

A final note on batch size: you may need to play with it a bit. Our query was quite a bit more complex, so 1000 seemed to be the sweet spot for us, but you may be able to take quite a bit more at a time. I'm not sure there's any way to determine the best batch size other than testing a few different sizes.
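
If it helps, a throwaway sketch (not from the answer) for comparing a few candidate batch sizes, assuming the same key concatenation as above and passing in the EF queryable (e.g. this.context.OtherDbItems):

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;

// Time one representative chunk query per candidate batch size and print the result.
public static async Task MeasureBatchSizesAsync(IQueryable<OtherDbItem> otherDbItems, List<string> ids)
{
    foreach (var batch in new[] { 500, 1000, 2000, 5000 })
    {
        var chunkIDs = ids.Take(batch).ToList();

        var stopwatch = Stopwatch.StartNew();
        var rows = await otherDbItems
            .Where(o => chunkIDs.Contains(o.X + o.Y + o.Z))
            .ToListAsync();
        stopwatch.Stop();

        Console.WriteLine($"Batch size {batch}: {rows.Count} rows in {stopwatch.ElapsedMilliseconds} ms");
    }
}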

Lawrence Johnson

OK, it was much easier than I thought. Both databases are on the same SQL Server, so it was a matter of a simple inner join.

I just added the properties from Item to OtherDbItem:

public class OtherDbItem
{
    public string A { get; set; }
    public string B { get; set; }
    public string C { get; set; }
    public DateTime Date { get; set; }
    public string X { get; set; }
    public string Y { get; set; }
    public string Z { get; set; }
    public string FewOtherProperties { get; set; }
}

And in OnModelCreating:

    protected override void OnModelCreating(ModelBuilder builder)
    {
        builder.Entity<OtherDbItem>(
            entity =>
            {
                // ToSqlQuery (EF Core 5.0+) maps the entity to this cross-database join;
                // it works here because both databases live on the same SQL Server instance.
                entity.ToSqlQuery(@"SELECT
    i.A,
    i.B,
    i.C,
    i.Date,
    o.X,
    o.Y,
    o.Z,
    o.FewOtherProperties
FROM [DB1].[dbo].[Items] i
INNER JOIN [DB2].[dbo].[OtherDbItem] o ON i.A = o.X AND i.B = o.Y AND i.C = o.Z");
                entity.HasKey(o => new { o.X, o.Y, o.Z });
            });
    }

And the last thing to do:

    // E.g. exposed as a method taking the target date:
    public Task<List<OtherDbItem>> GetOtherDbItemsAsync(DateTime date)
    {
        return this.context.OtherDbItems
            .Where(x => x.Date == date)
            .Distinct()
            .ToListAsync();
    }
garret