
In this project, I need to call an external API based on time, so for one day I may need to call the API 24 times, one call per one-hour period. The API returns an XML file with 6 fields, and I need to insert this data into a table. On average, each hour produces about 20,000 rows of data.

The table has these 6 columns:

col1, col2, col3, col4, col5, col6

When all 6 columns are the same, we consider the rows to be duplicates, and we should not insert duplicates.
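(One way to let the database itself enforce this rule, sketched here on the assumption of SQL Server and a table actually named `CallDatas`, is a unique index over all six columns; with `IGNORE_DUP_KEY` the server silently discards duplicate inserts instead of requiring a lookup first. This only works if the combined key width fits within SQL Server's index key size limit.)

```sql
-- Hypothetical: enforce the six-column uniqueness rule at the database level.
-- With IGNORE_DUP_KEY = ON, inserts that would duplicate an existing row are
-- discarded with a warning rather than raising an error.
CREATE UNIQUE INDEX UX_CallDatas_AllCols
    ON dbo.CallDatas (col1, col2, col3, col4, col5, col6)
    WITH (IGNORE_DUP_KEY = ON);
```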

I'm using C# and Entity Framework for this:

foreach (XmlNode node in nodes)
{
    try
    {
        count++;

        CallData data = new CallData();
        ...
        // get all data and set in 'data'

        // check whether in database already                        
        var q = ctx.CallDatas.Where(x => x.col1 == data.col1
                    && x.col2 == data.col2
                    && x.col3 == data.col3
                    && x.col4 == data.col4
                    && x.col5 == data.col5
                    && x.col6 == data.col6
                ).Any();
        if (q)
        {
            // exists in database, skip
            // log info
        }
        else
        {
            string key = $"{data.col1}|{data.col2}|{data.col3}|{data.col4}|{data.col5}|{data.col6}";
            // check whether in current chunk already
            if (dic.ContainsKey(key))
            {
                // in current chunk, skip
                // log info
            }
            else
            {
                // insert
                ctx.CallDatas.Add(data);

                // update dic
                dic.Add(key, true);
            }
        }
    }
    catch (Exception ex)
    {
        // log error
    }
}
Logger.InfoFormat("Saving changes ...");
if (ctx.ChangeTracker.HasChanges())
{
    await ctx.SaveChangesAsync();
}
Logger.InfoFormat("Saving changes ... Done.");

The code works fine. However, we will need to run this code over the past several months of data. The issue is that the code runs slowly, since for each row it must check whether the row already exists.

Are there any suggestions to improve the performance?

Thanks

urlreader
  • Have you tried running the equivalent SQL, and see if that's the problem? Look into whether you can index those fields that are used in the query. – Zac Faragher Feb 22 '19 at 00:46
  • yes, these 6 columns have an index. – urlreader Feb 22 '19 at 01:02
  • Firstly you need to know where the problem is. Run SQL Server Profiler and have a look at how long the query and the inserts really take. – Antonín Lejsek Feb 22 '19 at 02:01
  • `string key = $"{data.col1}|{data.col2}|{data.col3}|{data.col4}|{data.col5}|{data.col6}"; // check whether in current chunk already if (dic.ContainsKey(key))` This logic should be **before** the database call. If there are two ways to skip (the expensive DB lookup way, or the cheap dictionary lookup way) do the **cheap** check first. – mjwills Feb 22 '19 at 04:19
  • Also consider doing a `bulk insert` into a staging table, and then copying from that table into the main table. Bulk insert is much faster than row by row inserts. – mjwills Feb 22 '19 at 05:00
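(Following the comment above about ordering the checks, a minimal sketch of that reordering, reusing the question's own `dic`, `ctx`, and `data` names, would build the key and consult the in-memory dictionary before ever touching the database:)

```csharp
// Sketch only: do the cheap in-memory check first, and query the database
// only for keys not already seen in this run.
string key = $"{data.col1}|{data.col2}|{data.col3}|{data.col4}|{data.col5}|{data.col6}";
if (dic.ContainsKey(key))
{
    // already handled in this chunk: skip without a database round trip
}
else if (ctx.CallDatas.Any(x => x.col1 == data.col1 && x.col2 == data.col2
         && x.col3 == data.col3 && x.col4 == data.col4
         && x.col5 == data.col5 && x.col6 == data.col6))
{
    dic.Add(key, true);   // remember it so later rows skip the query too
}
else
{
    ctx.CallDatas.Add(data);
    dic.Add(key, true);
}
```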

3 Answers


You don't show when the context is created or its life-cycle, so I'm inclined to point you at the indexes on the table. If those columns aren't covered by an index (or a primary key), you might see the performance issue there: the lookups become full table scans, which get progressively slower as the table grows. With that said, there are two separate ways to handle this:

The EF-native way: you can explicitly create a new context on each iteration (avoiding change tracking of all entries, which reduces progressive slowdown). Also, your save is async but your `Any` query is synchronous; using `AnyAsync` there as well might take some pressure off the current thread while it waits.

// Create your context scope closer to the data call; if the loop is
// long-running you could otherwise build up tracked changes in the cache,
// and this prevents that situation.
using (YourEntity ctx = new YourEntity())
{
    CallData data = new CallData();
    if (await ctx.CallDatas.Where(x => x.col1 == data.col1
        && x.col2 == data.col2
        && x.col3 == data.col3
        && x.col4 == data.col4
        && x.col5 == data.col5
        && x.col6 == data.col6
        ).AnyAsync()
        )
    { 
        // exists in database, skip
        // log info
    }
    else
    {
        string key = $"{data.col1}|{data.col2}|{data.col3}|{data.col4}|{data.col5}|{data.col6}";
        // check whether in current chunk already
        if (dic.ContainsKey(key))
        {
            // in current chunk, skip
            // log info
        }
        else
        {
            // insert
            ctx.CallDatas.Add(data);
            await ctx.SaveChangesAsync();
            // update dic
            dic.Add(key, true);
        }
    }
}

Optional way: look into inserting the data using a bulk operation via a stored procedure. 20k rows is trivial, and you can still use Entity Framework for that as well. See https://stackoverflow.com/a/9837927/1558178

I have created my own version of this (customized for my specific needs) and have found that it works well and gives more control over bulk inserts.

I have used this approach to insert 100k records at a time. Putting the duplicate check inside the stored procedure gives me better control and reduces the over-the-wire cost to zero reads and one write. It should take only a second or two to execute, assuming the stored procedure is optimized.
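(The stored procedure itself isn't shown here; purely as an illustration, assuming a staging table `CallDatas_Staging` with the same six columns that the rows are bulk-loaded into first, the duplicate-filtering copy step inside such a procedure might look like:)

```sql
-- Hypothetical procedure body: copy staged rows into the main table,
-- skipping any six-column combination that already exists there.
INSERT INTO dbo.CallDatas (col1, col2, col3, col4, col5, col6)
SELECT DISTINCT s.col1, s.col2, s.col3, s.col4, s.col5, s.col6
FROM dbo.CallDatas_Staging AS s
WHERE NOT EXISTS (
    SELECT 1
    FROM dbo.CallDatas AS c
    WHERE c.col1 = s.col1 AND c.col2 = s.col2 AND c.col3 = s.col3
      AND c.col4 = s.col4 AND c.col5 = s.col5 AND c.col6 = s.col6
);

TRUNCATE TABLE dbo.CallDatas_Staging;
```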

Gary Smith
  • It doesn't even have to be a stored procedure; you can execute arbitrary SQL in EF. See https://learn.microsoft.com/en-us/ef/ef6/querying/raw-sql – Robert Harvey Feb 22 '19 at 01:21
  • Understood, but I haven't seen a working model of table value parameter in a raw SQL scenario. I'm also not a fan of hard coded SQL in the code (so I tend to only look at stored procedures -- so it might exist). – Gary Smith Feb 22 '19 at 01:27

Different approach:

Save all rows, including duplicates; this should be very efficient.

When you read data from the table, use DISTINCT across all fields.
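(For illustration, reusing the question's table and column names, the read side would then be something like:)

```sql
-- Deduplicate at read time instead of at insert time.
SELECT DISTINCT col1, col2, col3, col4, col5, col6
FROM dbo.CallDatas;
```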

Fabio

For raw, bulk operations like this I would consider avoiding EF entities and context tracking and merely executing SQL through the context:

// NB: in production, pass these as parameters rather than interpolating them,
// to avoid SQL injection and quoting problems.
var sql = $"IF NOT EXISTS(SELECT 1 FROM CallDates WHERE Col1={data.Col1} AND Col2={data.Col2} AND Col3={data.Col3} AND Col4={data.Col4} AND Col5={data.Col5} AND Col6={data.Col6}) INSERT INTO CallDates(Col1,Col2,Col3,Col4,Col5,Col6) VALUES ({data.Col1},{data.Col2},{data.Col3},{data.Col4},{data.Col5},{data.Col6})";
context.Database.ExecuteSqlCommand(sql);

This does away with the extra checks and logging: effectively just raw SQL with duplicate detection.

Steve Py