5

I'm currently trying to bulk insert a DataTable into a database. It works fine and fast. The only problem occurs if there are rows that are already in the database (duplicate key).

To counter this I have modified my program so that, for each new entry, I first check whether it already exists in the database. That is slow (in the current cases I don't have many entries, but later on it's over 200k entries that I need to check, and that several times). Thus I need to make it faster than it is now (if possible).

The datatable is structured this way:

DataTable transactionTable = new DataTable();
transactionTable.Columns.Add("DeviceId", typeof(Int32));
transactionTable.Columns.Add("LogDate", typeof(DateTime));
transactionTable.Columns.Add("LogType", typeof(Int32));
transactionTable.Columns.Add("LogText", typeof(String));

transactionTable.PrimaryKey = new DataColumn[3] {
    transactionTable.Columns[0],
    transactionTable.Columns[1],
    transactionTable.Columns[2]
};

What I have so far is the following:

DataTable insertTable = transactionTable.Copy();
insertTable.Clear();
using (SqlConnection sqlCon = new SqlConnection(this.GetConnString()))
{
    sqlCon.Open();
    foreach (var entry in transactionTable.AsEnumerable())
    {
        using (SqlCommand sqlCom = sqlCon.CreateCommand())
        {
            sqlCom.Parameters.Clear();
            sqlCom.CommandText = "SELECT 1 FROM myTable WHERE"
                    + " DeviceId = @DeviceId AND LogDate = @LogDate"
                    + " AND LogType = @LogType"
            sqlCom.Parameters.AddWithValue("@DeviceId", entry.Field<Int32>("DeviceId"));
            sqlCom.Parameters.AddWithValue("@LogDate", entry.Field<DateTime>("LogDate"));
            sqlCom.Parameters.AddWithValue("@LogType", entry.Field<Int32>("LogType"));

            using (SqlDataReader myRead = sqlCom.ExecuteReader())
            {
                myRead.Read();

                if (myRead.HasRows == false)
                {
                    insertTable.Rows.Add(entry.ItemArray);
                }
            }

        }
    }
}

// And afterwards the bulk insert, which I think is out of scope for the question itself
// (I use the insertTable there)

Now my question is: is there any way to do this faster so that I don't run into the key violation problem?

Thomas
    Have a look at `SQL Merge` (https://msdn.microsoft.com/en-GB/library/bb510625.aspx) – James Brierley Oct 20 '15 at 16:12
  • In case your database is on another system/server, reducing the number of calls to the database in the iteration may help you boost performance on slow network connections, i.e. fetch related records from the database in one go and compare the disconnected data – haraman Oct 21 '15 at 13:12
  • @haraman As it's log entries that just have a date and a type, and I can't say that this type comes after that one, there is nothing to go by in regards to related data. Thus sadly nothing in that direction that I see – Thomas Oct 21 '15 at 13:44
  • @JamesBrierley What do you mean there with SQL merge? (My problem is that I have a DataTable that I put into the database with a bulk copy, and that bulk copy fails automatically if it runs into duplicate keys. So I'm not sure how a SQL merge statement could help there?) – Thomas Oct 21 '15 at 13:46
  • I was thinking you could use a solution similar to http://stackoverflow.com/a/9649040/5111146, but I don't know much about sql merge myself – James Brierley Oct 21 '15 at 14:13
  • @Thomas You are comparing on three columns `DeviceId`, `LogDate` and `LogType` using `AND`. Select any one column where you expect least number of records in existing data such as only `DeviceId`. Bring in one go, only those `DeviceId` records from server to client which are in `insertTable`. Then on client you can make comparison with all three columns and take required action. – haraman Oct 21 '15 at 14:26
  • @haraman Problem is that the device id alone covers about 100-200k entries. Per day it's about 5k-20k entries, and although most parsings are for full days, it can be that I'm parsing a bit over a day even. So even if I'm getting those 5-20k, it's then still a DataTable / list check of 5-20k. – Thomas Oct 22 '15 at 05:55
  • And in the worst case I'm processing the whole data at once (thus the initial data set contains enough data and dates that all 100-200k entries would be eligible). – Thomas Oct 22 '15 at 06:06
  • @Thomas, are you using MS SQL Server? Which version? Please add corresponding tags to the question instead of `database`. – Vladimir Baranov Oct 22 '15 at 23:17
  • @VladimirBaranov I'm not using the database tag. I'm using the datatable tag (which is a data type in C#) but yes can add the mssql server version also – Thomas Oct 23 '15 at 05:31
  • @Thomas, sorry, I misread the `datatable` tag. I added a tag `sql-server` instead of `optimization`. I think that with this tag more people will see the question. – Vladimir Baranov Oct 23 '15 at 06:20

3 Answers

2

In this case I would use a staging table. Here are the steps:

  1. Bulk insert into the staging table (using SqlBulkCopy)
  2. Insert into the base table using a stored proc with a left join to eliminate existing rows
  3. Truncate the staging table

So you will need to delete the foreach statement in your code, add a stored proc for inserting into the base table, and add a stored proc for truncating. Or you can combine the last two steps into one.
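
A minimal C# sketch of this staging approach, assuming a staging table dbo.MyTable_Staging with the same columns as the destination table myTable (the staging table name is a placeholder, not from the original post):

// Minimal sketch, assuming a staging table dbo.MyTable_Staging with the same
// schema as the destination table; adjust names to your own schema.
using (SqlConnection sqlCon = new SqlConnection(this.GetConnString()))
{
    sqlCon.Open();

    // 1. Bulk insert the DataTable into the staging table
    using (SqlBulkCopy bulkCopy = new SqlBulkCopy(sqlCon))
    {
        bulkCopy.DestinationTableName = "dbo.MyTable_Staging";
        bulkCopy.WriteToServer(transactionTable);
    }

    // 2. + 3. Insert only the rows that do not yet exist in the base table,
    //         then empty the staging table (shown inline here; the same SQL
    //         could live in stored procedures as suggested above)
    string sql =
        "INSERT INTO myTable (DeviceId, LogDate, LogType, LogText) " +
        "SELECT s.DeviceId, s.LogDate, s.LogType, s.LogText " +
        "FROM dbo.MyTable_Staging AS s " +
        "LEFT JOIN myTable AS t " +
        "  ON t.DeviceId = s.DeviceId AND t.LogDate = s.LogDate AND t.LogType = s.LogType " +
        "WHERE t.DeviceId IS NULL; " +
        "TRUNCATE TABLE dbo.MyTable_Staging;";

    using (SqlCommand sqlCom = new SqlCommand(sql, sqlCon))
    {
        sqlCom.ExecuteNonQuery();
    }
}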

Giorgi Nakeuri
2

I have a similar set up.

I'm using a stored procedure with a table-valued parameter and a MERGE statement. See also Table-Valued Parameters for an example of how to use them in .NET.

I would shift the focus of the problem from simple bulk insert to merging a batch of rows into a table with existing data.

Destination table

CREATE TABLE [dbo].[MyTable](
    [DeviceId] [int] NOT NULL,
    [LogDate] [datetime] NOT NULL,
    [LogType] [int] NOT NULL,
    [LogText] [nvarchar](50) NOT NULL,
CONSTRAINT [PK_MyTable] PRIMARY KEY CLUSTERED 
(
    [DeviceId] ASC,
    [LogDate] ASC,
    [LogType] ASC
))

Create user-defined table type

CREATE TYPE [dbo].[MyTableType] AS TABLE(
    [DeviceId] [int] NOT NULL,
    [LogDate] [datetime] NOT NULL,
    [LogType] [int] NOT NULL,
    [LogText] [nvarchar](50) NOT NULL,
    PRIMARY KEY CLUSTERED 
(
    [DeviceId] ASC,
    [LogDate] ASC,
    [LogType] ASC
))

Test and measure whether specifying a PRIMARY KEY for the TYPE makes the overall process faster or slower.

Stored procedure with TVP

CREATE PROCEDURE [dbo].[MergeMyTable]
    @ParamRows dbo.MyTableType READONLY
AS
BEGIN
    -- SET NOCOUNT ON added to prevent extra result sets from
    -- interfering with SELECT statements.
    SET NOCOUNT ON;

    BEGIN TRANSACTION;
    BEGIN TRY

        MERGE INTO dbo.MyTable as Dest
        USING
        (
            SELECT
                TT.[DeviceId],
                TT.[LogDate],
                TT.[LogType],
                TT.[LogText]
            FROM
                @ParamRows AS TT
        ) AS Src
        ON 
            (Dest.[DeviceId] = Src.[DeviceId]) AND
            (Dest.[LogDate]  = Src.[LogDate]) AND
            (Dest.[LogType]  = Src.[LogType])
        WHEN MATCHED THEN 
        UPDATE SET 
            Dest.[LogText] = Src.[LogText]
        WHEN NOT MATCHED BY TARGET THEN 
        INSERT
            ([DeviceId]
            ,[LogDate]
            ,[LogType]
            ,[LogText])
        VALUES
            (Src.[DeviceId],
            Src.[LogDate],
            Src.[LogType],
            Src.[LogText]);

        COMMIT TRANSACTION;
    END TRY
    BEGIN CATCH
        ROLLBACK TRANSACTION;
    END CATCH;

END

Call this stored procedure passing it a batch of rows to merge. Test and measure how performance changes with the size of the batch. Try batches with 1K, 10K, 100K rows.
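
A minimal sketch of calling this stored procedure from C#, passing the existing transactionTable as the TVP (this assumes the DataTable's columns match dbo.MyTableType in order and type):

// Minimal sketch: pass the DataTable as a table-valued parameter to the
// stored procedure above. Assumes transactionTable's columns line up with
// dbo.MyTableType (DeviceId, LogDate, LogType, LogText).
using (SqlConnection sqlCon = new SqlConnection(this.GetConnString()))
using (SqlCommand sqlCom = new SqlCommand("dbo.MergeMyTable", sqlCon))
{
    sqlCom.CommandType = CommandType.StoredProcedure;

    SqlParameter tvpParam = sqlCom.Parameters.AddWithValue("@ParamRows", transactionTable);
    tvpParam.SqlDbType = SqlDbType.Structured;
    tvpParam.TypeName = "dbo.MyTableType";

    sqlCon.Open();
    sqlCom.ExecuteNonQuery();
}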

If you never want to update existing rows with new values, remove the WHEN MATCHED THEN part of the MERGE, it will work faster.

Vladimir Baranov
  • I'm quite interested in this method vs SqlBulkCopy performance. Also for fewer rows, as a possible standard way of multi-row inserts/updates for applications. Especially on tables that might have triggers. – Edwin Stoteler Oct 23 '15 at 06:52
  • For usage with C#, am I correct that I bulk copy into the user-defined table and then use the above procedure to merge the results into the normal table? – Thomas Oct 23 '15 at 06:59
  • @Thomas, no. There is no bulk copy here. See the example in MSDN of how to call such a stored procedure with a TVP. This link is in the answer: https://msdn.microsoft.com/en-us/library/bb675163(v=vs.110).aspx – Vladimir Baranov Oct 23 '15 at 07:02
  • Ah, so I can use a normal SqlCommand to execute the procedure and populate the whole thing in one go from a DataTable. Didn't know that (I always thought SqlBulkCopy was the only way there). Thanks – Thomas Oct 23 '15 at 07:05
  • @EdwinStoteler, the link to MSDN in the answer discusses [Table-Valued Parameters vs. BULK INSERT Operations](https://msdn.microsoft.com/en-AU/library/bb510489.aspx). I personally never used bulk copy. In my database stored procedures with TVP are fast enough. The largest table that I `INSERT` into using this approach is not too big - right now it has 120M rows for 7 days worth of data, so every day there are ~17M rows inserted. That's on average, on peak days it reaches 45M rows. This is a table for simple `INSERT`, not `MERGE` though. – Vladimir Baranov Oct 23 '15 at 07:13
  • TVP data is passed to SQL Server using the same internal insert bulk technique as SqlBulkCopy. The difference is the destination is an internal temp table managed by SQL Server that exposes the TVP data. Furthermore, passing the TVP as `IEnumerable` can stream TVP data without loading the entire source dataset into memory at once. – Dan Guzman Oct 24 '15 at 21:37
2

You can drop and recreate your primary key index with the IGNORE_DUP_KEY option set to ON. Something like this:

ALTER TABLE datatable
ADD CONSTRAINT PK_datatable 
PRIMARY KEY CLUSTERED (DeviceId,LogDate,LogType,LogText) 
WITH (IGNORE_DUP_KEY = ON)

What this option does is report a duplicate key error with a different severity and message when any duplicate inserts for the index are attempted. It will not allow duplicates to be entered, but it will continue to insert all the records that are not duplicates and only give a warning message if duplicates were found and ignored.

More info at this link: Creating Unique Indexes.
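
With the IGNORE_DUP_KEY option in place, a minimal sketch of the bulk copy itself (table name assumed to be the myTable from the question) is simply:

// Minimal sketch: with IGNORE_DUP_KEY = ON on the primary key, duplicate rows
// are skipped with a warning instead of failing the batch, so the DataTable
// can be bulk copied straight into the destination table without pre-checking.
using (SqlConnection sqlCon = new SqlConnection(this.GetConnString()))
{
    sqlCon.Open();
    using (SqlBulkCopy bulkCopy = new SqlBulkCopy(sqlCon))
    {
        bulkCopy.DestinationTableName = "myTable";
        bulkCopy.WriteToServer(transactionTable);
    }
}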

Brian Pressler
  • +1. I completely forgot about this option. With this option simple bulk insert may work faster than `MERGE`. The server still has to do pretty much the same as with `MERGE`, but these internal operations may be more optimized. It is worth trying and measuring the time. – Vladimir Baranov Oct 24 '15 at 10:09
  • Plus you can eliminate the need for a staging table... you can just bulk copy directly into the destination table. – Brian Pressler Oct 26 '15 at 14:58