I have a database with 2,200,000 records in it. I do some work in a C# program that re-calculates the data for these 2,200,000 records, and I want to insert the results without creating any duplicates (i.e. if it generated exactly the same data for a record, do nothing). There are 9 fields that I check to see if they are the same, and if there is no difference (identical data) I don't do an insert.

I'm trying to think of a speedy way to do this. All I can think of is having a unique constraint across all 9 fields, but I feel this might be the wrong approach.

Previously I loaded the entire lot into memory and did the comparison there, but now that it has gone up to 2 million+ records this has become impractical.

Any thoughts?

Bit more detail:

A service runs that collects some scientific data. If the data meets certain criteria (the details aren't important here), it inserts a warning record into the database. Around 2 million warnings are generated the first time the program runs.

The second time it runs, it collects the same data, but I only want records to be inserted if they differ from the existing records. So I must compare the previous 2 million warnings with the 2 million new warnings. If they are the same, nothing happens. If they are different in any way, they are inserted as new warnings.

A scientist checks every warning one by one and flags it as important or not important, so they don't care about duplicate warnings that are exactly the same.

The data has 9 fields, some strings and some integers, and if any field changes at all, a new warning must be inserted.

Questions:

Currently, the warnings are calculated one by one in a loop, then afterwards, inserted into the database in another loop.

A hash sounds like it might work. How can I create a hash of all the fields? Is this a manual operation in C#, or can I get SQL to do it automatically? (I will have to generate the same hash in code to stop duplicates, I think, so I need to be able to create it in C# too.)

Yes, if the previous version is identical to the current version then do not insert. E.g. if I have the data "1, 125, abcdef, 33.4, chocolate" and processing a record gives me "1, 125, abcdef, 33.4, chocolate" again, do not insert; if I get "1, 125, abcdef, 33.4, melon", then insert.
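
A minimal sketch of how such a hash could be built in C#, assuming SHA-256 over the fields joined with a delimiter that cannot appear in the data, which avoids the concatenation-collision worry raised in the comments (`WarningHasher` and the choice of SHA-256 are illustrative, not something prescribed by the thread):

```csharp
using System;
using System.Globalization;
using System.Security.Cryptography;
using System.Text;

static class WarningHasher
{
    // One hash over all fields. The ASCII unit separator (\u001F) cannot
    // occur in normal data, so field boundaries stay unambiguous and
    // ("1", "15") can never hash the same as ("11", "5").
    public static string ComputeHash(params object[] fields)
    {
        string[] parts = Array.ConvertAll(fields,
            f => Convert.ToString(f, CultureInfo.InvariantCulture) ?? "");
        string joined = string.Join("\u001F", parts);

        using (var sha = SHA256.Create())
        {
            byte[] hash = sha.ComputeHash(Encoding.UTF8.GetBytes(joined));
            return Convert.ToBase64String(hash);  // 44 chars, fits nvarchar(400)
        }
    }
}
```

Calling `WarningHasher.ComputeHash(1, 125, "abcdef", 33.4, "chocolate")` twice yields the same string, so the stored value can be compared on the next run without re-reading the other fields.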

NibblyPig
  • Create a hash for the 9 fields? – Feb 02 '15 at 11:15
  • Do all of the records have to be recalculated at the same time, or can they be processed one at a time, or in batches? – BateTech Feb 02 '15 at 11:16
  • You mean to say if previous version and calculated version is same then do not insert? – Amit Feb 02 '15 at 11:16
  • Have answered the questions, thank you - maybe a hash sounds like a good idea, but I'm not sure how to implement it. – NibblyPig Feb 02 '15 at 11:21
  • Also I agree with Tim, in one place you say if they are the same do not insert, and then you say " and if there is any difference I don't do an insert". Can you clarify? I think you meant if the same do not insert and if there is any difference then insert. – BateTech Feb 02 '15 at 11:22
  • Sorry, my mistake, I have clarified it :) – NibblyPig Feb 02 '15 at 11:23

1 Answer

Use hashes.

I had a similar issue while implementing "Delta Imports" for a data warehouse (DWH) at a customer. The easiest and fastest way is to compare the hashes of both data sets. Each record has a uid (uniqueidentifier) column and a hash (nvarchar(400)) column. The hash is calculated just before the record is inserted into the respective table in the database.

With each new import (i.e. each recalculation in your scenario), the hash is calculated again and compared to the existing hash (with the same uid). The record is skipped if both match and updated if they differ.
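
A minimal sketch of that comparison in C#, assuming the existing uid/hash pairs are loaded into a dictionary once and that each recalculated record arrives with its hash already computed (the `Warning` shape, `Warnings` table, and column names are illustrative assumptions):

```csharp
using System;
using System.Collections.Generic;
using System.Data.SqlClient;

// Hypothetical record shape: Uid identifies the source record,
// Hash is precomputed in C# from the 9 fields.
class Warning
{
    public Guid Uid;
    public string Hash;
}

static class DeltaCompare
{
    public static List<Warning> FindChanged(IEnumerable<Warning> recalculated,
                                            SqlConnection conn)
    {
        // Load only uid -> hash pairs once; far lighter than loading
        // all 9 fields for 2 million records, and avoids per-row queries.
        var existing = new Dictionary<Guid, string>();
        using (var cmd = new SqlCommand("SELECT Uid, Hash FROM Warnings", conn))
        using (var reader = cmd.ExecuteReader())
        {
            while (reader.Read())
                existing[reader.GetGuid(0)] = reader.GetString(1);
        }

        // Skip identical records; keep new or changed ones.
        var changed = new List<Warning>();
        foreach (var w in recalculated)
        {
            string oldHash;
            if (existing.TryGetValue(w.Uid, out oldHash) && oldHash == w.Hash)
                continue;       // hashes match: identical data, do nothing
            changed.Add(w);     // no match: treat as a new warning
        }
        return changed;
    }
}
```

Loading just the uid/hash pairs keeps the memory footprint to tens of megabytes for 2 million rows, which sidesteps the "entire lot in memory" problem from the question.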

Moslem Ben Dhaou
  • I like it, may I ask how you calculated the hash? Did you just concat all the columns and use md5? I wonder if there is a proper way, because maybe the data will be "1, 15, cats" or "11, 5, cats" which would md5 the same. – NibblyPig Feb 02 '15 at 11:21
  • I did it in SQL Server directly using a computed column which calls `BINARY_CHECKSUM` (https://msdn.microsoft.com/en-us/library/ms173784.aspx). It should be easy to call it from C# – Moslem Ben Dhaou Feb 02 '15 at 11:30
  • I would avoid md5 as it has a relatively higher collision probability. `BINARY_CHECKSUM` has a 1 in 4 billion collision chance – Moslem Ben Dhaou Feb 02 '15 at 11:31
  • I see, I am not sure how I can call BINARY_CHECKSUM in c#, as in pure c# code, if I have to make 2 million queries to SQL it would be very slow. Do you think it might be possible? – NibblyPig Feb 02 '15 at 11:47
  • Actually, it might be better to do all the hashing in C# and insert the hash with the data. I guess I just need a reliable way to take 9 pieces of data in c# and create a unique hash! – NibblyPig Feb 02 '15 at 11:49
  • Exactly. As I told you I used a computed column. In your case you will have to switch the hash generation to C# otherwise you will get a serious performance hit. – Moslem Ben Dhaou Feb 02 '15 at 15:48
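  
Tying the thread together, a hedged sketch of one way to avoid 2 million individual queries: compute the hashes in C# as above, bulk-copy the candidate rows into a staging table, and let a single set-based statement insert only the rows whose hash isn't already present. The `WarningsStaging` table, the `Warnings` columns, and the overall shape are illustrative assumptions, not code from the original exchange:

```csharp
using System.Data;
using System.Data.SqlClient;

static class BulkWarningInsert
{
    public static void Run(DataTable newWarnings, SqlConnection conn)
    {
        // 1. One bulk copy instead of 2 million INSERT round trips.
        using (var bulk = new SqlBulkCopy(conn)
                   { DestinationTableName = "WarningsStaging" })
            bulk.WriteToServer(newWarnings);

        // 2. One set-based INSERT keeps only rows whose hash is unseen.
        const string sql = @"
            INSERT INTO Warnings (Uid, Hash, Field1 /*, ... remaining fields */)
            SELECT s.Uid, s.Hash, s.Field1 /*, ... */
            FROM WarningsStaging s
            WHERE NOT EXISTS (SELECT 1 FROM Warnings w WHERE w.Hash = s.Hash);
            TRUNCATE TABLE WarningsStaging;";

        using (var cmd = new SqlCommand(sql, conn))
            cmd.ExecuteNonQuery();
    }
}
```

An index on the `Hash` column would keep the `NOT EXISTS` probe fast; and since any checksum can collide, a cautious version would confirm a match by comparing the 9 fields before skipping a record.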