I have a database with 2,200,000 records in. I do some work in a c# program which re-calculates the data for these 2,200,000 records and I want to insert it but not have any duplicates (i.e. if it generated the identical same data for a record, don't do anything). There are 9 fields that I check to see if they are the same, and if there is no difference (identical data) I don't do an insert.
I'm trying to think of a speedy way to do this. All I can think of is having a unique constraint across all 9 fields, but I feel this might be the wrong approach.
Previously I loaded the entire lot into the memory and did the comparison there, but since it has gone up to 2 mil+ records this has become impractical.
Any thoughts?
Bit more detail:
A service runs that collects some scientific data. If the data meets some criteria (not important) then it inserts a warning record into the database. There are 2 million warnings generated the first time the program runs.
The second time it runs, it collects the same data. But I only want records to be inserted if they are different to existing records. So I must compared the previous 2 million warnings with the 2 million new warnings. If they are the same, nothing happens. If they are different in any way, they are inserted as a new warning.
A scientist checks every warning one by one and flags them as important or not important. So they don't care about duplicate warnings which are exactly the same.
The data has 9 fields, some string some integer, and if any field changes at all, it must insert a new warning.
Questions:
Currently, the warnings are calculated one by one in a loop, then afterwards, inserted into the database in another loop.
A hash sounds like it might work, how can I create a hash of all the fields? Is this a c# manual operation or can I get SQL to do it automatically? (I will have to generate the same hash in code to stop duplicates I think, so I need to be able to create it in c# too).
Yes, if previous version is identical to current version then do not insert, e.g. I have the data "1, 125, abcdef, 33.4, chocolate" if I process a record and get "1, 125, abcdef, 33.4, chocolate" do not insert, if I get "1, 125, abcdef, 33.4, melon" then insert.