1

Suppose I have the following table in my SQL Server (2012) database:

MyTable:

Col1:      Col2:     Col3:     Col4:    VersionNum:
--------------------------------------------------
Val11      Val21     Val31     Val41    1
Val12      Val22     Val32     Val42    1
Val13      Val23     Val33     Val43    1
                 ...

And I have the following data (say about 20000 records) that I'd like to merge with my current table:

New Data:

Col1:      Col2:     Col3:     Col4:
------------------------------------
Val11      Val21     Val31     Val41
Val12a     Val22     Val32     Val42

So, the first value is an exact match to the first row, whereas the second value has at least one element different

What I would like to have my table end up looking like is:

MyTable:

Col1:      Col2:     Col3:     Col4:    VersionNum:
---------------------------------------------------
Val11      Val21     Val31     Val41    1
Val12      Val22     Val32     Val42    1
Val13      Val23     Val33     Val43    1
Val12a     Val22     Val32     Val42    2

Or, putting it into words:

  1. If all data elements match, then don't add in a new row
  2. If any one or more data elements are different, then add in a new row with an updated version number.

I found this question that seems to deal with a similar such issue, but only for a primary key difference. What I'm wondering is what would be the best / most efficient way to do this given such a large dataset to use for the merge? Or, if there are any better patterns (an archive table or something similar, PLEASE do share - This is all very new to me and I'd like to do things as cleanly and efficiently as possible).

Community
  • 1
  • 1
John Bustos
  • 19,036
  • 17
  • 89
  • 151
  • I think you need to clarify the question. Your second criterion says: "If any one or more data elements are different, then add in a new row with an updated version number", but every new row will be different from almost every existing row. Presumably you don't want to generate 20,000 new versions of every existing row? Is there a primary key you're not mentioning? – Mike Christie Aug 01 '16 at 18:57
  • @MikeChristie, that's my challenge - What they're asking for is to add in a row if ANY data-element is different, but not add in any row is they all match... No primary key. – John Bustos Aug 01 '16 at 19:05
  • 1
    Do you want the version number to be 2 on every single new row? And then the next time you do it, all new rows are verson 3? – Mike Christie Aug 01 '16 at 19:17
  • What if a new row can be seen as a new version of multiple existing rows? Do you want to create a new version for each of the existing rows that qualify? Or only for the closest match? – tomislav_t Aug 01 '16 at 20:11
  • @MikeChristie, that would be exactly it - in all truth, I'd actually just timestamp the rows, but didn't want to get into that as there's people out there that would comment about the timestamp rather than on my actual question. – John Bustos Aug 01 '16 at 20:33
  • @tomislav_t, for the sake of this question, I truly just want to compare my data against the most recent (highest version) for the existing data. – John Bustos Aug 01 '16 at 20:34

2 Answers2

1

One awesome method is using MERGE(). You can clarify in MERGE() what to do if there is a match (update, insert, delete) etc to ensure you don't have duplicates and only update or insert the records you want.

https://msdn.microsoft.com/en-us/library/bb510625.aspx

https://technet.microsoft.com/en-us/library/bb522522(v=sql.105).aspx

And a great example:

https://www.simple-talk.com/sql/learn-sql-server/the-merge-statement-in-sql-server-2008/

S3S
  • 24,809
  • 5
  • 26
  • 45
  • SCSimon, thanks so much - I am actually currently using a MERGE statement to do this, but I have to join them on every column and it seems like this should be pretty inefficient. Any thoughts?? – John Bustos Aug 01 '16 at 19:06
  • 1
    @JohnBustos that's an understandable concern but it would all depend on your schema, resources, indexes, etc. Are you noticing lengthy updates? I haven't had any issues in my environment but I do run the jobs daily, which for me limits the SOURCE table records a lot. – S3S Aug 01 '16 at 19:10
  • Well, this sucks, SCSimon, I did the MERGE only to find that it had an issue with NULL values... I'm now looking into maybe doing it using something along these lines: http://dba.stackexchange.com/a/122059/41136 – John Bustos Aug 01 '16 at 20:35
  • 1
    Did you ISNULL all columns? – S3S Aug 01 '16 at 20:36
  • I didn't - It seems like it would make it THAT much even less efficient.... I'm wondering if I'm just trying to make a square peg fit into a round hole. – John Bustos Aug 01 '16 at 20:43
  • 1
    Not at all @JohnBustos. In fact, Microsoft assumingly created MERGE with that in mind. They even state "The conditional behavior described for the MERGE statement works best when the two tables have a complex mixture of matching characteristics. For example, inserting a row if it does not exist, or updating the row if it does match. When simply updating one table based on the rows of another table, improved performance and scalability can be achieved with basic INSERT, UPDATE, and DELETE statements." So it's designed to be used when complicating an insert or join statement would be slow – S3S Aug 01 '16 at 21:02
  • SCSimon, I actually already implemented it and it worked a freaking charm!!! - Fast, efficient (as can be expected) and did exactly what I wanted it to do!! YOU ARE AWESOME!!! - **THANK YOU** – John Bustos Aug 01 '16 at 21:05
0

I assume you want the version number to be 2 for all the new rows (or 1 more than the existing max versionNum), since that's the only way this seems to make sense. First get the merged set of records by using UNION without the ALL keyword:

select col1, col2, col3, col4
  from firstTable
union
select col1, col2, col3, col4
  from secondTable

Call this "query1". Now join to firstTable to mark up the original records:

select a.*,
       f.versionNum
  from (query1) a,
       firstTable f
 where a.col1 = f.col1
   and a.col2 = f.col2
   and a.col3 = f.col3
   and a.col4 = f.col4

This is query2. Anything with no versionNum is new and should have versionNum = 2. So:

select b.col1,
       b.col2,
       b.col3,
       b.col4,
       2 as versionNum
  from (query2) b
 where b.versionNum is null

are the records you want to insert.

If you need to make the versionNum one greater than the current max in the table, you can just do

(select max(versionNum) from firstTable) + 1

instead of 2 as the versionNum.

Mike Christie
  • 381
  • 1
  • 10
  • Interesting method, Mike, never thought of doing things this way - Do you have any idea on the efficiency of this vs a merge statement? – John Bustos Aug 01 '16 at 19:39
  • I suspect, with no proof at all, that if you do this with a merge statement the merge is likely to be quicker. Personally I tend to construct things this way unless I suspect performance is going to be an issue because building the query inside out like this makes it easier to read. You may not agree, of course! If you try it with a merge let me know -- I'd be interested in the performance results. – Mike Christie Aug 01 '16 at 19:43
  • This does not cover partial matches – tomislav_t Aug 01 '16 at 20:13
  • @tomislav_t can you explain why? Any record with a partial match will fail to pick up a version number on the union, and so will end up in the dataset to insert at the end. – Mike Christie Aug 01 '16 at 20:35
  • In the original example we had (Val12a,Val22,Val32,Val42) lead to version 2 of the record (Val12, Val22, Val32, Val42). If we would have one more existing record (Val10, Val22, Val32, Val42) with version number 3, what would be the version number of the newly added record? 4 or 2? – tomislav_t Aug 01 '16 at 21:29
  • Per the OP's comments above, all the records from this insertion will have version 2. There's no sensible way to give different versions because a record in the new table might have partial matches with multiple different records, which would imply it would need multiple different version numbers on the same record. – Mike Christie Aug 01 '16 at 21:59
  • Mike, I do believe your answer does exactly what I wanted and thank you - But after a bit of playing, the MERGE statement managed to meet my requirements with just less coding (and, I guess, being more in-sync with how Microsoft wants things like this done). But thank you for a very interesting solution that I'm sure will come in handy in the future (I almost never use just UNION without the ALL portion and it's good to remember the difference!!) – John Bustos Aug 02 '16 at 13:03