SQL for comparing datasets

Question

I frequently have a need to compare two datasets to see what rows are different. In Excel, with a small dataset, it's easy enough to concatenate the fields I want to compare, then countif() that new field matches the value of my other dataset row by row. But, when the dataset gets too large, it crashes Excel, of course. I've always thought it was a bit of a hack way of doing it, and I'd like to know a more appropriate way to do it in SQL.

So, here's a more visual example: Dataset Original

Abc, 123, 1/1/1900
A23, 4.5, 2/1/1900

New Comparison Dataset

Abc, 123, 1/1/1900
A23, 4.6, 2/1/1900

Dataset Original in Excel

Abc, 123, 1/1/1900, =concatenate(A1,B1,C1), =countif('New'!D:D,C1)
A23, 4.5, 2/1/1900, =concatenate(A2,B2,C2), =countif('New'!D:D,C2)

New Comparison Dataset in Excel

Abc, 123, 1/1/1900, =concatenate(A1,B1,C1), =countif('Original'!D:D,C1)
A23, 4.6, 2/1/1900, =concatenate(A2,B2,C2), =countif('Original'!D:D,C2)

If I filter by the last column in both sheets, I get the two different rows, and can see that one has 4.5 and the other has 4.6.

I'd like to not only be able to see the two sets of unmatched records, I need to also be able to compare them based on only certain fields/columns matching, not necessarily the whole row.

I have a feeling the answer is similar to using =sumproduct(--(field=value),--(field<value),range) which is another way I compare things in smaller Excel datasets.

Thanks for the assistance.

`...If I filter by the last column...` What do you meant by "filter"? — The Impaler, Aug 13 '18 at 21:17
@TheImpaler I'd like to know irrespective of syntax, because I'm dealing with different backends, but Transact-SQL is most likely. When I say filter, I mean using Excel's Autofilter that's been renamed Filter, that turns the header row into selectors, where 0 is stuff that doesn't match, and anything greater than that just tells me how many duplicates I have. — undrline - Reinstate Monica, Aug 13 '18 at 21:22
This might help with part of what I'm looking for: https://stackoverflow.com/questions/582637/sql-server-equivalent-of-a-countif-aggregate-function — undrline - Reinstate Monica, Aug 13 '18 at 21:25

Kirk Broadhurst · Answer 1 · 2018-08-14T15:08:52.487

1

If you're using SQL Server, the simplest way is using the EXCEPT keyword.

SELECT * FROM table1
EXCEPT
SELECT * from table2

The command is pretty well named - it selects everything from the first result except those rows that exist in the second result, i.e. it 'filters' everything that is a complete and exact match in the second.

If you want to perform that 'match' on only some columns, you could select only those columns within the EXCEPT query, and then join back to the original source to get the rest of the columns.

edited Aug 14 '18 at 15:08

answered Aug 13 '18 at 21:28

Kirk Broadhurst

27,836
16
104
169

That sounds like the straightforward kind of thing I'm looking for ... I've been seeing solutions that mess around with aggregates to try to get at it. I'm sure the "correct" way is a full outer join, but I couldn't figure out how to get the column-by-column match. I'll try it in the morning, thanks. – undrline - Reinstate Monica Aug 14 '18 at 02:05
It helped a lot. Thank you so much. But I can't seem to get the numbers to match up from what I can produce in Excel, which bothers me and I'm trying to figure out. It only solves my issue if I only want unmatched. If I want "In Original Only","In New Only" I can do the same in reverse. Then if I want "In Both" I can do an inner join, but can't get at duplicates. I'm looking to see if maybe this would help: https://forums.asp.net/post/4376229.aspx – undrline - Reinstate Monica Aug 14 '18 at 15:06
1

To get 'in both' try to `INTERSECT` keyword. It's another well named command - will return the rows that appear (exactly) in both queries. Don't forget to mark my answer correct if it works. – Kirk Broadhurst Aug 14 '18 at 15:08
1

At the risk of violating the chat rule, I want to indicate that I'm really busy, and won't get to look at this for a little while. I want to come up with a comprehensive answer and update my question for clarity and give you the opportunity to edit yours with all our back-and-forth, then mark this solved and have it be helpful for others. I appreciate the time and helpfulness, and don't want you to think I'm leaving you hanging. – undrline - Reinstate Monica Aug 16 '18 at 13:13
I think there's something I still need. If I want to compare only certain columns, but retain the rest record by record, I run into trouble, since EXCEPT and INTERSECT need to be normalized like a UNION and compare the rows the same way. So, if I have a foreign key column, it shows all those rows as different, though it's not the column I want to compare, and if I exclude the key column, I can't get it back in the end result. – undrline - Reinstate Monica Aug 20 '18 at 20:24
Re-reading, I see "join back to the original source" ... can you give an example of that? Here's the generic example I'm creating: https://pastebin.com/ckdvxKk6 – undrline - Reinstate Monica Aug 20 '18 at 20:49

SQL for comparing datasets

1 Answers1