4

I have a need to identify differences between two tables. I have looked at sql query to return differences between two tables but it was a bit too different for me to extrapolate with my current SQL skills.

Table A is a snapshot of a certain group of people where the snapshot was taken yesterday, where each row is a unique person and certain characteristics about the person. Table B is the same snapshot taken 24 hours later. Within the 24 hour period:

  1. New people may have been added.
  2. People from yesterday may have been removed.
  3. People from yesterday may have changed (i.e., original row is there, but one or more column values have changed).

My output should have the following:

  1. a row for each new person added
  2. a row for each person removed
  3. a row for each person who has changed

I would grateful for any ideas. Thanks!

Community
  • 1
  • 1
Tim Clotworthy
  • 95
  • 2
  • 11

6 Answers6

5

This type of problem has a very simple and efficient solution that does not use joins (it doesn't even use a union of the results of two MINUS operations) - it just uses one union and a GROUP BY operation. The solution was developed in a thread on AskTom many years ago, it is surprising that it is not more widely known and used. For example (but not only): https://asktom.oracle.com/pls/apex/f?p=100:11:0::::P11_QUESTION_ID:24371552251735

In your case, assuming there is a primary key constraint on PERSON_ID (which makes the solution simpler):

select max(flag) as flag, PERSON_ID, first_name, last_name, (etc. - all the columns)
from   ( select 'old' as flag, t1.*
           from old_table t1
         union all
         select 'new' as flag, t2.*
           from new_table t2
       )
group by PERSON_ID, first_name, last_name, (etc.)
having  count(*) = 1
order by PERSON_ID  --     optional
;

If for a PERSON_ID all the data is the same in both tables, that will result in a count of 2 for that group. So it won't pass the HAVING condition. The only groups that will have a count of 1 (and therefore will be just one row each!) are either rows that are in one table but not the other. If a person was added, that will show only one row, with the flag = 'new'. If a person was deleted, you will get only one row, with the flag 'old'. If there were updates, the same PERSON_ID will appear twice, but since at least one field is different, the two rows (one with flag 'new' and the other with 'old') will be in different groups, they will pass the HAVING filter, and they will BOTH be in the output.

Which is slightly different from what you requested; you will get both the old AND the new information for updates, labeled as 'old' and 'new'. You said you wanted only one of those but didn't state which one. This will give you both (which makes more sense anyway), but if you really only want one, it can be done easily in the query above.

Note - the outer select must have max(flag) rather than flag because flag is not a GROUP BY column; but it's the max() over exactly one row, so it WILL be the flag for that row anyway.

Added - OP indicated he would like to get only the "new" row for a person with updated (changed, modified) data. The approach shown below will change the flag to "changed" in this case.

with old_table ( person_id, first_name, last_name ) as (
       select 101, 'John', 'Smith' from dual union all
       select 102, 'Mary', 'Green' from dual union all
       select 103, 'July', 'Dobbs' from dual union all
       select 104, 'Will', 'Scott' from dual
     ),
     new_table ( person_id, first_name, last_name ) as (
       select 101, 'Joe' , 'Smith' from dual union all
       select 102, 'Mary', 'Green' from dual union all
       select 104, 'Will', 'Scott' from dual union all
       select 105, 'Andy', 'Brown' from dual
     )
-- end of test data; solution (SQL query) begins below this line
select case ct when 1 then flag else 'changed' end as flag,
       person_id, first_name, last_name
from   (
select max(flag) as flag, person_id, first_name, last_name,
       count(*) over (partition by person_id) as ct,
       row_number() over (partition by person_id order by max(flag)) as rn
from   ( select 'old' as flag, t1.*
           from old_table t1
         union all
         select 'new' as flag, t2.*
           from new_table t2
       )
group by person_id, first_name, last_name
having  count(*) = 1
)
where rn = 1
order by person_id   --  ORDER BY clause is optional
;

Output:

FLAG     PERSON_ID FIRS_NAME LAST_NAME
------- ---------- --------- ---------
changed        101 Joe       Smith
old            103 July      Dobbs
new            105 Andy      Brown
  • Thanks mathguy. This worked great. So simple and elegant. One thing would improve it. As you stated, I didn't say whether for "changed" I want old, new, or both. Actually, my use case is simple in that I only need new. Do you know of a way to modify this so I can get a "changed" row rather than a pairing of the old and new row? Thanks again! – Tim Clotworthy Nov 07 '16 at 13:32
  • I will edit the answer to add that. I will change the flag to `changed` and provide only the row with the new data. (Not sure if you need to be able to differentiate between `new` and `changed/modified/updated`.) –  Nov 07 '16 at 13:42
  • I am trying your updates. I am getting an error: ORA-00905: missing keyword 00905. 00000 - "missing keyword" *Cause: *Action: Error at Line: 1 Column: 50. That corresponds to "as flag". Any ideas? – Tim Clotworthy Nov 07 '16 at 15:53
  • @TimClotworthy - Yes, I did it again, I left our the word END at the end of the CASE expression. Sorry. That can happen when I don't have test data to test the solutions on. :-) I fixed it now. Please let me know if you see other problems. Good luck! –  Nov 07 '16 at 17:04
  • sorry, I am still getting the error with your above code. I posted my code as answer as reply to my thread (it was too much to put in a comment). Is there anything you can see that would be causing the "missing keyword error? Thanks again! – Tim Clotworthy Nov 07 '16 at 19:50
  • @TimClotworthy - weird, my "edit" didn't take. I'll try again, but in any case, what is missing is **end** in the first line: `else 'changed' END as flag`... –  Nov 07 '16 at 20:56
  • Edited (again... or not?) just now, please give it another try. –  Nov 07 '16 at 20:57
  • Sorry, added the "end". New error. Something it does not like in the vicinity of "count() over" : ORA-00923: FROM keyword not found where expected 00923. 00000 - "FROM keyword not found where expected" *Cause: *Action: Error at Line: 4 Column: 20 – Tim Clotworthy Nov 08 '16 at 12:03
  • Right - that should be count(*). –  Nov 08 '16 at 12:17
  • Now you see why it is helpful to have a small sample of input data for testing! :-) When the original post provides test data, I always test my solutions before posting them. –  Nov 08 '16 at 12:18
  • I should have seen the missing *, however I am still getting the identical error. I am sorry for not providing input data. I will do so shortly when I get a moment – Tim Clotworthy Nov 08 '16 at 12:50
  • Do you have a comma after "your" last column and before count(\*)? The compiler is looking for the word FROM, which seems to suggest you don't have that comma before count(*). –  Nov 08 '16 at 12:52
  • Oh, I see you posted your query in a different Answer, and indeed, you are missing a comma after LASTNAME and before count(\*)! –  Nov 08 '16 at 12:54
  • Ok, I fixed that. Now getting a: ORA-00904: "FLAG": invalid identifier 00904. 00000 - "%s: invalid identifier" *Cause: *Action: Error at Line: 1 Column: 30 – Tim Clotworthy Nov 08 '16 at 13:03
  • Yep, where I selected `max(flag)` I should have renamed the result `as flag`. Editing now. –  Nov 08 '16 at 13:06
  • If I understand you correctly, rather than "select max(flag), PERSON_ID...", etc, it should say "select max(flag) as flag, PERSON_ID...", etc. If that is the case, I am getting a new error: ORA-00979: not a GROUP BY expression 00979. 00000 - "not a GROUP BY expression" *Cause: *Action: Error at Line: 6 Column: 53 – Tim Clotworthy Nov 08 '16 at 13:34
  • Ok... let's work on the sample input data, and let me debug the code. –  Nov 08 '16 at 13:59
  • Sorry for delay. Will the following two table exports will be sufficient? – Tim Clotworthy Nov 08 '16 at 14:54
  • REM INSERTING into TODAY_TABLE SET DEFINE OFF; Insert into TODAY_TABLE (PERSON_ID,FIRSTNAME,LASTNAME) values (123,'Geroge','Washington'); Insert into TODAY_TABLE (PERSON_ID,FIRSTNAME,LASTNAME) values (234,'John','Adams'); Insert into TODAY_TABLE (PERSON_ID,FIRSTNAME,LASTNAME) values (890,'HHarry','Truman'); Insert into TODAY_TABLE (PERSON_ID,FIRSTNAME,LASTNAME) values (987,'William','Taft'); – Tim Clotworthy Nov 08 '16 at 14:54
  • REM INSERTING into YESTERDAY_TABLE SET DEFINE OFF; Insert into YESTERDAY_TABLE (PERSON_ID,FIRSTNAME,LASTNAME) values (123,'Geroge','Washington'); Insert into YESTERDAY_TABLE (PERSON_ID,FIRSTNAME,LASTNAME) values (345,'Abraham','Lincoln'); Insert into YESTERDAY_TABLE (PERSON_ID,FIRSTNAME,LASTNAME) values (890,'HHarry','Truman'); Insert into YESTERDAY_TABLE (PERSON_ID,FIRSTNAME,LASTNAME) values (987,'WWilliam','Taft'); Insert into YESTERDAY_TABLE (PERSON_ID,FIRSTNAME,LASTNAME) values (876,'James','Monroe'); – Tim Clotworthy Nov 08 '16 at 14:56
  • @TimClotworthy - OK, I created some fake inputs in a WITH clause (not part of the solution) and fixed the SQL query. It's tested now. I fixed a few other mistakes - for example I was selecting from the same table ('old-table') which I now fixed. Please give it a try. –  Nov 08 '16 at 15:14
  • Oh - for some reason I didn't see your test data in my window but I see it now. Anyway - it should work just the same. –  Nov 08 '16 at 15:15
  • Thanks mathguy, you nailed it. How can I repay you? Do you need your leaves raked or your garage cleaned out? Thanks again. You went way beyond what I deserved for help! – Tim Clotworthy Nov 08 '16 at 18:59
  • @TimClotworthy - No worries, I am just happy if I can help. "Paying it forward" if you are familiar with the concept. –  Nov 08 '16 at 20:39
0

The first 2 parts are easy: select 'New', name from B where not exists (select name from A where A.name=B.name) union select 'Removed', name from A where not exists (select name from B where B.name = A.name)

The last one is where you need to compare characteristics. How many of them are there? Do you want to list what has changed or only that they have changed?

For argument's sake, let us only say that the characteristics are address and telephone #: union select 'Phone', name from A,B where A.name = B.name and A.telephone != B.telephone union select 'Address', name from A,B where A.name = B.name and A.address != B.address

Tom K.
  • 37
  • 4
0

Note: The question isn't currently tagged with the dbms. I use sql-server, so that's what I used to write the below. There may be slight differences in another dbms.

You can do something along these lines:

select  *
from    TableA a
        left join TableB b on b.ID = a.ID
where   a.ID is null -- added since yesterday
union
select  *
from    TableA a
        left join TableB b on b.ID = a.ID
where   b.ID is null -- removed since yesterday
union
select  *
from    TableA a
        inner join TableB b on b.ID = a.ID -- restrict to records in both tables
where   a.SomeValue <> b.SomeValue
or      a.SomeOtherValue <> b.SomeOtherValue
--etc

Each select handles one portion of your expected output. In this manner, they'd all be joined into 1 result set. If you drop the union, you'll end up with a separate set for each.

levelonehuman
  • 1,465
  • 14
  • 23
0

I suggest to use Except to get the changed records. The below query should work if the db is sql server.

-- added since yesterday
SELECT B.*
FROM TableA A
LEFT Outer Join TableB B on B.ID = A.ID
WHERE A.ID IS NULL

UNION

-- removed since yesterday
SELECT A.*
FROM TableA A
LEFT OUTER JOIN TableB B on B.ID = A.ID
WHERE B.ID IS NULL 

UNION

-- Those changed with values from yesterdady
SELECT B.* FROM TableB B WHERE EXISTS(SELECT A.ID FROM TableA A WHERE A.ID = B.ID) 
EXCEPT
SELECT A.* FROM TableA A WHERE EXISTS(SELECT B.ID FROM TableB B WHERE B.ID = A.ID)
0

Assuming you have a unique id for each person in the able, you can use full outer join:

select coalesce(ty.customerid, tt.customerid) as customerid,
      (case when ty.customerid is null then 'New'
            when tt.customerid is null then 'Removed'
            else 'Modified'
       end) as status
from tyesterday ty full outer join
     ttoday tt
     on ty.customerid= tt.customerid
where ty.customerid is null or
      tt.customerid is null or
      (tt.col1 <> ty.col1 or tt.col2 <> ty.col2 or . . . );  -- may need to take `NULL`s into account
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
0

mathguy provided a successful answer to my initial problem. I asked him for a revision (to make it even better). He provided a revision, but I am getting a "missing keyword" error when executing against my code. Here is my code:

select case when ct = 1 then flag else 'changed' as flag, PERSON_ID, FIRSTNAME, LASTNAME
from (
   select max(flag), PERSON_ID, FIRSTNAME, LASTNAME
          count() over (partition by PERSON_ID) as ct,
          row_number() over (partition by PERSON_ID 
                             order by case when flag = 'new' then 0 end) as rn
   from   ( select 'old' as flag, t1.*
              from YESTERDAY_TABLE t1
            union all
            select 'new' as flag, t2.*
              from TODAY_TABLE t2
          )
   group by PERSON_ID, FIRSTNAME, LASTNAME
          having  count(*) = 1
 )
where rn = 1
order by PERSON_ID;
Tim Clotworthy
  • 95
  • 2
  • 11