How to identify cells where two rows differ?

Question

Sometimes I wonder if there is a good practise to identify the cell(s) in which two rows of a database table differ when they are supposed to be eqal. This is a very simple scenario for the problem:

Two tables with one row each:

select 1 c1, 2 c2, 3 c3 into #t1
select 1 c1, 2 c2, 4 c3 into #t2

The two rows are identified as different (It is not my question how to identify these rows. In this szenario the doublicate is identified by the union-operator. I could use a binary-checksum or whatever. This is not the question.)

select * from #t1
union
select * from #t2

In this case it is obvious (very simple and fast) to see that c3 (value 3 and 4)makes the difference between the two rows which are supposed to be equal.

1   2   3
1   2   4

How to identify the mismatching cells, when there are hunderts of columns? (Don't tell me not to create such tables...)

Is there a generic way? I'd prefer some kind of 'highlightning' the cells that differ. (Maybe use R-in-database and export to excel?...)

Another way to expain the problem:

I have a table where id should be an unique identifier - but it is not. I have to analyze the data in the table.

 if object_id('tempdb..#t1') is not null drop table #t1

 create table #t1 (
 id int,
 c01 int,
 c02 int,
 c03 int,
 c04 int,
 c05 int,
 c06 int,
 c07 int,
 c08 int,
 c09 int,
 c10 int,
 c11 int,
 c12 int,
 c13 int,
 c14 int,
 c15 int,
 c16 int,
 c17 int,
 c18 int,
 c19 int,
 c20 int,
 c21 int,
 c22 int,
 c23 int,
 c24 int,
 c25 int,
 c26 int,
 c27 int,
 c28 int,
 c29 int,
 c30 int,
 c31 int,
 c32 int,
 c33 int,
 c34 int,
 c35 int,
 c36 int,
 c37 int,
 c38 int,
 c39 int,
 c40 int,
 c41 int,
 c42 int,
 c43 int,
 c44 int,
 c45 int,
 c46 int,
 c47 int,
 c48 int,
 c49 int
 )

 insert #t1 (id, c11) values (1, 1)
 insert #t1 (id, c12) values (1, 1)

 insert #t1 (id, c11) values (2, 1)
 insert #t1 (id, c11) values (3, 1)

 insert #t1 (id, c21) values (4, 1)
 insert #t1 (id, c32) values (4, 1)

That's what I do: I have a look at the data "with problems", that is data with differnt values in some cells. This task is trivial when there are very few columns. When there are hundrets of columns (and not nearly everything is NULL like in this example) this a a demanding task.

 select a.* from #t1 a
 inner join (select id from #t1 group by id having count(*) > 1) b
 on a.id = b.id
 order by id

I would like to see the name of the cells that differ. In this example for example a resultset like this one would do.

 id columnname
 1  c11
 1  c12
 4  c21
 4  c32

Yellowfilled cells in SSMS would also be fine...

best is to do it in excel and not in DB, `union all` for each table, `condition formatting > highlight duplicate values` ... those that aren't highlighted is what you looking for — Veljko89, Jan 30 '18 at 13:22
And if each table contains 10 rows what are you expecting? It is not at all clear to me what you are trying to accomplish here. — Sean Lange, Jan 30 '18 at 14:30
The two rows are just samples for two single rows that differ in single cells (and should be identic). Like in a "group by over all columns having count(*) > 1" analyzing doublicates. Maybe I should change the query to union to clearify that? — Christian4145, Jan 30 '18 at 14:46
I would definitely create a couple of test tables with about 10 rows each and post those. Then explain what you want as output. — Sean Lange, Jan 30 '18 at 14:55

John Cappelletti · Accepted Answer · 2018-01-30T16:30:12.860

2

The following will "dynamically" unpivot your data by using a little XML.

UnPivot would be more performant, but at least here you don't have to specify all the fields

Example

;with cte as (
Select A.ID
      ,C.*
      ,Fields = sum(1) over (Partition by A.ID)
      ,Vals   = sum(1) over (Partition by A.ID,Field,Value)
 From #t1 A
 Cross Apply ( values (cast((Select A.* for XML RAW) as xml))) B(XMLData)
 Cross Apply (
                Select Field = a.value('local-name(.)','varchar(100)')
                      ,Value = a.value('.','varchar(max)') 
                 From  B.XMLData.nodes('/row')  as C1(n)
                 Cross Apply C1.n.nodes('./@*') as C2(a)
                 Where a.value('local-name(.)','varchar(100)') not in ('id','OtherExcludeColumns')
             ) C
) 
Select ID
      ,Field
 From  cte
 Where Fields>1 and Vals=1

Returns

ID  Field
1   c11
1   c12
4   c21
4   c32

EDIT

A shortcut just to find rows which have different values

Select A.ID
 From  #T1 A
 Cross Apply ( values ((Select A.* for XML RAW) )) B(XMLData) 
 Group by A.ID
 Having min(XMLData)<>max(XMLData)

Returns

ID
1
4

edited Jan 30 '18 at 16:30

answered Jan 30 '18 at 16:23

John Cappelletti

79,615
7
44
66

Didn't get the idea ot your where clause completely, but if there are identic values in a unique row they would be counted as different. See in insert for id2: insert #t1 (id, c11, c12, c13) values (2, 1,1,1) – Christian4145 Jan 30 '18 at 16:35
@Christian4145 I'll take a second peek. the sample data was a little sparse so I may need to tweak the logic. – John Cappelletti Jan 30 '18 at 16:37
@Christian4145 I think were good. (assuming you're using the updated answer). For example, I mate C11 = 1 for ID 1 and it returned C12 for ID 1 – John Cappelletti Jan 30 '18 at 16:42
@Christian4145 If it helps, just make the final Select as "Select * from cte" you may get a better visual. – John Cappelletti Jan 30 '18 at 16:44
If you try C12 = 1 for ID 2 I think it is not working – Christian4145 Jan 30 '18 at 16:51
OK- I got it. Nice one! You just have to filter the problematic rows before. – Christian4145 Jan 30 '18 at 16:55
@Christian4145 Well done! Figured you could drag it over the finish line. – John Cappelletti Jan 30 '18 at 16:57
You can use `UNPIVOT` without specifying all the fields if you generate the list of columns dynamically. See https://stackoverflow.com/a/13377114/108326. – markusk Jan 31 '18 at 08:11
@markusk very true. if you don't mind going dynamic. Also, this approach is not limited to a table or view... the source can be any query with or without derived values. – John Cappelletti Jan 31 '18 at 13:04

markusk · Answer 2 · 2018-01-31T08:09:49.377

0

Use UNPIVOT to convert the table to one row per column, and compare matching columns:

 SELECT tu1.cell_name, tu1.cell_value, tu2.cell_value
 FROM (SELECT cell_name, cell_value
       FROM #t1
       UNPIVOT (cell_value FOR cell_name IN c1, c2, c3)) tu1
 JOIN (SELECT cell_name, cell_value
       FROM #t2
       UNPIVOT (cell_value FOR cell_name IN c1, c2, c3)) tu2
 ON (tu1.cell_name = tu2.cell_name AND tu1.cell_value <> tu2.cell_value);

Untested, SQL Fiddle is acting up.

If you don't know the columns in advance, see https://stackoverflow.com/a/13377114/108326 for how to call UNPIVOT dynamically. This approach retrieves the column list from database metadata, then generates an UNPIVOT query on the fly.

edited Jan 31 '18 at 08:09

answered Jan 30 '18 at 15:10

markusk

6,477
34
39

This is no simple task as I don't know the name of the columns in advance. I am looking for a generic aproach. – Christian4145 Jan 30 '18 at 16:03
Extended my answer with a link to how to generate a column list dynamically, so that you can use `UNPIVOT` even if you don't know the name of the columns in advance. – markusk Jan 31 '18 at 08:10

How to identify cells where two rows differ?

2 Answers2