Compare different orders of the same table

Question

I have this following scenario, a table with these columns:

table_id|user_id|os_number|inclusion_date

In the system, the os_number is sequential for the users, but due to a system bug some users inserted OSs in wrong order. Something like this:

table_id | user_id | os_number | inclusion_date
-----------------------------------------------
1        | 1       | 1         | 2015-11-01
2        | 1       | 2         | 2015-11-02
3        | 1       | 3         | 2015-11-01

Note the os number 3 inserted before the os number 2

What I need:

Recover the table_id of the rows 2 and 3, which is out of order.

I have these two select that show me the table_id in two different orders:

select table_id from table order by user_id, os_number

select table_id from table order by user_id, inclusion_date

I can't figure out how can I compare these two selects and see which users are affected by this system bug.

intersection of the two select statements will result in those table_id which hasn't got affected by the bug. thus a sql query excluding this specific table_id shall give you the list of table_id that got affected. I might be wrong though. — souravlahoti, Nov 30 '15 at 12:16
The real question is: why do you have the `os_number` at all? It seems redundant if it can be "calculated" by the order of the `inclusion_date` — , Nov 30 '15 at 13:24
@a_horse_with_no_name it could be a simplified version of his query. For instance, os_number might be a foreign key which would indirectly point to a version number. — Julien Blanchard, Nov 30 '15 at 15:49
@JulienBlanchard: that's why I wrote "it *seems* redundant" - given the information in the question it is - at least for me. — , Nov 30 '15 at 15:50

score 2 · Answer 1 · answered Nov 30 '15 at 14:44

2

Your question is a bit difficult because there is no correct ordering (as presented) -- because dates can have ties. So, use the rank() or dense_rank() function to compare the two values and return the ones that are not in the correct order:

select t.*
from (select t.*,
             rank() over (partition by user_id order by inclusion_date) as seqnum_d,
             rank() over (partition by user_id order by os_number) as seqnum_o
      from t
     ) t
where seqnum_d <> seqnum_o;

answered Nov 30 '15 at 14:44

Gordon Linoff

1,242,037
58
646
786

Well, you can always just break off the ties with `ROW_NUMBER() OVER (ORDER BY inclusion_date, os_number) seqnum_d`. Otherwise, your solution also runs into the issue of ties, since it will always return rows when different `os_numbers` are on the same `inclusion_date`. – Julien Blanchard Nov 30 '15 at 15:55
@JulienBlanchard . . . The issue isn't "breaking ties". For the sample data, both 1-3-2 and 3-1-2 meet the OPs requirements. The "ties" could result in correct orderings being described as incorrect. – Gordon Linoff Nov 30 '15 at 19:03
from what I understand, the issue is that `row_number()`'s result will vary based on the order of the table that is given to it when there are ties in its clause. Isn't that kinda irrelevant if you make sure there are no ties? – Julien Blanchard Nov 30 '15 at 19:17
@JulienBlanchard . . . The OP's sample data *has* ties. That makes `row_number()` inappropriate for this problem. – Gordon Linoff Nov 30 '15 at 19:24
Well, that's why I was suggestion to use `ORDER BY inclusion_date, os_number` instead of `ORDER BY inclusion_date` for the row number. Once you add in the `os_number`, there aren't anymore ties. Obviously, that's not a one-size-fits-all solution since it relies on the fact that you're comparing it to a row_number that's already ordered by `os_number` (so basically, you break ties with the same order as whatever you're comparing it to). – Julien Blanchard Nov 30 '15 at 19:46

score 1 · Answer 2 · answered Nov 30 '15 at 13:22

1

Use row_number() over both orders:

select *
from (
    select *, 
        row_number() over (order by os_number) rnn,
        row_number() over (order by inclusion_date) rnd
    from a_table
    ) s
where rnn <> rnd;

 table_id | user_id | os_number | inclusion_date | rnn | rnd 
----------+---------+-----------+----------------+-----+-----
        3 |       1 |         3 | 2015-11-01     |   3 |   2
        2 |       1 |         2 | 2015-11-02     |   2 |   3
(2 rows)

answered Nov 30 '15 at 13:22

klin

112,967
15
204
232

This is wrong since he wrote in question "`the os_number` is sequential for the users" so there is need of define partition for window function. So basicly it end up with exacly the same answer I've posted erlier. – Gabriel's Messanger Nov 30 '15 at 14:15
2

Stackoverflow is not a race. You shouldn't downvote a correct answer for the sole reason that your one was earlier. It's not fair. – klin Nov 30 '15 at 14:26
@Gabriel'sMessanger: "sequential" alone is not enough. `os_number` must also be **gapless** for your solution to work. – Nov 30 '15 at 14:32
@klin It's not speed what I am talking about or meant in any way. SO is not a race -- agreed on that one. I just said that **your answer isn't correct** because of no `PARTITION BY`. Talking about "posted erlier" I just meant that after correction you answer didn't add value that already posted one (which coincidentally was mine). – Gabriel's Messanger Nov 30 '15 at 14:44
@a_horse_with_no_name you're right, first version of my answer works also if `oc_number` isn't gapless. I didn't see that while I want to simplify my asnwer. I've go back to my first version and annotate why it was done. – Gabriel's Messanger Nov 30 '15 at 14:54
also, you should really change the `row_number() over (order by inclusion_date) rnd` to `row_number() over (order by inclusion_date, os_number) rnd`. Otherwise, this means that you rely on your table still being ordered by `os_number` from the previous query (or at least mention that the 2 fields cannot be swapped). Because if the 2 lines are swapped (`select *, rnd, rnn`) and the OSes are inserted on the same date in a messy order, you will get false positives. – Julien Blanchard Nov 30 '15 at 15:59

score 1 · Answer 3 · answered Nov 30 '15 at 14:23

1

Not entirely sure about the performance on this but you could use a cross apply on the same table to get the results in one query. This will bring up the pairs of table_ids which are incorrect.

select 
    a.table_id as InsertedAfterTableId,
    c.table_id as InsertedBeforeTableId
from table a
cross apply 
(
    select b.table_id
    from table b
    where b.inclusion_date < a.inclusion_date and b.os_number > a.os_number
) c

answered Nov 30 '15 at 14:23

momar

364
1
3

1

There is no `cross apply` in Postgres - the equivalent in standard SQL (and Postgres) would be a `cross join lateral` – Nov 30 '15 at 14:28

score 0 · Answer 4 · edited May 23 '17 at 11:59

I would use WINDOW FUNCTIONS to get row numbers in orders in question and then compare them:

SELECT
    sub.table_id,
    sub.user_id,
    sub.os_number,
    sub.inclusion_date,
    number_order_1, number_order_2
FROM (
    SELECT
        table_id,
        user_id,
        os_number,
        inclusion_date,
        row_number() OVER (PARTITION BY user_id
                           ORDER BY os_number
                                    ROWS BETWEEN UNBOUNDED PRECEDING
                                    AND UNBOUNDED FOLLOWING
        ) AS number_order_1,

        row_number() OVER (PARTITION BY user_id
                           ORDER BY inclusion_date
                                    ROWS BETWEEN UNBOUNDED PRECEDING
                                    AND UNBOUNDED FOLLOWING
        ) AS number_order_2
    FROM
        table
) sub    
WHERE
    number_order_1 <> number_order_1
;

EDIT:

Because of a_horse_with_no_name made good point about my final answer. I've back to my first answer (look in edit history) which work also if os_number isn't gapless.

score 0 · Answer 5 · answered Nov 30 '15 at 13:20

Both query examples given below simply check a mismatch between inclusion date and os_number:

This first query should return the offending row (the one whose os_number is off from its inclusion date)--in the case of the example row 3.

select table.table_id, table.user_id, table.os_number from table
 where EXISTS(select * from table t
 where t.user_id = table.user_id and
       t.inclusion_date > table.inclusion_date and
       t.os_number < table.os_number);

This second query will return the table numbers and users for two rows that are mismatched:

 select first_table.table_id, second_table.table_id, first_table.user_id from
  table first_table 
  JOIN table second_table
  ON (first_table.user_id = second_table.user_id and
      first_table.inclusion_date > second_table.inclusion_date and
      first_table.os_number < second_table.os_number);

score 0 · Answer 6 · answered Nov 30 '15 at 15:50

select *
from (
  select a_table.*,
  lag(inclusion_date) over (partition by user_id order by os_number) as last_date
  from a_table
) result
where last_date is not null AND last_date>inclusion_date;

This should cover gaps as well as ties. Basically, I simply check the inclusion_date of the last os_number, and make sure it's not strictly greater than the current date (so 2 version on the same date is fine).

Compare different orders of the same table

6 Answers6