0

I have a table that among the columns there are 2 of interest:

external_id unsigned int
processed_date date

I expect that the external_id is increasing along with the processed_date. But how can I verify this? I tried using a cartesian product like:

select * from tableA as a , tableA as b
where a.external_id > b.external_id and a.processed_date < b.processed_date

but it takes too long to finish.

Is there a better way to do this?

Jim
  • 18,826
  • 34
  • 135
  • 254
  • User variables to simulate LEAD function in other dbs. – Mihai Feb 02 '15 at 13:17
  • Is this for a one-off project, or do you plan to do it regularly? – Sergey Kalinichenko Feb 02 '15 at 13:19
  • @dasblinkenlight :One time thing – Jim Feb 02 '15 at 13:21
  • That's not really a Cartesian product. It's just a join. Properly indexed, it should be as fast any other method available. – Strawberry Feb 02 '15 at 13:32
  • @Strawberry The solution above will have to do O(N^2) comparisons even with the best of the indexes present. This may be prohibitively expensive, especially when you can do it in O(N*LogN) at the expense of doubling the amount of storage needed (i.e. O(N) space). – Sergey Kalinichenko Feb 02 '15 at 13:34
  • @Strawberry:IMO it is a cartesian product. Each row in instance a is matched to each row of instance b on an inequality giving a huge result set. – Jim Feb 02 '15 at 22:00

1 Answers1

1

Since this is a one-time project, you can create a temporary table with row numbers, and then do the query that compares row N only to row N+1 (and rely on transitivity of < for all other rows):

SET @row_num:=0;
INSERT INTO my_temp (row_number, proc_date, ext_id)
    SELECT
        @row_num:=@row_num+1 as row_number
    ,   proc_date
    ,   ext_id
    FROM original_table
    ORDER BY proc_date

With row_number in place, you can search like this:

SELECT *
FROM my_temp a
JOIN my_temp b ON a.row_number = b.row_number+1
WHERE a.ext_id >= b.ext_id

The trick to this query is to identify the next row in the table sorted in ascending order by proc_date. But that is exactly what row_number+1 means. You may need to create an index on row_number, or declare it a unique key in order for this query to finish in reasonable time.

I was interested to see roughly how often it happens.

I would do it in a hybrid SQL/Java solution (or used whatever other language you may prefer). Firs, load external IDs alone, ordered on the date, into main memory, i.e.

SELECT ext_id FROM original_table ORDER BY proc_date

Then I would use an O(N*LogN) algorithm for counting the number of inversions. Here is an implementation in Java.

Community
  • 1
  • 1
Sergey Kalinichenko
  • 714,442
  • 84
  • 1,110
  • 1,523
  • I like this approach but what I think it misses the following case: If dateB is not directly after dateA and we have in the following order: {10, 20, 30, 40, 50, 2, 3, } then it will find only 2 and not 3 since 2,3 are one after the other. – Jim Feb 02 '15 at 14:22
  • @Jim Correct, a single reversal may count for many items in front of it. My understanding of the problem was that you suspect that the external IDs are assigned in ascending order, and all you wanted was to confirm or disprove this assumption. – Sergey Kalinichenko Feb 02 '15 at 14:38
  • You are right but I was interested to see roughly how often it happens. E.g. your example works but for a specific period gives only 2 rows where I know at least 10 more cases. So my interest is to find if not accurately at least roughly how often/how much are the numbers off. E.g. 5 in a period of 5 years is not significant. But 1500 is something that is very significant – Jim Feb 02 '15 at 14:48