3

I have a PostgreSQL database that contains traffic tickets written by a few jurisdictions.

Some jurisdictions don't indicate if multiple tickets are written in one traffic stop. However, that can be inferred by analyzing other fields. Consider this data:

ticket_id  timestamp            drivers_license
----------------------------------------------
1          2008-08-07 01:51:00  11111111
2          2008-08-07 01:51:00  11111111
3          2008-08-07 02:02:00  22222222
4          2008-08-07 02:25:00  33333333
5          2008-08-07 04:23:00  44444444
6          2008-08-07 04:23:00  55555555
7          2008-08-07 04:23:00  44444444

I can infer that:

  • Tickets 1 & 2 were written in a single traffic stop because they share driver's license numbers and timestamps.
  • Same for 5 & 7, but notice how ticket 6 is between them. Perhaps another officer was writing a ticket at the same time somewhere else, or data entry operators entered stuff out of order.

I would like to add another column that has a unique ID for each traffic stop. It will not be a primary key for the table because it will have duplicate values. For example:

ticket_id  timestamp            drivers_license  stop_id
--------------------------------------------------------
1          2008-08-07 01:51:00  11111111         1
2          2008-08-07 01:51:00  11111111         1
3          2008-08-07 02:02:00  22222222         2
4          2008-08-07 02:25:00  33333333         3
5          2008-08-07 04:23:00  44444444         4
6          2008-08-07 04:23:00  55555555         5
7          2008-08-07 04:23:00  44444444         4

I can think of computationally-intensive, greedy algorithm ways of doing this with C#, but is there an efficient SQL query that can work?

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
Aren Cambre
  • 6,540
  • 9
  • 30
  • 36

4 Answers4

3

If you employ the window function rank() this becomes amazingly simple:

SELECT *
      ,rank() OVER (ORDER BY ts, drivers_license)
FROM   tbl
ORDER  BY ticket_id

Returns exactly what you asked for.

I renamed your column timestamp to ts, because timestamp is a type name in PostgreSQL and a reserved word in every SQL standard.

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
1

Probably your best bet is to create a new table (say, "stops") with DISTINCT timestamps and drivers' license #s, assign row numbers, then update the ticket table from that new table.

David Faber
  • 12,277
  • 2
  • 29
  • 40
1

Efficient SQL Query FTW!

I'm not at a computer that I can test this on so there is likely some syntax problems; I will fix in the morning, but it is something like this:

WITH uniquez as (SELECT timestamp, drivers_license, 
rank() over (ORDER BY timestamp, drivers_license) as counterz 
FROM ticketTable)

UPDATE ticketTable TT
SET stop_id = uniquez.counterz
WHERE uniquez.timestamp = TT.timestamp
AND uniquez.drivers_license = TT.drivers_license

Basically, you make a select that groups (partitions) by timestamp and drivers_license and have a row counter that goes with it. When you do the update, you use this previous select table's row counter as your "stop_id" and updates the columns that match the timestamp and drivers license.

ImGreg
  • 2,946
  • 16
  • 43
  • 65
  • 1
    Whether postgreSQL allows it or not, it's a very bad idea to use ROW_NUMBER() without an ORDER BY clause. In any case, I don't think what you have here works at all. If you partition by timestamp, drivers_license, the row numbering will start at 1 again each time timestamp, drivers_license changes. If you change partition by to ORDER BY, you are closer, but I think you would want DENSE_RANK(), not row_number(). – Steve Kass Mar 08 '12 at 05:58
  • @SteveKass definitely agree. Late night programming will have that effect on things. I will edit. – ImGreg Mar 08 '12 at 14:25
  • -1 You just copied my correct version over your incorrect one without crediting. That's not the recommended way around here - to put it politely. – Erwin Brandstetter Mar 09 '12 at 02:24
  • I'm getting a strange error with the query I constructed based on this. I opened a separate question at http://stackoverflow.com/questions/9643859/postgres-missing-from-clause-entry-error-on-query-with-with-clause. – Aren Cambre Mar 10 '12 at 04:11
  • Turns out you're missing a FROM clause in the UPDATE statement. After the SET line, you need a **FROM uniquez**. – Aren Cambre Mar 10 '12 at 04:36
-1
SELECT ticket_id,timestamp,drivers_license,substr(drivers_license,1,1) as stop_id
FROM traffic_data;

Hope this works for u... :)

Teja
  • 13,214
  • 36
  • 93
  • 155
  • It's possible that a person could receive separate tickets in separate stops, so that couldn't work, unfortunately. – Aren Cambre Mar 08 '12 at 17:33