Add number to rows based on identical values in selected columns

Question

I have a PostgreSQL database that contains traffic tickets written by a few jurisdictions.

Some jurisdictions don't indicate if multiple tickets are written in one traffic stop. However, that can be inferred by analyzing other fields. Consider this data:

ticket_id  timestamp            drivers_license
----------------------------------------------
1          2008-08-07 01:51:00  11111111
2          2008-08-07 01:51:00  11111111
3          2008-08-07 02:02:00  22222222
4          2008-08-07 02:25:00  33333333
5          2008-08-07 04:23:00  44444444
6          2008-08-07 04:23:00  55555555
7          2008-08-07 04:23:00  44444444

I can infer that:

Tickets 1 & 2 were written in a single traffic stop because they share driver's license numbers and timestamps.
Same for 5 & 7, but notice how ticket 6 is between them. Perhaps another officer was writing a ticket at the same time somewhere else, or data entry operators entered stuff out of order.

I would like to add another column that has a unique ID for each traffic stop. It will not be a primary key for the table because it will have duplicate values. For example:

ticket_id  timestamp            drivers_license  stop_id
--------------------------------------------------------
1          2008-08-07 01:51:00  11111111         1
2          2008-08-07 01:51:00  11111111         1
3          2008-08-07 02:02:00  22222222         2
4          2008-08-07 02:25:00  33333333         3
5          2008-08-07 04:23:00  44444444         4
6          2008-08-07 04:23:00  55555555         5
7          2008-08-07 04:23:00  44444444         4

I can think of computationally-intensive, greedy algorithm ways of doing this with C#, but is there an efficient SQL query that can work?

I fail to see how simply adding another column is different from having already a foreign key to driver licenses ... — tereško, Mar 08 '12 at 04:05
Yup. A single motorist (single DL) could be cited on different occasions. — Aren Cambre, Mar 09 '12 at 03:55

score 3 · Answer 1 · answered Mar 08 '12 at 06:55

If you employ the window function rank() this becomes amazingly simple:

SELECT *
      ,rank() OVER (ORDER BY ts, drivers_license)
FROM   tbl
ORDER  BY ticket_id

Returns exactly what you asked for.

I renamed your column timestamp to ts, because timestamp is a type name in PostgreSQL and a reserved word in every SQL standard.

score 1 · Answer 2 · answered Mar 08 '12 at 04:10

1

Probably your best bet is to create a new table (say, "stops") with DISTINCT timestamps and drivers' license #s, assign row numbers, then update the ticket table from that new table.

answered Mar 08 '12 at 04:10

David Faber

12,277
2
29
40

ImGreg · Accepted Answer · 2012-03-08T14:27:32.907

1

Efficient SQL Query FTW!

I'm not at a computer that I can test this on so there is likely some syntax problems; I will fix in the morning, but it is something like this:

WITH uniquez as (SELECT timestamp, drivers_license, 
rank() over (ORDER BY timestamp, drivers_license) as counterz 
FROM ticketTable)

UPDATE ticketTable TT
SET stop_id = uniquez.counterz
WHERE uniquez.timestamp = TT.timestamp
AND uniquez.drivers_license = TT.drivers_license

Basically, you make a select that groups (partitions) by timestamp and drivers_license and have a row counter that goes with it. When you do the update, you use this previous select table's row counter as your "stop_id" and updates the columns that match the timestamp and drivers license.

edited Mar 08 '12 at 14:27

answered Mar 08 '12 at 04:18

ImGreg

2,946
16
43
65

1

Whether postgreSQL allows it or not, it's a very bad idea to use ROW_NUMBER() without an ORDER BY clause. In any case, I don't think what you have here works at all. If you partition by timestamp, drivers_license, the row numbering will start at 1 again each time timestamp, drivers_license changes. If you change partition by to ORDER BY, you are closer, but I think you would want DENSE_RANK(), not row_number(). – Steve Kass Mar 08 '12 at 05:58
@SteveKass definitely agree. Late night programming will have that effect on things. I will edit. – ImGreg Mar 08 '12 at 14:25
-1 You just copied my correct version over your incorrect one without crediting. That's not the recommended way around here - to put it politely. – Erwin Brandstetter Mar 09 '12 at 02:24
I'm getting a strange error with the query I constructed based on this. I opened a separate question at http://stackoverflow.com/questions/9643859/postgres-missing-from-clause-entry-error-on-query-with-with-clause. – Aren Cambre Mar 10 '12 at 04:11
Turns out you're missing a FROM clause in the UPDATE statement. After the SET line, you need a **FROM uniquez**. – Aren Cambre Mar 10 '12 at 04:36

score -1 · Answer 4 · answered Mar 08 '12 at 04:10

-1

SELECT ticket_id,timestamp,drivers_license,substr(drivers_license,1,1) as stop_id
FROM traffic_data;

Hope this works for u... :)

answered Mar 08 '12 at 04:10

Teja

13,214
36
93
155

It's possible that a person could receive separate tickets in separate stops, so that couldn't work, unfortunately. – Aren Cambre Mar 08 '12 at 17:33

Add number to rows based on identical values in selected columns

4 Answers4