Add ID to duplicates in two columns

Question

I have a table representing trade exchanges between cities and I'd like to add an id that would indicate groups of same origin/destination and destination/origin alike.

For example:

| origin | destination
|--------|------------
| 8      | 2
| 2      | 8
| 8      | 2
| 8      | 5
| 8      | 5
| 9      | 1
| 1      | 9

would become:

| id | origin | destination
|----|--------|------------
| 0  | 8      | 2
| 0  | 2      | 8
| 0  | 8      | 2
| 1  | 8      | 5
| 1  | 8      | 5
| 2  | 9      | 1
| 2  | 1      | 9

I can have same origin/destination but I can also have origin/destination = destination/origin and I want all of those groups identified.

Erwin Brandstetter · Accepted Answer · 2019-02-05T16:45:48.233

2

One way: with the window function dense_rank() and GREATEST / LEAST:

SELECT dense_rank() OVER (ORDER BY GREATEST(origin, destination)
                                 , LEAST   (origin, destination)) - 1 AS id
     , origin, destination
FROM   trade;

db<>fiddle here

- 1 to start with 0 like your example.

edited Feb 05 '19 at 16:45

answered Feb 05 '19 at 16:33

Erwin Brandstetter

605,456
145
1,078
1,228

1

Would `dense_rank() OVER (ORDER BY array(SELECT unnest(array[origin, destination, ...]) ORDER BY 1))` be a reasonable way to generalize this to more than two columns? – unutbu Feb 05 '19 at 16:52
Thanks, it worked great. 4 sec for a one million entry table ! – BFlat Feb 05 '19 at 16:58
@unutbu: Depends on exact definition of groups. With more than two columns, it spreads out like a *binomial distribution*. Is `(a,b,c)` supposed to match `(a,c,b)` as well as `(b,a,c)` etc.? Consider: https://stackoverflow.com/a/18361539/939860 If column order is irrelevant, it is still considerably more expensive to unnest, sort, aggregate like in your example (`ORDER BY 1 DESC` to stay in sync). While it's all `integer` I'd consider faster module `intarray` for this. See: https://stackoverflow.com/a/12870508/939860 – Erwin Brandstetter Feb 05 '19 at 18:24
@unutbu: With `dense_rank() OVER (ORDER BY sort_desc(ARRAY[origin, destination, ...])) AS id` See: https://dbfiddle.uk/?rdbms=postgres_11&fiddle=6fd74f41d64005060041e60bdddd338f Really a valid new *question*. – Erwin Brandstetter Feb 05 '19 at 18:25

Add ID to duplicates in two columns

1 Answers1