DISTINCT with two array_agg (or one array_agg with tuple inside)?

Question

I've got the following query:

SELECT DISTINCT ON (ps.p)
  m.groundtruth, ps.p, ARRAY_AGG(m.anchor_id), ARRAY_AGG(m.id)
FROM
  measurement m
JOIN
  (SELECT unnest(point_array) AS p) AS ps
  ON ST_DWithin(ps.p, m.groundtruth, distance)
GROUP BY ps.p, m.groundtruth
ORDER BY ps.p, RANDOM()

The output looks like this:

groundtruth | p           | anchor_array | id_array
------------------------------------------------------
G1          | P1          | {1,3,3,3,4}  | {1,2,3,4,5}
G2          | P1          | {1,5,7}      | {6,7,8}
G1          | P2          | {1,3,3,3,4}  | {1,2,3,4,5}

Alternative query:

SELECT DISTINCT ON (ps.p)
  m.groundtruth, ps.p, ARRAY_AGG(row(m.anchor_id, m.id))
...

Output:

groundtruth | p           | combined_array
-----------------------------------------------------------
G1          | P1          | {(1,1),(3,2),(3,3),(3,4),(4,5)}
G2          | P1          | {(1,6),(5,7),(7,8)}
G1          | P2          | {(1,1),(3,2),(3,3),(3,4),(4,5)}

What I want to achieve:

Get rid of duplicate entries inside anchor_array.
For each removed entry, also remove the item at the same index from id_array.

Or for the alternative query and output:

Make the tuples distinct with respect to their first entry.

What the result should look like:

groundtruth | p           | anchor_array | id_array
------------------------------------------------------
G1          | P1          | {1,3,4}      | {1,2,5}
G2          | P1          | {1,5,7}      | {6,7,8}
G1          | P2          | {1,3,4}      | {1,2,5}

Or for the alternative query and output:

groundtruth | p           | combined_array
-----------------------------------------------------------
G1          | P1          | {(1,1),(3,2),(4,5)}
G2          | P1          | {(1,6),(5,7),(7,8)}
G1          | P2          | {(1,1),(3,2),(4,5)}

P.S. I have left the randomization out of the example output to keep it readable.

Real result set:

p                                           ; groundtruth                                ; ids
---------------------------------------------------------------------------------------------
"0101000000EE7C3F355EF24F4019390B7BDA011940";"010100000094F6065F98E44F40A930B610E4A01B40";"{"(29,250)","(30,251)","(31,241)","(32,263)","(33,243)","(34,264)","(35,277)"}"
"0101000000EE7C3F355EF24F40809F71E140681940";"010100000094F6065F98E44F40A930B610E4A01B40";"{"(29,250)","(30,251)","(31,257)","(32,276)","(33,272)","(34,264)","(35,249)"}"
"0101000000EE7C3F355EF24F40E605D847A7CE1940";"0101000000EE7C3F355EF24F4019390B7BDA011940";"{"(30,194)","(31,181)","(32,168)","(33,124)","(34,141)","(35,4)"}"
"0101000000EE7C3F355EF24F404C6C3EAE0D351A40";"010100000014D044D8F0DC4F4073BA2C2636DF1C40";"{"(30,281)","(31,278)","(32,297)","(33,284)","(34,294)","(35,303)"}"
"0101000000EE7C3F355EF24F40B3D2A414749B1A40";"0101000000DE9387855AEB4F4062670A9DD7581A40";"{"(30,235)","(31,214)","(32,220)","(33,221)","(34,217)","(35,232)"}"
"0101000000EE7C3F355EF24F4019390B7BDA011B40";"0101000000AF94658863D54F40A7E8482EFF211E40";"{"(27,316)","(31,329)","(32,334)","(33,340)","(34,327)","(35,324)"}"
"0101000000EE7C3F355EF24F40809F71E140681B40";"0101000000DE9387855AEB4F4062670A9DD7581A40";"{"(30,224)","(31,210)","(32,220)","(33,230)","(34,226)","(35,213)"}"
"0101000000EE7C3F355EF24F40E605D847A7CE1B40";"010100000014D044D8F0DC4F4073BA2C2636DF1C40";"{"(30,281)","(31,304)","(32,288)","(33,293)","(34,306)","(35,295)"}"
"0101000000EE7C3F355EF24F404C6C3EAE0D351C40";"010100000094F6065F98E44F40A930B610E4A01B40";"{"(29,250)","(30,256)","(31,257)","(32,271)","(33,254)","(34,260)","(35,277)"}"
"0101000000EE7C3F355EF24F4019390B7BDA011D40";"010100000007F0164850C44F405F46B1DCD24A2040";"{"(31,383)","(32,409)","(33,390)","(34,411)","(35,407)"}"

score 9 · Accepted Answer · edited May 23 '17 at 12:32

Similar to my answer to your preceding question, just with an array of rows as you suggested, and with shorter positional notation (the numbers in DISTINCT ON, GROUP BY and ORDER BY refer to positions in the SELECT list):

SELECT DISTINCT ON (1)
       p, groundtruth, array_agg(ROW(anchor_id, id)) AS ids
FROM (
   SELECT DISTINCT ON (1, 2, 3)
          ps.p, m.groundtruth, m.anchor_id, m.id
   FROM  (SELECT unnest(point_array) AS p) AS ps
   JOIN   measurement m ON ST_DWithin(ps.p, m.groundtruth, distance)
   ORDER  BY 1, 2, 3, random()
   ) x
GROUP  BY 1, 2
ORDER  BY 1, random();
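
To see the effect without PostGIS, the same pattern can be tried on toy data. This is only a minimal sketch, not the query above: the CTE name joined and the labels 'P1', 'G1', 'G2' are made-up stand-ins that mirror the example tables in the question. The outer DISTINCT ON is dropped so that every (p, groundtruth) group stays visible, and random() is replaced by id so the outcome is repeatable.

```sql
-- Toy rows standing in for the result of the ST_DWithin join.
-- 'P1', 'G1', 'G2' are placeholder labels, not real geometries.
WITH joined (p, groundtruth, anchor_id, id) AS (
   VALUES ('P1', 'G1', 1, 1)
        , ('P1', 'G1', 3, 2)
        , ('P1', 'G1', 3, 3)
        , ('P1', 'G1', 3, 4)
        , ('P1', 'G1', 4, 5)
        , ('P1', 'G2', 1, 6)
        , ('P1', 'G2', 5, 7)
        , ('P1', 'G2', 7, 8)
   )
SELECT p, groundtruth
     , array_agg(ROW(anchor_id, id) ORDER BY anchor_id) AS ids
FROM  (
   -- Keep one row per (p, groundtruth, anchor_id).
   -- Sorting by id last makes the lowest id win (assumption for repeatability;
   -- the answer uses random() here instead).
   SELECT DISTINCT ON (p, groundtruth, anchor_id)
          p, groundtruth, anchor_id, id
   FROM   joined
   ORDER  BY p, groundtruth, anchor_id, id
   ) x
GROUP  BY p, groundtruth
ORDER  BY p, groundtruth;
```

With this input the query should return {"(1,1)","(3,2)","(4,5)"} for (P1, G1) and {"(1,6)","(5,7)","(7,8)"} for (P1, G2), which matches the desired output in the question.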

But I like the other version with a 2-dimensional array better.
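
The other version is not reproduced here. As a rough sketch of what a 2-dimensional array result could look like, assuming PostgreSQL 9.5 or later (where array_agg also accepts array input), the pairs can be aggregated as integer arrays instead of rows. The CTE joined and its labels are again hypothetical toy data, and the id tiebreaker is an assumption made for repeatability.

```sql
-- Toy rows; 'P1' and 'G1' are placeholder labels.
WITH joined (p, groundtruth, anchor_id, id) AS (
   VALUES ('P1', 'G1', 1, 1)
        , ('P1', 'G1', 3, 2)
        , ('P1', 'G1', 3, 3)
        , ('P1', 'G1', 4, 5)
   )
SELECT p, groundtruth
       -- Each ARRAY[anchor_id, id] becomes one sub-array of an int[][] value.
     , array_agg(ARRAY[anchor_id, id] ORDER BY anchor_id) AS ids
FROM  (
   SELECT DISTINCT ON (p, groundtruth, anchor_id)
          p, groundtruth, anchor_id, id
   FROM   joined
   ORDER  BY p, groundtruth, anchor_id, id
   ) x
GROUP  BY p, groundtruth;
```

For this input the ids column should come out as {{1,1},{3,2},{4,5}}. Unlike an array of anonymous rows, the elements can then be addressed with plain subscripts such as ids[2][1].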

answered Feb 27 '13 at 02:13

Erwin Brandstetter

ERROR in Line 3: SELECT DISTINCT ON (1,2) ... it's hinting at the `2` ... SELECT DISTINCT ON expressions must match initial ORDER BY expressions – Benjamin M Feb 27 '13 at 02:20
One last question: Is it possible to remove the inner random()? The generated results seem okay, but I'm not 100% sure ;) I'll append a REAL result set at the end of the question. – Benjamin M Feb 27 '13 at 03:23
@BenjaminM: Removing the inner `random()` switches to arbitrary picks for `id` where there are multiple rows per `anchor_id`. If that's good enough, it should be a bit faster still. BTW: positional references, like I use here, only make for shorter syntax, not for faster queries. You probably know that. – Erwin Brandstetter Feb 27 '13 at 03:30
Yeah, I noticed that the results are 'less random'. But with 50 input values I can't see a difference in speed. Thank you sooo much. – Benjamin M Feb 27 '13 at 03:33
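
The variant discussed in these comments, with the inner random() removed, might look like the sketch below. It runs on hypothetical toy data (the CTE joined and its labels are made up for illustration) rather than on the measurement table, so it needs no PostGIS.

```sql
-- Toy rows; 'P1', 'P2', 'G1', 'G2' are placeholder labels.
WITH joined (p, groundtruth, anchor_id, id) AS (
   VALUES ('P1', 'G1', 1, 1)
        , ('P1', 'G1', 3, 2)
        , ('P1', 'G1', 3, 3)
        , ('P1', 'G2', 1, 6)
        , ('P1', 'G2', 5, 7)
        , ('P2', 'G1', 1, 1)
        , ('P2', 'G1', 3, 4)
   )
SELECT DISTINCT ON (1)
       p, groundtruth, array_agg(ROW(anchor_id, id)) AS ids
FROM  (
   SELECT DISTINCT ON (1, 2, 3)
          p, groundtruth, anchor_id, id
   FROM   joined
   ORDER  BY 1, 2, 3        -- no random(): which id survives per anchor_id is arbitrary
   ) x
GROUP  BY 1, 2
ORDER  BY 1, random();      -- still picks a random groundtruth per p
```

The outer random() is untouched, so the choice of groundtruth per p stays random. Only the choice of id within each anchor_id is left to whatever order the planner happens to produce.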
