Select a random entry from a group after grouping by a value (not column)?

Question

I want to write a query using Postgres and PostGIS. I'm also using Rails with rgeo, rgeo-activerecord and activerecord-postgis-adapter, but the Rails stuff is rather unimportant.

The table structure:

measurement
 - int id
 - int anchor_id
 - Point groundtruth
 - data (not important for the query)

Example data:

id | anchor_id | groundtruth | data
-----------------------------------
1  | 1         | POINT(1 4)  | ...
2  | 3         | POINT(1 4)  | ...
3  | 2         | POINT(1 4)  | ...
4  | 3         | POINT(1 4)  | ...
-----------------------------------
5  | 2         | POINT(3 2)  | ...
6  | 4         | POINT(3 2)  | ...
-----------------------------------
7  | 1         | POINT(4 3)  | ...
8  | 1         | POINT(4 3)  | ...
9  | 1         | POINT(4 3)  | ...
10 | 5         | POINT(4 3)  | ...
11 | 3         | POINT(4 3)  | ...

This table is some kind of manually created view for faster lookups (with millions of rows). Else we'd have to join 8 tables and it would get even slower. But that's not part of the problem.

Simple version:

Parameters:

Point p
int d

What the query should do:

1. The query looks for all groundtruth Points which have a distance < d from Point p

SQL for that is pretty easy: WHERE st_distance(groundtruth, p) < d

2. Now we have a list of groundtruth points with their anchor_ids. As you can see in the table above, it is possible to have multiple identical groundtruth-anchor_id tuples. For example: anchor_id=3 and groundtruth=POINT(1 4).

3. Next I'd like to eliminate the identical tuples, by choosing one of them randomly(!). Why not simply take the first? Because the data column is different.

Choosing a random row in SQL: SELECT ... ORDER BY RANDOM() LIMIT 1

My problem with all of this is: I can imagine a solution using SQL LOOPs and lot's of subqueries, but there's for sure a solution using GROUP BY or some other methods which will make it faster.

Full version:

Basically the same as above with one difference: The input parameters change:

lot's of Points p1 ... p312456345
still one d

If the simple query is working, this could be done using a LOOP in SQL. But maybe there is a better (and faster) solution, because the database is really huge!

Solution

WITH ps AS (SELECT unnest(p_array) AS p)
SELECT DISTINCT ON (anchor_id, groundtruth)
    *
FROM measurement m, ps
WHERE EXISTS (
    SELECT 1
    FROM ps
    WHERE st_distance(m.groundtruth, ps.p) < d
)
ORDER BY anchor_id, groundtruth, random();

Thanks to Erwin Brandstetter!

Do you need true randomness, or is an arbitrary pick good enough? And do you want the result ordered in an way (like "closest first")? — Erwin Brandstetter, Feb 26 '13 at 14:20
arbitrary is enough, i think. Does it make a difference within the query? — Benjamin M, Feb 26 '13 at 14:22
Yeah sure it is. But if it's just using `ARBITRATY_RANDOM()` instead of `RANDOM()` I could experiment with it and see if it has such a great impact as expected. — Benjamin M, Feb 26 '13 at 14:25
The point is, that a query like I supplied in my answer automatically makes an arbitrary pick and I don't need to employ an explicit function, which is much cheaper. But it is not truely random. — Erwin Brandstetter, Feb 26 '13 at 14:26
It's not random at all. I've executed the query 10 times and the result set is always identical. — Benjamin M, Feb 26 '13 at 14:28
That's to be expected, and that's what the term "arbitrary" allows for. PostgreSQL just picks the most convenient row, which will lead to the same result as long as the underlying table doesn't change. — Erwin Brandstetter, Feb 26 '13 at 14:30

Erwin Brandstetter · Accepted Answer · 2019-01-11T17:51:10.050

1

To eliminate duplicates, this might be the most efficient query in PostgreSQL:

SELECT DISTINCT ON (anchor_id, groundtruth) *
FROM   measurement
WHERE  st_distance(p, groundtruth) < d

More about this query style:

Select first row in each GROUP BY group?

As mentioned in the comments this gives you an arbitrary pick. If you need random, somewhat more expensive:

SELECT DISTINCT ON (anchor_id, groundtruth) *
FROM   measurement
WHERE  st_distance(p, groundtruth) < d
ORDER  BY anchor_id, groundtruth, random()

The second part is harder to optimize. EXISTS semi-join will probably be the fastest choice. For a given table ps (p point):

SELECT DISTINCT ON (anchor_id, groundtruth) *
FROM   measurement m
WHERE  EXISTS (
   SELECT 1
   FROM   ps
   WHERE  st_distance(ps.p, m.groundtruth) < d
   )
ORDER  BY anchor_id, groundtruth, random();

This can stop evaluating as soon as one p is close enough and it keeps the rest of the query simple.

Be sure to back that up with a matching GiST index.

If you have an array as input, create a CTE with unnest() on the fly:

WITH ps AS (SELECT unnest(p_array) AS p)
SELECT ...

Update according to comment

If you only need a single row as answer, you can simplify:

WITH ps AS (SELECT unnest(p_array) AS p)
SELECT *
FROM   measurement m
WHERE  EXISTS (
   SELECT 1
   FROM   ps
   WHERE  st_distance(ps.p, m.groundtruth) < d
   )
LIMIT  1;

Faster with `ST_DWithin()`

Probably more efficient with the function ST_DWithin() (and a matching GiST index!).
To get one row (using a sub-select instead of a CTE here):

SELECT *
FROM   measurement m
JOIN  (SELECT unnest(p_array) AS p) ps ON ST_DWithin(ps.p, m.groundtruth, d)
LIMIT  1;

To get one row for every point p within distance d:

SELECT DISTINCT ON (ps.p) *
FROM   measurement m
JOIN  (SELECT unnest(p_array) AS p) ps ON ST_DWithin(ps.p, m.groundtruth, d)

Adding ORDER BY random() will make this query more expensive. Without random(), Postgres can just pick the first matching row from the GiST index. Else all possible matches have to be retrieved and ordered randomly.

BTW, LIMIT 1 inside EXISTS is pointless. Read the manual at the link I provided or this related question.

edited Jan 11 '19 at 17:51

answered Feb 26 '13 at 14:25

Erwin Brandstetter

605,456
145
1,078
1,228

But that doesn't give me a random anchor-groundtruth-tuple – Benjamin M Feb 26 '13 at 14:26
Yeah sorry, I've overlooked the `*` :D ... Now what about the `Full version`? Maybe I could do something with `IN (p_array)`? – Benjamin M Feb 26 '13 at 14:40
Something important for the full version: The `SELECT` should also return the input values. And(!) for two input values `p1` and `p2` it is possible to return the same randomly chosen anchor-groundtruth-tuple. – Benjamin M Feb 26 '13 at 14:43
Sorry. I don't understand that `FROM ps` part. There is no table for those points. They are user input :-/ Or does your query create some kind of virtual table? – Benjamin M Feb 26 '13 at 14:54
Sorry for disturbing you again. **I misunderstood my own problem** ... **What I really need:** Input: p_array and d. Then: Select one random groundtruth that matches `st_distance(m.groundtruth, ps.p) < d`. Next select distinct anchor_id randomly. I'll try to figure it out myself. I'm really sorry... – Benjamin M Feb 26 '13 at 15:30
`LIMIT 1` inside the `EXISTS ( SELECT` doesn't do the trick. Maybe I have to say it in other words: For every input p, I want to receive one(!) random groundtruth within distance d. And for this groundtruth value the table in my question is valid: there are multiple identical groundtruth-anchor_id-tuples. and of every of these tuples I wand to choose one randomly. – Benjamin M Feb 26 '13 at 16:52
I've created another answer to my own question with a query that does exactly what I need, but is damn slow :( – Benjamin M Feb 26 '13 at 17:05
@BenjaminM: I added yet another answer. `ST_DWithin()` looks promising. And be sure to use a GiST index for that. – Erwin Brandstetter Feb 26 '13 at 17:18
Your query has an error. It has to be `JOIN (SELECT unnest(p_array) AS p) ps`. But it also doesn't work. It just gives me one single row with `id=1`, but it should give me 36 rows. .. because you used `LIMIT 1` ;) Sorry. First think, then read ... i should maybe take a nap. Btw: I have no control over the database, I have to ask my professor to change it... – Benjamin M Feb 26 '13 at 17:25
The `LIMIT 1` thingy isn't doing it. If `p_array` contains two values, then I loose one value in the result. Or maybe I'm just not getting what you want to achieve with it. ?! – Benjamin M Feb 26 '13 at 17:36
You wrote "Select **one** random groundtruth that matches", not "one random groundtruth *for every p* that matches". Out of time for today. Best of luck, I may come back later. – Erwin Brandstetter Feb 26 '13 at 17:52
I wrote "For every input p, I want to receive one(!) random groundtruth within distance d" ;) But okay, I'll keep on trying. At least I have now a query that's working. Slow, but working. Thank you :) – Benjamin M Feb 26 '13 at 17:56
@BenjaminM: I was referring to the comment with the bold emphasis. Added a *final* update for your *yet again* updated requirement. If you are still not there, open a new question with a consolidated case. – Erwin Brandstetter Feb 26 '13 at 23:32
I assume that it's the SQL query at the bottom of your answer: It's not working at all. For each `p` it returns lots of different `groundtruth`. – Benjamin M Feb 26 '13 at 23:54
@BenjaminM: The `DISTINCT ON` clause had an error. Or so I think. It's really unclear what you rally want. – Erwin Brandstetter Feb 26 '13 at 23:59
I've discovered ARRAY_AGG() and so I've built a new query which is at the bottom of my own answer. It's incredible fast (15ms with 4 input values), but it's still missing some filters. It shouldn't be too hard to get it working (i hope). – Benjamin M Feb 27 '13 at 00:10

Benjamin M · Answer 2 · 2013-02-27T00:08:16.037

I now cracked it, but the query is pretty slow...

WITH
  ps AS (
    SELECT unnest(p_array)
    ) AS p
  ),

  gtps AS (
    SELECT DISTINCT ON(ps.p)
      ps.p, m.groundtruth
    FROM measurement m, ps
    WHERE st_distance(m.groundtruth, ps.p) < d
    ORDER BY ps.p, RANDOM()
  )

SELECT DISTINCT ON(gtps.p, gtps.groundtruth, m.anchor_id)
  m.id, m.anchor_id, gtps.groundtruth, gtps.p
FROM measurement m, gtps
ORDER BY gtps.p, gtps.groundtruth, m.anchor_id, RANDOM()

My test database contains 22000 rows and I gave it two input values and it takes about 700ms. At the end there can be hundreds of input values :-/

The result now looks like this:

id  | anchor_id | groundtruth | p
-----------------------------------------
20  | 1         | POINT(0 2)  | POINT(1 0)
14  | 3         | POINT(0 2)  | POINT(1 0)
5   | 8         | POINT(0 2)  | POINT(1 0)
42  | 2         | POINT(4 1)  | POINT(2 2)
11  | 3         | POINT(4 8)  | POINT(4 8)
4   | 6         | POINT(4 8)  | POINT(4 8)
1   | 1         | POINT(6 2)  | POINT(7 3)
9   | 5         | POINT(6 2)  | POINT(7 3)
25  | 3         | POINT(6 2)  | POINT(9 1)
13  | 6         | POINT(6 2)  | POINT(9 1)
18  | 7         | POINT(6 2)  | POINT(9 1)

NEW:

SELECT
  m.groundtruth, ps.p, ARRAY_AGG(m.anchor_id), ARRAY_AGG(m.id)
FROM
  measurement m
JOIN
  (SELECT unnest(point_array) AS p) AS ps
  ON ST_DWithin(ps.p, m.groundtruth, 0.5)
GROUP BY groundtruth, ps.p

Actual result:

p           | groundtruth | anchor_arr | id_arr
--------------------------------------------------
P1          | G1          | {1,3,2,..} | {9,8,11,..}
P1          | G2          | {4,3,5,..} | {1,8,23,..}
P1          | G3          | {6,8,9,..} | {12,7,6,..}
P2          | G1          | {6,6,2,..} | {15,2,10,..}
P2          | G4          | {7,9,1,..} | {5,4,3,..}
...         | ...         | ...        | ...

So at the moment I get:

every distinct inputValue-groundtruth-tuple
for every tuple I get an array with all anchor_id corresponding the groundtruth part of the tuple
and an array of all ids corresponding the groundtruth-anchor_id relation

Remember:

two input values can 'select' the same groundtruth
a single groundtruth value can have multiple identical anchor_ids
each groundtruth-anchor_id-tuple has a distinct id

So what's missing for completion?:

I just need a random row for each ps.p
The two arrays belong to each other. Means: the order of the items inside is important!
Those two arrays need to get filtered (randomly):
- For each anchor_id in the array that appears more than once: keep a random one and delete all other. This also means to remove the corresponding id from the id-array for every deleted anchor_id

I've now removed both `RANDOM()` statements and the result isn't getting much better: 530ms Vs. 670ms. So I don't think that this is my biggest problem ;) But if I reduce my input array from 2 to 1 value it's getting over(!) twice as fast: 270ms Vs. 670ms. — Benjamin M, Feb 26 '13 at 18:00
If you want performance, you need a GiST index. Without index, Postgres has to calculate the distance to every `p` for every single row in the table. I am talking orders of magnitude here. — Erwin Brandstetter, Feb 26 '13 at 23:10
I have now set a GIST index on `groudtruth` and a btree index on `anchor_id`. But it's not getting faster. I have to use another query somehow... — Benjamin M, Feb 26 '13 at 23:13
I'm now looking at another way of retrieving the data. Actually it's pretty fast, but not doing the randomization and not doing the distinct within `anchor_id`. I'll post it at the bottom of my answer here. — Benjamin M, Feb 26 '13 at 23:31
If the GiST index isn't used, chances are, you are not doing it right. Test with `EXPLAIN ANALYZE` and read the PostGis manual on [`ST_DWithin()`](http://postgis.refractions.net/documentation/manual-svn/ST_DWithin.html) and [Gist indexes](http://postgis.net/docs/manual-2.0/using_postgis_dbmanagement.html#gist_indexes). To make this clear: it's discouraged to change the nature of a question after an answer has been given. You should open a new question in such a case. Best of luck. — Erwin Brandstetter, Feb 27 '13 at 00:08
I use gist index on `groundtruth`. As said above this new query runs 4 input values in 15ms! :) It just need the filters and randomization. — Benjamin M, Feb 27 '13 at 00:14

Select a random entry from a group after grouping by a value (not column)?

Simple version:

Full version:

Solution

2 Answers2

Update according to comment

Faster with `ST_DWithin()`

NEW:

Linked

Select a random entry from a group after grouping by a value (not column)?

Simple version:

Full version:

Solution

2 Answers2

Update according to comment

Faster with ST_DWithin()

NEW:

Linked

Faster with `ST_DWithin()`