
Related, preceding question:
Select a random entry from a group after grouping by a value (not column)?

My current query looks like this:

WITH
  points AS (
    SELECT unnest(array_of_points) AS p
  ),

  gtps AS (
    SELECT DISTINCT ON(points.p)
      points.p, m.groundtruth
    FROM measurement m, points
    WHERE st_distance(m.groundtruth, points.p) < distance
    ORDER BY points.p, RANDOM()
  )

SELECT DISTINCT ON(gtps.p, gtps.groundtruth, m.anchor_id)
  m.id, m.anchor_id, gtps.groundtruth, gtps.p
FROM measurement m, gtps
ORDER BY gtps.p, gtps.groundtruth, m.anchor_id, RANDOM()

Semantics:

  1. there are two input values:

    • Line 4: an array of Points array_of_points
    • Line 12: a double precision number: distance
  2. First paragraph (lines 1-6):

    • Create a table from the points array for use in...
  3. Second paragraph (lines 8-14):

    • For each point inside the points table: get a random(!) groundtruth point from the measurement table whose distance to that point is less than distance (an index-friendly sketch of this step follows the list)
    • Save those tuples inside the gtps table
  4. Third paragraph (lines 16-19):

    • For each groundtruth value inside the gtps table: get all anchor_id values and...
    • If an anchor_id value appears more than once for a groundtruth: choose one of its rows at random
  5. Output: id, anchor_id, groundtruth, p (input value from the array_of_points)
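
For reference: the distance filter in step 3 can also be written with ST_DWithin, which can use a spatial index instead of computing st_distance for every row. This is only a sketch of the gtps step under that assumption (it presumes a GiST index on measurement.groundtruth, which isn't part of the question):

-- Sketch only: same semantics as the gtps CTE, but with ST_DWithin.
-- Assumes (not shown above):
--   CREATE INDEX ON measurement USING gist (groundtruth);
SELECT DISTINCT ON (points.p)
       points.p, m.groundtruth
FROM   points
JOIN   measurement m ON ST_DWithin(m.groundtruth, points.p, distance)
ORDER  BY points.p, RANDOM()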

Example table:

id | anchor_id | groundtruth | data
-----------------------------------
1  | 1         | POINT(1 4)  | ...
2  | 3         | POINT(1 4)  | ...
3  | 8         | POINT(1 4)  | ...
4  | 6         | POINT(1 4)  | ...
-----------------------------------
5  | 2         | POINT(3 2)  | ...
6  | 4         | POINT(3 2)  | ...
-----------------------------------
7  | 1         | POINT(4 3)  | ...
8  | 1         | POINT(4 3)  | ...
9  | 6         | POINT(4 3)  | ...
10 | 7         | POINT(4 3)  | ...
11 | 3         | POINT(4 3)  | ...
-----------------------------------
12 | 1         | POINT(6 2)  | ...
13 | 5         | POINT(6 2)  | ...

Example result:

id  | anchor_id | groundtruth | p
-----------------------------------------
1   | 1         | POINT(1 4)  | POINT(1 0)
2   | 3         | POINT(1 4)  | POINT(1 0)
4   | 6         | POINT(1 4)  | POINT(1 0)
3   | 8         | POINT(1 4)  | POINT(1 0)
5   | 2         | POINT(3 2)  | POINT(2 2)
6   | 4         | POINT(3 2)  | POINT(2 2)
1   | 1         | POINT(1 4)  | POINT(4 8)
2   | 3         | POINT(1 4)  | POINT(4 8)
4   | 6         | POINT(1 4)  | POINT(4 8)
3   | 8         | POINT(1 4)  | POINT(4 8)
12  | 1         | POINT(6 2)  | POINT(7 3)
13  | 5         | POINT(6 2)  | POINT(7 3)
1   | 1         | POINT(4 3)  | POINT(9 1)
11  | 3         | POINT(4 3)  | POINT(9 1)
9   | 6         | POINT(4 3)  | POINT(9 1)
10  | 7         | POINT(4 3)  | POINT(9 1)

As you can see:

  • Each input value can have multiple equal groundtruth values.
  • If an input value has multiple groundtruth values, those must all be equal.
  • Each groundtruth-inputPoint-tuple is connected with every possible anchor_id for that groundtruth.
  • Two different input values can have the same corresponding groundtruth value.
  • Two distinct groundtruth-inputPoint-tuples can have the same anchor_id.
  • Two identical groundtruth-inputPoint-tuples must have different anchor_ids.

Benchmarks (for two input values):

  • Lines 1-6: 16ms
  • Lines 8-14: 48ms
  • Lines 16-19: 600ms

EXPLAIN VERBOSE:

Unique  (cost=11119.32..11348.33 rows=18 width=72)
  Output: m.id, m.anchor_id, gtps.groundtruth, gtps.p, (random())
  CTE points
    ->  Result  (cost=0.00..0.01 rows=1 width=0)
          Output: unnest('{0101000000EE7C3F355EF24F4019390B7BDA011940:01010000003480B74082FA44402CD49AE61D173C40}'::geometry[])
  CTE gtps
    ->  Unique  (cost=7659.95..7698.12 rows=1 width=160)
          Output: points.p, m.groundtruth, (random())
          ->  Sort  (cost=7659.95..7679.04 rows=7634 width=160)
                Output: points.p, m.groundtruth, (random())
                Sort Key: points.p, (random())
                ->  Nested Loop  (cost=0.00..6565.63 rows=7634 width=160)
                      Output: points.p, m.groundtruth, random()
                      Join Filter: (st_distance(m.groundtruth, points.p) < m.distance)
                      ->  CTE Scan on points  (cost=0.00..0.02 rows=1 width=32)
                            Output: points.p
                      ->  Seq Scan on public.measurement m  (cost=0.00..535.01 rows=22901 width=132)
                            Output: m.id, m.anchor_id, m.tag_node_id, m.experiment_id, m.run_id, m.anchor_node_id, m.groundtruth, m.distance, m.distance_error, m.distance_truth, m."timestamp"
  ->  Sort  (cost=3421.18..3478.43 rows=22901 width=72)
        Output: m.id, m.anchor_id, gtps.groundtruth, gtps.p, (random())
        Sort Key: gtps.p, gtps.groundtruth, m.anchor_id, (random())
        ->  Nested Loop  (cost=0.00..821.29 rows=22901 width=72)
              Output: m.id, m.anchor_id, gtps.groundtruth, gtps.p, random()
              ->  CTE Scan on gtps  (cost=0.00..0.02 rows=1 width=64)
                    Output: gtps.p, gtps.groundtruth
              ->  Seq Scan on public.measurement m  (cost=0.00..535.01 rows=22901 width=8)
                    Output: m.id, m.anchor_id, m.tag_node_id, m.experiment_id, m.run_id, m.anchor_node_id, m.groundtruth, m.distance, m.distance_error, m.distance_truth, m."timestamp"

EXPLAIN ANALYZE:

Unique  (cost=11119.32..11348.33 rows=18 width=72) (actual time=548.991..657.992 rows=36 loops=1)
  CTE points
    ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.004..0.011 rows=2 loops=1)
  CTE gtps
    ->  Unique  (cost=7659.95..7698.12 rows=1 width=160) (actual time=133.416..146.745 rows=2 loops=1)
          ->  Sort  (cost=7659.95..7679.04 rows=7634 width=160) (actual time=133.415..142.255 rows=15683 loops=1)
                Sort Key: points.p, (random())
                Sort Method: external merge  Disk: 1248kB
                ->  Nested Loop  (cost=0.00..6565.63 rows=7634 width=160) (actual time=0.045..46.670 rows=15683 loops=1)
                      Join Filter: (st_distance(m.groundtruth, points.p) < m.distance)
                      ->  CTE Scan on points  (cost=0.00..0.02 rows=1 width=32) (actual time=0.007..0.020 rows=2 loops=1)
                      ->  Seq Scan on measurement m  (cost=0.00..535.01 rows=22901 width=132) (actual time=0.013..3.902 rows=22901 loops=2)
  ->  Sort  (cost=3421.18..3478.43 rows=22901 width=72) (actual time=548.989..631.323 rows=45802 loops=1)
        Sort Key: gtps.p, gtps.groundtruth, m.anchor_id, (random())
        Sort Method: external merge  Disk: 4008kB
        ->  Nested Loop  (cost=0.00..821.29 rows=22901 width=72) (actual time=133.449..166.294 rows=45802 loops=1)
              ->  CTE Scan on gtps  (cost=0.00..0.02 rows=1 width=64) (actual time=133.420..146.753 rows=2 loops=1)
              ->  Seq Scan on measurement m  (cost=0.00..535.01 rows=22901 width=8) (actual time=0.014..4.409 rows=22901 loops=2)
Total runtime: 834.626 ms

In production this will run with about 100-1000 input values, so at the current rate it would take 35 to 350 seconds, which is far too much.

I already tried removing the RANDOM() calls. That decreases the runtime (for 2 input values) from about 670ms to about 530ms, so RANDOM() isn't the main cost at the moment.

It's also possible to run 2 or 3 separate queries and do some of the work in application code (it's running on a Ruby on Rails server) if that's easier or faster, for example the random selection.

Work in progress:

SELECT
  m.groundtruth, ps.p, ARRAY_AGG(m.anchor_id), ARRAY_AGG(m.id)
FROM
  measurement m
JOIN
  (SELECT unnest(point_array) AS p) AS ps
  ON ST_DWithin(ps.p, m.groundtruth, distance)
GROUP BY groundtruth, ps.p

With this query it is extremely fast (15ms), but a lot is still missing:

  • I just need a random row for each ps.p
  • The two arrays belong together, i.e. the order of the items inside them is important!
  • Those two arrays need to be filtered (randomly):
    For each anchor_id that appears more than once in the array: keep a random one and delete all the others. This also means removing the corresponding id from the id array for every deleted anchor_id.

It would also be nice if anchor_id and id could be stored inside an array of tuples. For example: {[4,1],[6,3],[4,2],[8,5],[4,4]} (constraints: every tuple is unique, every id (== 2nd value in the example) is unique, anchor_ids are not unique). This example displays the query without the filters that still must be applied. With the filters applied, it would look like this {[6,3],[4,4],[8,5]}.
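
For illustration only: on PostgreSQL 9.5 or later, array_agg can aggregate arrays directly, so the unfiltered pair structure could be built like this (point_array and distance are the same placeholders as above):

-- Sketch only: needs PostgreSQL 9.5+, where array_agg(anyarray) exists.
-- Builds a 2-D array of [anchor_id, id] pairs per (groundtruth, p),
-- still without the random de-duplication described above.
SELECT
  m.groundtruth, ps.p, ARRAY_AGG(ARRAY[m.anchor_id, m.id]) AS pairs
FROM
  measurement m
JOIN
  (SELECT unnest(point_array) AS p) AS ps
  ON ST_DWithin(ps.p, m.groundtruth, distance)
GROUP BY m.groundtruth, ps.p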

Work in progress II:

SELECT DISTINCT ON (ps.p)
  m.groundtruth, ps.p, ARRAY_AGG(m.anchor_id), ARRAY_AGG(m.id)
FROM
  measurement m
JOIN
  (SELECT unnest(point_array) AS p) AS ps
  ON ST_DWithin(ps.p, m.groundtruth, distance)
GROUP BY ps.p, m.groundtruth
ORDER BY ps.p, RANDOM()

This now gives pretty nice results and is still very fast: 16ms.
There's just one thing left to do:

  • ARRAY_AGG(m.anchor_id) is already randomized, but:
  • it contains lots of duplicate entries, so:
  • I'd like to use something like DISTINCT on it, but:
  • it has to be synchronized with ARRAY_AGG(m.id). This means:
    If the DISTINCT step keeps indices 1, 4 and 7 of the anchor_id array, then it also has to keep indices 1, 4 and 7 of the id array (and of course delete all the others)
  • Please run an `explain analyze` to see where the time is actually spent. –  Feb 26 '13 at 21:08
  • done. output is beneath the `explain verbose` – Benjamin M Feb 26 '13 at 21:12
  • Always explicitly specify your joins, even `CROSS JOIN`s - however, you may be able to correlate this if you can pre-calc the bounds of the coordinates the maximum distance away (include with `points`). The outer query is -probably- misleading in the way it uses a cross-join - why isn't it based on the point previously collected? I'm not sure that ordering your results with `RANDOM()` is really doing what you want, but am unsure of the inadvisability... – Clockwork-Muse Feb 26 '13 at 21:30
  • Concerning `RANDOM()`: The output is exactly what I want. Quote: "*The outer query is -probably- misleading in the way it uses a cross-join - why isn't it based on the point previously collected?*" What do you mean? I just didn't know how to fit all thing in a single query, so I decided to just write down the thing in the order I wanted them to happen... – Benjamin M Feb 26 '13 at 21:35
  • I've started another question, because the problem is now very specific concerning postgres and array_agg: http://stackoverflow.com/questions/15102630/distinct-with-two-array-agg-or-one-array-agg-with-tuple-inside – Benjamin M Feb 27 '13 at 01:31

1 Answer


It would also be nice if anchor_id and id could be stored inside an array of tuples.

Aggregate function for multi-dimensional arrays

I suggest you create a two-dimensional array for that. That's easier to handle than an array of records. Standard array_agg() cannot aggregate multi-dimensional arrays, but you can write your own aggregate function for that rather easily:

CREATE AGGREGATE array_agg_mult (anyarray)  (
    SFUNC     = array_cat
   ,STYPE     = anyarray
   ,INITCOND  = '{}'
);
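
A quick usage example with made-up values, to show what the aggregate produces: each row is wrapped as a 1 x 2 slice and array_cat appends the slices along the first dimension:

SELECT array_agg_mult(ARRAY[ARRAY[anchor_id, id]]) AS pairs
FROM  (VALUES (1, 10), (3, 11), (8, 12)) AS t(anchor_id, id);
-- pairs = {{1,10},{3,11},{8,12}}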

Read the explanation in this related answer:
Selecting data into a Postgres array

For each anchor_id in the array that appears more than once: keep a random one and delete all other. This also means to remove the corresponding id from the id-array for every deleted anchor_id

Query

SELECT DISTINCT ON (p)
       p, groundtruth, array_agg_mult(ARRAY[ARRAY[anchor_id, id]]) AS ids
FROM (
   SELECT DISTINCT ON (ps.p, m.groundtruth, m.anchor_id)
          ps.p, m.groundtruth, m.anchor_id, m.id
   FROM  (SELECT unnest(point_array) AS p) AS ps
   JOIN   measurement m ON ST_DWithin(ps.p, m.groundtruth, distance)
   ORDER  BY ps.p, m.groundtruth, m.anchor_id, random()
   ) x
GROUP  BY p, groundtruth
ORDER  BY p, random();
  • Subquery x gets distinct anchor_id per (p, groundtruth) and picks a random row if there are multiple peers. This way the connection anchor_id - id stays intact.

  • The outer query aggregates a 2-dimensional array like you wished for, ordered by anchor_id. If you want to have anchor_id ordered randomly, use random once more:

    array_agg_mult(ARRAY[ARRAY[anchor_id, id]] ORDER BY random())
    
  • And finally, the DISTINCT ON picks only 1 groundtruth per p, randomly again.
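
To illustrate the shape of the result with the example data from the question: for the input point POINT(1 0), whose matching groundtruth there is POINT(1 4), one possible output row would be (pair order is random):

 p          | groundtruth | ids
------------+-------------+---------------------------
 POINT(1 0) | POINT(1 4)  | {{1,1},{3,2},{8,3},{6,4}}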

  • Thanks again Erwin :) Your query has a little problem: the result set contains some `p` two (or even three) times. And the number of rows changes with every call. – Benjamin M Feb 27 '13 at 02:13
  • @BenjaminM: I see. I think the update should take care of that. – Erwin Brandstetter Feb 27 '13 at 02:20
  • Now I get one row per `p` as expected, **but** the array just contains one value. But it should contain all tuples where `anchor_id` is distinct. **Example:** Wrong: `{(1,2),(2,3),(1,4)}`, Right: `{(1,2),(2,3)}`. – Benjamin M Feb 27 '13 at 02:24
  • @BenjaminM: Tricky, tricky, tricky. I think I fixed that, too, now. – Erwin Brandstetter Feb 27 '13 at 02:31
  • Hehe. But the answer is **NO**. Still the same: number of results is correct, but only one tuple per array :( **Btw:** This version of the query is damn slow: 350ms (vs. 20ms) – Benjamin M Feb 27 '13 at 02:35
  • @BenjaminM: Unfortunately I don't have an installation of PostGIS available. Normally I test before I post, but most of what I posted for you today is off the top of my head, basically. Anyway, I want to see white smoke out of your chimney now, after my (final?) update. – Erwin Brandstetter Feb 27 '13 at 02:46
  • 1
    **HELL YEAH!** I thing you've got it! 4 input values. 4 output values. p is distinct. groundtruth isn't distinct. first value in each tuple is distinct. and it takes 17ms. GREAT!! :) – Benjamin M Feb 27 '13 at 02:58
  • @BenjaminM: Hooray! And it's an SQL beauty! Good things come to those who wait. :) – Erwin Brandstetter Feb 27 '13 at 03:00
  • 1
    And it seems it doesn't matter how many input values I give it: For 4 values it takes 17ms. And for 10 it takes 34ms and for 25 it also takes 34ms. That's pretty amazing!! P.S. 50 input values take 85ms. It think it can't go much faster ;) P.P.S. my fast solution from above (without filtering anchor_id) takes 82ms for 50 values. So you've done your job best!!! – Benjamin M Feb 27 '13 at 03:04