Performing Multiple Avg Queries in PostgreSQL

Question

I am having some trouble with the following operation: I have a database table called entries which (for all intents and purposes) has 3 columns in addition to the primary key: value, gps_lat, gps_long all of which are doubles.

My ultimate goal is to be able to define a grid, say 100x100 with an interval and bounded by a given latitude and longitude value and for each square of the grid I want to compute the average value of all the points in that grid square. I am having a lot of trouble doing this efficiently however.

Part of the problem is that I want to set this up either as a stored procedure or as a query that I can generate with a piece of code and reuse later because every time I run the query the grid will not be the same (so caching is pretty much out the question).

My first attempt at doing this was to define the following function:

CREATE OR REPLACE FUNCTION gridSquareAverageValue (double precision
             , double precision, double precision, double precision)
RETURNS double precision as $avgValue$
declare
    avgValue double precision;
BEGIN
    SELECT AVG(value) into avgValue FROM entries
    WHERE gps_lat BETWEEN $1 AND $2 AND gps_long BETWEEN $3 AND $4;
    RETURN avgValue;
END;
$avgValue$ LANGUAGE plpgsql;

This function works very well and does exactly what I need it to do, except that it does it for only one grid square. Running the function for a 100x100 grid involves 10,000 individual queries and is therefore inordinately slow.

The next attempt was this:

WITH Grid(lat_offset,long_offset) AS
(SELECT *
 FROM       generate_series(1,10) lat_offset
 CROSS JOIN generate_series(1,10) long_offset)
SELECT AVG(value)
FROM Grid 
JOIN entries 
ON entries.gps_lat BETWEEN 41.79604807005128 + (0.000247908106797 * Grid.lat_offset)
                       AND 41.82083888073101 + (0.002479081067973 * (Grid.lat_offset + 1))
AND entries.gps_long BETWEEN -72.2759199142456 + (0.000527858734131 * Grid.long_offset)
                         AND -72.22313404083252 + (0.005278587341308 * (Grid.long_offset + 1))
GROUP BY lat_offset,long_offset;

This somehow turned out to be even worse. I attempted to generate a sequence of offsets and then join it with the table of entries forcing each entry into a box that is calculated with the math you can see above. This is impossibly slow. I tried to get it to just output the values without computing averages and it took even longer than running 10k individual queries.

The above is also probably the most promising approach because all I really want to do after generating a cartesian join of two series is to use them in a simple function, but I cannot figure out any decent way to do that except what you see above =/

Finally I tried this:

#                                           $1 height $2 width $3 lat start      $4 lat interval   $5 long start      $6 long interval
CREATE OR REPLACE FUNCTION gridAverageValue (integer,  integer, double precision, double precision, double precision, double precision)
RETURNS TABLE (avg double precision) as $restbl$
BEGIN
    SELECT * INTO $restbl$ FROM entries WHERE 1 = 2;
    FOR lat_offset IN 0..$1 LOOP
        FOR long_offset IN 0..$2 LOOP
            INSERT INTO restbl 
            SELECT AVG(value) 
            FROM entries 
            WHERE gps_lat 
            BETWEEN $3 + ($4 * lat_offset) AND $3 + ($4 * (lat_offset + 1)) 
            AND gps_long 
            BETWEEN $5 + ($6 * long_offset) AND $5 + ($6 * (long_offset + 1));
        END LOOP;
    END LOOP;
    RETURN QUERY SELECT * FROM restbl;
END;
$restbl$ LANGUAGE plpgsql;

This final attempt is getting a bunch of syntax errors and I honestly do not know where it is coming from. The general idea is to generate a bunch of queries that ultimately compute the values I care about.

If anyone has a suggestion about how to fix any of the approaches above, that would be greatly appreciated.

About the failed attempt in your last function: it seems you are misunderstanding dollar-quoting: http://stackoverflow.com/questions/12144284/what-are-used-for-in-pl-pgsql/12172353#12172353 — Erwin Brandstetter, Apr 15 '15 at 21:37

score 1 · Accepted Answer · edited Jun 20 '20 at 09:12

1

Only populated cells

Use the built-in function width_bucket() to get only grid cells with one or more matching rows in entries:

For a grid of 100 x 100 cells in the outer frame of box(point(_lat_start, _long_start), point(_lat_end, _long_end)):

SELECT width_bucket(gps_lat , _lat_start , _lat_end , 100) AS grid_lat
     , width_bucket(gps_long, _long_start, _long_end, 100) AS grid_long
     , avg(value) AS avg_val
FROM   entries
WHERE  point(gps_lat, gps_long) <@ box(point(_lat_start, _long_start)
                                     , point(_lat_end  , _long_end))
GROUP  BY 1,2
ORDER  BY 1,2;

<@ is the "contained in" operator for geometric types.

It's easy to wrap this into a function and parameterize the outer box and the number of grid cells.

A multicolumn GiST expression index will help performance if only a small fraction of rows lies within the outer box. You'll need to install the btree_gist module first, once per database:

PostgreSQL EXCLUDE USING error: Data type integer has no default operator class

Then:

CREATE INDEX entries_point_idx ON entries
USING gist (point(gps_lat, gps_long), value);

Adding value to the index only makes sense if you can get an index-only scan out of this in Postgres 9.2+.

If you are reading large parts of the table anyway, you don't need an index and it might be cheaper to run simple a between x and y checks in the WHERE clause.

This is assuming a flat earth (which may be good enough for your purpose). If you want to be precise, you will have to dig deeper into PostGIS.

All cells in the grid

To get all cells use LEFT JOIN to a pre-generated grid like you already tried:

SELECT grid_lat, grid_long, g.avg_val  -- or use COALESCE
FROM        generate_series(1,100) grid_lat
CROSS  JOIN generate_series(1,100) grid_long
LEFT   JOIN (<query from above>) g USING (grid_lat, grid_long)

Aggregating (x,y) coordinate point clouds in PostgreSQL

edited Jun 20 '20 at 09:12

Community

1
1

answered Apr 15 '15 at 20:07

Erwin Brandstetter

605,456
145
1,078
1,228

This looks kind of like the direction I wanted to go, and that's so much for helping out. My only concern is that when I tried to call the first query you suggested (the one with the <@ geometric operator) it tells me: "ERROR: function box(record, record) does not exist LINE 5: WHERE point(gps_lat, gps_lat) <@ box((41.79604807005128, -7... HINT: No function matches the given name and argument types. You might need to add explicit type casts." Do I need to install something in order to get the box function working? – RedHack Apr 15 '15 at 20:52
@RedHack: Ah, right, when I switched from a *box literal* to a box constructed from parameters I forgot to add the `point()` functions. Try the update. It should work out of the box (no pun intended). – Erwin Brandstetter Apr 15 '15 at 20:56
That's for the help, but I think I may be misunderstanding what its trying to do because when I run it, not matter how large I make the box (i.e. I am certain that there are points in there) but it always returns nothing. It may be caused by my misunderstanding the boundaries of the box: I was under the impression that lat_start and long_start should be smaller than lat_end and long_end. Is this correct? – RedHack Apr 15 '15 at 21:09
@RedHack: Yes, correct. I had another typo: `point(gps_lat, gps_lat)` instead of `... long)`, that's fixed already, you didn't stumble across that, did you? – Erwin Brandstetter Apr 15 '15 at 21:11
No I caught that one, but I confused the order of the negative longitude start/end and that fixed it. Thank you so much again, this looks like exactly what I needed. Now (again I apologize for asking too many questions, but if you have the time:) I wanted to ask about the last statement concerning all cells. What do you mean by _all_ cells? Do you mean all cells that could possibly be generated by all the entries in the database? – RedHack Apr 15 '15 at 21:15
@RedHack: All cells in the grid (100x100 in the example), while the first variant only returns cells with matching entries. Nothing outside the bounding box. It's actually very cheap to add the "fillers", so the 2nd query doesn't cost much more than the 1st. Number of grid cells has to match, obviously. – Erwin Brandstetter Apr 15 '15 at 21:18
Oh ok, it makes sense now. The second query just adds fillers in case there is no data? – RedHack Apr 15 '15 at 21:22

Performing Multiple Avg Queries in PostgreSQL

1 Answers1

Only populated cells

All cells in the grid