I am having some trouble with the following operation: I have a database table called entries which (for all intents and purposes) has 3 columns in addition to the primary key: value
, gps_lat
, gps_long
all of which are doubles.
My ultimate goal is to be able to define a grid, say 100x100 with an interval and bounded by a given latitude and longitude value and for each square of the grid I want to compute the average value of all the points in that grid square. I am having a lot of trouble doing this efficiently however.
Part of the problem is that I want to set this up either as a stored procedure or as a query that I can generate with a piece of code and reuse later because every time I run the query the grid will not be the same (so caching is pretty much out the question).
My first attempt at doing this was to define the following function:
CREATE OR REPLACE FUNCTION gridSquareAverageValue (double precision
, double precision, double precision, double precision)
RETURNS double precision as $avgValue$
declare
avgValue double precision;
BEGIN
SELECT AVG(value) into avgValue FROM entries
WHERE gps_lat BETWEEN $1 AND $2 AND gps_long BETWEEN $3 AND $4;
RETURN avgValue;
END;
$avgValue$ LANGUAGE plpgsql;
This function works very well and does exactly what I need it to do, except that it does it for only one grid square. Running the function for a 100x100 grid involves 10,000 individual queries and is therefore inordinately slow.
The next attempt was this:
WITH Grid(lat_offset,long_offset) AS
(SELECT *
FROM generate_series(1,10) lat_offset
CROSS JOIN generate_series(1,10) long_offset)
SELECT AVG(value)
FROM Grid
JOIN entries
ON entries.gps_lat BETWEEN 41.79604807005128 + (0.000247908106797 * Grid.lat_offset)
AND 41.82083888073101 + (0.002479081067973 * (Grid.lat_offset + 1))
AND entries.gps_long BETWEEN -72.2759199142456 + (0.000527858734131 * Grid.long_offset)
AND -72.22313404083252 + (0.005278587341308 * (Grid.long_offset + 1))
GROUP BY lat_offset,long_offset;
This somehow turned out to be even worse. I attempted to generate a sequence of offsets and then join it with the table of entries forcing each entry into a box that is calculated with the math you can see above. This is impossibly slow. I tried to get it to just output the values without computing averages and it took even longer than running 10k individual queries.
The above is also probably the most promising approach because all I really want to do after generating a cartesian join of two series is to use them in a simple function, but I cannot figure out any decent way to do that except what you see above =/
Finally I tried this:
# $1 height $2 width $3 lat start $4 lat interval $5 long start $6 long interval
CREATE OR REPLACE FUNCTION gridAverageValue (integer, integer, double precision, double precision, double precision, double precision)
RETURNS TABLE (avg double precision) as $restbl$
BEGIN
SELECT * INTO $restbl$ FROM entries WHERE 1 = 2;
FOR lat_offset IN 0..$1 LOOP
FOR long_offset IN 0..$2 LOOP
INSERT INTO restbl
SELECT AVG(value)
FROM entries
WHERE gps_lat
BETWEEN $3 + ($4 * lat_offset) AND $3 + ($4 * (lat_offset + 1))
AND gps_long
BETWEEN $5 + ($6 * long_offset) AND $5 + ($6 * (long_offset + 1));
END LOOP;
END LOOP;
RETURN QUERY SELECT * FROM restbl;
END;
$restbl$ LANGUAGE plpgsql;
This final attempt is getting a bunch of syntax errors and I honestly do not know where it is coming from. The general idea is to generate a bunch of queries that ultimately compute the values I care about.
If anyone has a suggestion about how to fix any of the approaches above, that would be greatly appreciated.