Assign point with largest population from set of points contained in a polygon

Question

I'm trying to add a "population-based-centroid" column to a series of U.S. county polygons, where the location is not based on the geographic centroid of the polygon, but rather on the location of the geonames populated place with the largest population. For example, I want to assign the geometry of the arrow-indicated point (point diameter = population) to the selected polygon's population-based-centroid column:

enter image description here

I've tested this query, and it returns the correct geometry for any given polygon (Boston's Suffolk County, for example):

SELECT g1.the_geom
FROM counties c1
JOIN geonames g1
ON ST_Contains(c1.the_geom, g1.the_geom)
WHERE c1.name = 'Suffolk County, MA'
ORDER BY g1.population DESC
LIMIT 1;

However, I'm dealing with ~4000 polygons, and when I try to use the query in an UPDATE function like this it hangs indefinitely (or at least far longer than it should for this number of features):

UPDATE counties
    SET the_geom_popcentroid = (
        SELECT g1.the_geom
        FROM counties c1
        JOIN geonames g1
        ON ST_Contains(c1.the_geom, g1.the_geom)
        ORDER BY g1.population DESC
        LIMIT 1
    );

Where have I nested this UPDATE function incorrectly?

Yeah, that's the point. I'm trying to find the point with the biggest population in each of about 4000 polygon features. Is there a more efficient way? Maybe a loop of some kind? — Bill Morris, Jul 28 '14 at 16:25
I'm not familiarized with the ST_Contains function, maybe if you add indexes on these columns `c1.the_geom, g1.the_geom` it will improve your query. Run an execution plan of your update query. Also, maybe you can repost this on http://dba.stackexchange.com maybe you get more attention. — Jorge Campos, Jul 28 '14 at 16:39
I think the previous point about updating the whole country was the relevant bit. You need a group by counties in the inner query and an ID for each county so you update each one separately. — John Powell, Jul 28 '14 at 16:43

score 3 · Answer 1 · edited May 23 '17 at 10:30

3

On closer inspection: Since there is no connection between each individual row in the outer table to the result of the correlated subquery, you get a constant value for every row. What's puzzling: This should not be slow, but rather surprisingly fast. Also completely incorrect. To fix the query you have:

UPDATE counties c
SET    the_geom_popcentroid = (
        SELECT g.the_geom
        FROM   geonames g
        WHERE  ST_Contains(c.the_geom, g.the_geom)
        ORDER  BY g.population DESC
        LIMIT  1
        );

This updates all counties. If a county should not contain any geoname at all, the_geom_popcentroid is set to NULL.

This alternative version with JOIN syntax only updates counties that contain at least one geoname:

UPDATE counties c
SET    the_geom_popcentroid = sub.the_geom
FROM (
    SELECT DISTINCT ON (c1.pk)
           c1.pk, g1.the_geom
    FROM   counties c1
    JOIN   geonames g1 ON ST_Contains(c1.the_geom, g1.the_geom)
    ORDER  BY c1.pk, g1.population DESC
    ) sub
WHERE c.pk = sub.pk;

Where pk is the primary key column (or any unique column) of counties.

Explanation for DISTINCT ON:

Select first row in each GROUP BY group?

Not sure which is faster. The combination of ST_Contains() with ORDER BY third_column LIMIT n can be tricky while trying to use indexes on the two the_geom columns. Test with EXPLAIN ANALYZE if it matters.

Sometimes a LATERAL JOIN can help to convince Postgres to use indexes. Related question:

Indexes for SQL query with WHERE condition and GROUP BY

edited May 23 '17 at 10:30

Community

1
1

answered Jul 28 '14 at 16:43

Erwin Brandstetter

605,456
145
1,078
1,228

1

And have the right results in the right place as an added bonus. – John Powell Jul 28 '14 at 16:45
@JohnBarça: Minor bonus. ;) – Erwin Brandstetter Jul 28 '14 at 16:46
You also forgot the order by population desc in the sub query :D – John Powell Jul 28 '14 at 16:49
That's not doing what the question asked for. The population isn't taken into account in there at all. – Darrell Fuhriman Jul 28 '14 at 16:51
@DarrellFuhriman: Updated with a working version. My oversight was that we select `the_geom`, but order by a different column (`population`). Thanks for the reminder to John as well. I was mislead by the title asking for `the largest value from a set of points` ... Going to improve that, too. – Erwin Brandstetter Jul 28 '14 at 16:55
Wouldn't it be distinct on (c1.counties) and order by g1.population to get top population by county? – John Powell Jul 28 '14 at 17:02
@John: Basically yes. `c1.pk` is supposed to be the primary key column of `countries` as stated below. But we need to order by `c1.pk` first - and then by `g1.population`. The leading `ORDER BY` items have to agree with the `DISTINCT ON` items. More explanation in the linked answer. – Erwin Brandstetter Jul 28 '14 at 17:06
The "alternative" above works on a subset of features with one edit - the last line should be: `WHERE c.pk = sub.pk;` – Bill Morris Jul 28 '14 at 17:39
Thanks for the background info linked above as well! – Bill Morris Jul 28 '14 at 19:35
1

@ErwinBrandstetter. Thanks for info on distinct on for these purposes and performance and the hint to use lateral joins. It all gets a bit murky when spatial indexes are thrown into the mix. – John Powell Jul 28 '14 at 20:34

Darrell Fuhriman · Accepted Answer · 2014-07-28T17:38:34.557

1

It helps to use a window function here:

 WITH max_pop as (
  SELECT DISTINCT c.id as county_id,
    first_value(g.the_geom) OVER (PARTITION BY c.name ORDER BY g.population DESC) as the_geom
  FROM counties c
  JOIN geonames g
    ON ST_Intersects(c.the_geom,g.the_geom)
  )
UPDATE counties
SET pop_center_geom=max_pop.the_geom
FROM max_pop
WHERE counties.id=max_pop.county_id;

What we're doing:

Ordering the cities in each county by population descending, then taking the geometry of the first one and the id of the county in which it resides.

We then update the county table using the id and the geometry we got.

I prefer this to DISTINCT ON method mentioned because to me it's more explicit about what's happening and relies less on "side-effects" (for lack of a better word).

edited Jul 28 '14 at 17:38

answered Jul 28 '14 at 17:25

Darrell Fuhriman

241
1
3

There are no "side-effects" in the `DISTINCT ON` statement. It's only downside, if you will, it's not standard SQL (as opposed to this valid alternative). `DISTINCT ON` is typically shorter and easier to understand IMO. It's also faster. I tested many times. – Erwin Brandstetter Jul 28 '14 at 17:42
This one did the trick as well. Dang; wish I could vote twice. – Bill Morris Jul 28 '14 at 19:26
1

@BillMorris: Darell didn't axe the window function, just moved the window definition to the `OVER` clause, which is shorter syntax for a single window function (and no difference in performance). [Details in the manual.](http://www.postgresql.org/docs/current/interactive/sql-select.html#SQL-WINDOW) – Erwin Brandstetter Jul 28 '14 at 19:52

Assign point with largest population from set of points contained in a polygon

2 Answers2