4

In my Postgres 9.5 database with PostGis 2.2.0 installed, I have two tables with geometric data (points) and I want to assign points from one table to the points from the other table, but I don't want a buildings.gid to be assigned twice. As soon as one buildings.gid is assigned, it should not be assigned to another pvanlagen.buildid.

Table definitions

buildings:

CREATE TABLE public.buildings (
  gid numeric NOT NULL DEFAULT nextval('buildings_gid_seq'::regclass),
  osm_id character varying(11),
  name character varying(48),
  type character varying(16),
  geom geometry(MultiPolygon,4326),
  centroid geometry(Point,4326),
  gembez character varying(50),
  gemname character varying(50),
  krsbez character varying(50),
  krsname character varying(50),
  pv boolean,
  gr numeric,
  capac numeric,
  instdate date,
  pvid numeric,
  dist numeric,
  CONSTRAINT buildings_pkey PRIMARY KEY (gid)
);

CREATE INDEX build_centroid_gix
  ON public.buildings
  USING gist
  (st_transform(centroid, 31467));

CREATE INDEX buildings_geom_idx
  ON public.buildings
  USING gist
  (geom);

pvanlagen:

CREATE TABLE public.pvanlagen (
  gid integer NOT NULL DEFAULT nextval('pv_bis2010_bayern_wgs84_gid_seq'::regclass),
  tso character varying(254),
  tso_number numeric(10,0),
  system_ope character varying(254),
  system_key character varying(254),
  location character varying(254),
  postal_cod numeric(10,0),
  street character varying(254),
  capacity numeric,
  voltage_le character varying(254),
  energy_sou character varying(254),
  beginning_ date,
  end_operat character varying(254),
  id numeric(10,0),
  kkz numeric(10,0),
  geom geometry(Point,4326),
  gembez character varying(50),
  gemname character varying(50),
  krsbez character varying(50),
  krsname character varying(50),
  buildid numeric,
  dist numeric,
  trans boolean,
  CONSTRAINT pv_bis2010_bayern_wgs84_pkey PRIMARY KEY (gid),
  CONSTRAINT pvanlagen_buildid_fkey FOREIGN KEY (buildid)
      REFERENCES public.buildings (gid) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION,
  CONSTRAINT pvanlagen_buildid_uni UNIQUE (buildid)
);

CREATE INDEX pv_bis2010_bayern_wgs84_geom_idx
  ON public.pvanlagen
  USING gist
  (geom);

Query

My idea was to add a boolean column pv in the buildings table, which is set when a buildings.gid was assigned:

UPDATE pvanlagen 
SET buildid=buildings.gid, dist='50'
FROM buildings
WHERE buildid IS NULL 
AND buildings.pv is NULL
AND pvanlagen.gemname=buildings.gemname 
AND ST_Distance(ST_Transform(pvanlagen.geom,31467)
               ,ST_Transform(buildings.centroid,31467))<50;

UPDATE buildings 
SET pv=true
FROM pvanlagen
WHERE buildings.gid=pvanlagen.buildid;

I tested for 50 rows in buildings but it takes too long to apply for all of them. I have 3.200.000 buildings and 260.000 PV.

The gid of the closest building shall be assigned. If In case of ties, it should not matter which gid is assigned. If we need to frame a rule, we can take the building with the lower gid.

50 meters was meant to work as a limit. I used ST_Distance() because it returns the minimum distance, which should be within 50 meters. Later I raised it multiple times, until every PV Anlage was assigned.

Buildings and PV are assigned to their respective regions (gemname). This should make the assignment cheaper, since I know the nearest building must be within the same region (gemname).

I tried this query after feedback below:

UPDATE pvanlagen p1
SET    buildid = buildings.gid
 , dist = buildings.dist  
FROM (
   SELECT DISTINCT ON (b.gid)
          p.id, b.gid, b.dist::numeric  
   FROM  (
      SELECT id, ST_Transform(geom, 31467) 
      FROM   pvanlagen
      WHERE  buildid IS NULL  -- not assigned yet
      ) p
        , LATERAL (
      SELECT b.gid, ST_Distance(ST_Transform(p1.geom, 31467), ST_Transform(b.centroid, 31467)) AS dist
      FROM   buildings      b
      LEFT   JOIN pvanlagen p1 ON p1.buildid = b.gid  
      WHERE  p1.buildid IS NULL                        
      AND    b.gemname = p1.gemname
      ORDER  BY ST_Transform(p1.geom, 31467) <-> ST_Transform(b.centroid, 31467)
      LIMIT  1
            ) b
       ORDER  BY b.gid, b.dist, p.id  -- tie breaker
       ) x, buildings
 WHERE   p1.id = x.id;

But it returns with 0 rows affected in 234 ms execution time.
Where am I going wrong?

NewbieNeedsHelp
  • 165
  • 1
  • 1
  • 9

1 Answers1

5

Table schema

To enforce your rule simply declare pvanlagen.buildid UNIQUE:

ALTER TABLE pvanlagen ADD CONSTRAINT pvanlagen_buildid_uni UNIQUE (buildid);

building.gid is the PK, as your update revealed. To also enforce referential integrity add a FOREIGN KEY constraint to buildings.gid.

You have implemented both by now. But it would be more efficient to run the big UPDATE below before you add these constraints.

There is a lot more that should be improved in your table definition. For one, buildings.gid as well as pvanlagen.buildid should be type integer (or possibly bigint if you burn a lot of PK values). numeric is expensive nonsense.

Let's focus on the core problem:

Basic Query to find closest building

The case is not as simple as it may seem. It's a "nearest neighbour" problem, with the additional complication of unique assignment.

This query finds the nearest one building for each PV (short for PV Anlage - row in pvanlagen), where neither is assigned, yet:

SELECT pv_gid, b_gid, dist
FROM  (
   SELECT gid AS pv_gid, ST_Transform(geom, 31467) AS geom31467
   FROM   pvanlagen
   WHERE  buildid IS NULL  -- not assigned yet
   ) p
     , LATERAL (
   SELECT b.gid AS b_gid
        , round(ST_Distance(p.geom31467
                      , ST_Transform(b.centroid, 31467))::numeric, 2) AS dist  -- see below
   FROM   buildings b
   LEFT   JOIN pvanlagen p1 ON p1.buildid = b.gid  -- also not assigned ...
   WHERE  p1.buildid IS NULL                       -- ... yet  
   -- AND    p.gemname = b.gemname                 -- not needed for performance, see below
   ORDER  BY p.geom31467 <-> ST_Transform(b.centroid, 31467)
   LIMIT  1
   ) b;

To make this query fast, you need a spatial, functional GiST index on buildings to make it much faster:

CREATE INDEX build_centroid_gix ON buildings USING gist (ST_Transform(centroid, 31467));

Not sure why you don't

Related answers with more explanation:

Further reading:

With the index in place, we don't need to restrict matches to the same gemname for performance. Only do this if it's an actual rule to enforced. If it has to be observed at all times, include the column in the FK constraint:

Remaining Problem

We can use the above query it in an UPDATE statement. Each PV is only used once, but more than one PV might still find the same building to be closest. You only allow one PV per building. So how would you resolve that?

In other words, how would you assign objects here?

Buildings and PV diagram

Simple solution

One simple solution would be:

UPDATE pvanlagen p1
SET    buildid = sub.b_gid
     , dist    = sub.dist  -- actual distance
FROM  (
   SELECT DISTINCT ON (b_gid)
          pv_gid, b_gid, dist
   FROM  (
      SELECT gid AS pv_gid, ST_Transform(geom, 31467) AS geom31467
      FROM   pvanlagen
      WHERE  buildid IS NULL  -- not assigned yet
      ) p
        , LATERAL (
      SELECT b.gid AS b_gid
           , round(ST_Distance(p.geom31467
                         , ST_Transform(b.centroid, 31467))::numeric, 2) AS dist  -- see below
      FROM   buildings      b
      LEFT   JOIN pvanlagen p1 ON p1.buildid = b.gid  -- also not assigned ...
      WHERE  p1.buildid IS NULL                       -- ... yet  
      -- AND    p.gemname = b.gemname                 -- not needed for performance, see below
      ORDER  BY p.geom31467 <-> ST_Transform(b.centroid, 31467)
      LIMIT  1
      ) b
   ORDER  BY b_gid, dist, pv_gid  -- tie breaker
   ) sub
WHERE   p1.gid = sub.pv_gid;

I use DISTINCT ON (b_gid) to reduce to exactly one row per building, picking the PV with the shortest distance. Details:

For any building that is closest for more one PV, only the closest PV is assigned. The PK column gid (alias pv_gid) serves as tiebreaker if two are equally close. In such a case, some PV are dropped from the update and remain unassigned. Repeat the query until all PV are assigned.

This is still a simplistic algorithm, though. Looking at my diagram above, this assigns building 4 to PV 4 and building 5 to PV 5, while 4-5 and 5-4 would probably be a better solution overall ...

Aside: type for dist column

Currently you use numeric for it. your original query assigned a constant integer, no point making in numeric.

In my new query ST_Distance() returns the actual distance in meters as double precision. If we simply assign that we get 15 or so fractional digits in the numeric data type, and the number is not that exact to begin with. I seriously doubt you want to waste the storage.

I would rather save the original double precision from the calculation. or, better yet, round as needed. If meters are exact enough, just cast to and save an integer (rounding the number automatically). Or multiply with 100 first to save cm:

(ST_Distance(...) * 100)::int
Community
  • 1
  • 1
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • @ErwinBrandstetter, +1 for more detailed answer. – MAK Jan 01 '16 at 06:09
  • The query works, but the efficiency follows a logarithmic curve. First run 158376 rows affected, 02:40:7247 hours execution time. Current run (16th run) 984 rows affected, 06:12:21648 hours execution time (Altogether 72h execution time by now). 9788 unassigned rows are left. I am already working with the GiST Index you suggested. Any ideas on how to fasten things up? In the end I will cut off PV Anlagen with a `buildid NOT NULL AND dist>1000`. Last run 835 of 984 had `dist>1000`. Can we consider this in the query to fasten things up? – NewbieNeedsHelp Jan 17 '16 at 14:22
  • @NewbieNeedsHelp: These numbers seems *way* too slow. You probably need to configure your db server (among other things?). More `work_mem` would be my first *guess*. You might post another question about performance on dba.SE, but consider [instructions for postgresql performance questions first](http://dba.stackexchange.com/tags/postgresql-performance/info). – Erwin Brandstetter Jan 17 '16 at 17:15