
I have a table with 3.7 million rows. It has a boolean column (called bool_value below) that already has a value for every row.

I am using this query to update bool_value:

DROP INDEX IF EXISTS table_bool_value_idx;

UPDATE table
SET bool_value = (
st_contains(st_setsrid((SELECT geom FROM gis.cntys04_3081 WHERE gid=57), 3081), the_geom)
AND st_contains((SELECT geom_3081 FROM gis.stratmap_city_poly_v4 WHERE gid = 127), the_geom)
);

This query has run for about 12 hours so far. For the entire time, my hard drive has been at 100% utilization.

I don't understand this hard drive utilization: the table has been vacuumed within the past month, bool_value already has a value for each row, and I dropped the index. Both of the tables in the gis namespace (referenced in the query) are indexed on their gid field. Those gis namespace tables are not views; they are just shapefiles that I imported using the PostGIS 2.0 Shapefile and DBF Loader Exporter. Nothing else significant is happening on the computer.

Per Task Manager, CPU utilization is next to nothing, even on the postgres.exe process. I assume this is due to the extreme hard drive activity.

The computer has 8 GB of RAM total. Almost half of that is free. There are 14 concurrent postgres.exe processes open, each using between 2.8 MB and 9 MB. Only one of the processes is showing much activity.

Here's the one thing I wonder about: the geom_3081 value from the second st_contains is 1.0 MB. It's a PostGIS MULTIPOLYGON representing the city limits of Dallas, TX. I imagine this would be cached by, if nothing else, the Windows disk cache, preventing repeated hard drive accesses. geom_3081 is not indexed, although I don't see why indexing it would help, as it's not being used to determine which rows to return; the gid field is doing that.
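
For reference, the stored (possibly compressed) size of that one geometry can be confirmed with pg_column_size:

SELECT pg_column_size(geom_3081) AS stored_bytes
FROM   gis.stratmap_city_poly_v4
WHERE  gid = 127;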

Aren Cambre
  • A couple of thoughts: is anything else using that table in the meantime (since Postgres will version, rather than lock, there will be two versions of the whole table until the query completes)? Have you looked at the query plan (maybe the nested queries have led to an odd access pattern)? – IMSoP Mar 17 '13 at 22:03
  • Nothing else is using the table. Even if something else was using it, the table doesn't have that much data per row, so duplicating the table wouldn't take 12 hours. My most recent **.backup** file of the table is just over 500 MB. – Aren Cambre Mar 17 '13 at 22:12
  • I don't see your PostgreSQL version and your table definition anywhere in the question? – Erwin Brandstetter Mar 17 '13 at 23:17

1 Answer

Whatever other problems exist here (and most probably there are others), this should make your query faster:

UPDATE table t
SET    bool_value = upd.new_bool
FROM  (
   SELECT table_id
         ,(st_contains(st_setsrid(
                (SELECT geom FROM gis.cntys04_3081 WHERE gid = 57)
               , 3081), the_geom)
           AND st_contains(
                (SELECT geom_3081 FROM gis.stratmap_city_poly_v4 WHERE gid = 127)
               , the_geom)
          ) AS new_bool
   FROM   table
   ) upd
WHERE  t.table_id = upd.table_id
AND    t.bool_value IS DISTINCT FROM upd.new_bool;

table_id is the primary key here.

The principle is to avoid empty updates. And since we are talking about a boolean value, chances are you have a large portion of updates that do not actually change anything. But such an update still results in disk activity, since the old row is marked obsolete and a new (unchanged) row is inserted.
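
One way to watch that churn while the query runs is through the statistics views. A minimal sketch (pg_stat_user_tables is a standard view; substitute your real table name for the placeholder):

SELECT n_tup_upd, n_tup_hot_upd, n_dead_tup
FROM   pg_stat_user_tables
WHERE  relname = 'table';  -- 'table' is the placeholder name from the question

n_tup_upd counts all updates, n_tup_hot_upd those that stayed HOT, and n_dead_tup the obsolete row versions waiting for VACUUM; empty updates show up here as dead tuples even though nothing changed.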

To figure out what else is going wrong, you could take the proven path of divide and conquer. Append to the subquery:

ORDER  BY table_id
LIMIT  10
OFFSET 0

If that goes through, repeat with bigger slices: keep incrementing LIMIT and OFFSET, and increase the slice size as long as performance stays good. Be sure to run each slice in a separate transaction, as sketched below.
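
Put together, a single slice might look like this (a sketch reusing the query from above; run each slice as its own transaction):

BEGIN;

UPDATE table t
SET    bool_value = upd.new_bool
FROM  (
   SELECT table_id
         ,(st_contains(st_setsrid(
                (SELECT geom FROM gis.cntys04_3081 WHERE gid = 57)
               , 3081), the_geom)
           AND st_contains(
                (SELECT geom_3081 FROM gis.stratmap_city_poly_v4 WHERE gid = 127)
               , the_geom)
          ) AS new_bool
   FROM   table
   ORDER  BY table_id
   LIMIT  10     -- grow this once small slices go through
   OFFSET 0      -- advance past the rows already processed
   ) upd
WHERE  t.table_id = upd.table_id
AND    t.bool_value IS DISTINCT FROM upd.new_bool;

COMMIT;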

This might avoid caching issues (as indicated by the excessive disk load) and also sidestep any concurrency issues you might have. If updating even a few rows is still slow, you definitely have other issues.


As for:

I don't understand this hard drive utilization as the table has been vacuumed within the past month,

That's a misconception. VACUUM does not normally make UPDATE faster (except in extreme cases of table bloat); it may even make it slower. It's worse still with VACUUM FULL, which removes all dead tuples and packs data pages with live tuples as tightly as possible.

Subsequent UPDATEs then cannot use HOT updates, since there is no "wiggle room" left on individual data pages. If you have lots of updates, consider a lower setting for fillfactor.
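
A minimal sketch (70 is only an illustrative value; the setting affects newly written pages, so existing pages are only repacked by a table rewrite such as VACUUM FULL or CLUSTER):

ALTER TABLE table SET (fillfactor = 70);  -- leave ~30% of each page free for HOT updates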

Find more information and links in closely related questions on dba.SE.

Erwin Brandstetter
  • Thank you for the detailed answer. The query ended up finishing after about 16 hours, so I wasn't able to employ your suggestion, but it sounds reasonable. – Aren Cambre Mar 18 '13 at 01:19