
I wonder if the following script can be optimized somehow. It writes a lot to disk because it deletes possibly up-to-date rows and reinserts them. I was thinking about applying something like "insert ... on duplicate key update" and found some possibilities for single-row updates, but I don't know how to apply it in the context of an INSERT INTO ... SELECT query.

CREATE OR REPLACE FUNCTION update_member_search_index() RETURNS VOID AS $$
DECLARE
   member_content_type_id INTEGER;
BEGIN
   member_content_type_id :=
      (SELECT id FROM django_content_type
       WHERE app_label='web' AND model='member');

   DELETE FROM watson_searchentry WHERE content_type_id = member_content_type_id;

   INSERT INTO watson_searchentry (engine_slug, content_type_id, object_id
                                 , object_id_int, title, description, content
                                 , url, meta_encoded)
   SELECT 'default',
         member_content_type_id,
         web_member.id,
         web_member.id,
         web_member.name,
         '',
         web_user.email||' '||web_member.normalized_name||' '||web_country.name,
         '',
         '{}'
   FROM web_member
   INNER JOIN web_user ON (web_member.user_id = web_user.id)
   INNER JOIN web_country ON (web_member.country_id = web_country.id)
   WHERE web_user.is_active=TRUE;
END;
$$ LANGUAGE plpgsql;

EDIT: Schemas of web_member, watson_searchentry, web_user, web_country: http://pastebin.com/3tRVPPVi.

The main point is to update the columns title and content in watson_searchentry. There is a trigger on the table that sets the value of the column search_tsv based on these columns.

(content_type_id, object_id_int) in watson_searchentry is a unique pair in the table, but at the moment there is no index on it (there has been no use for one).

This script should be run at most once a day for full rebuilds of search index and occasionally after importing some data.

  • Rewriting it in plain SQL seems feasible. (You don't need the `member_content_type_id` variable; it can be obtained by an extra term in the delete and insert queries.) WRT the upsert: just an update, followed by an insert (where not exists), does the trick. Or use a RETURNING construct. – wildplasser Oct 16 '13 at 20:48
  • 1
    Do you really mean `'default',`, or do you mean `DEFAULT` ? IOW: please add the table definitions. – wildplasser Oct 18 '13 at 00:31
  • @wildplasser: I really meant 'default'. I added some more info. – clime Oct 18 '13 at 09:54

1 Answer

Modified table definition

If you really need those columns to be NOT NULL and you really need the string 'default' as default for engine_slug, I would advise to introduce column defaults:

COLUMN           |          TYPE           |      Modifiers
-----------------+-------------------------+---------------------
 id              | INTEGER                 | NOT NULL DEFAULT ... 
 engine_slug     | CHARACTER VARYING(200)  | NOT NULL DEFAULT 'default'
 content_type_id | INTEGER                 | NOT NULL
 object_id       | text                    | NOT NULL
 object_id_int   | INTEGER                 |
 title           | CHARACTER VARYING(1000) | NOT NULL
 description     | text                    | NOT NULL DEFAULT ''
 content         | text                    | NOT NULL
 url             | CHARACTER VARYING(1000) | NOT NULL DEFAULT ''
 meta_encoded    | text                    | NOT NULL DEFAULT '{}'
 search_tsv      | tsvector                | NOT NULL
 ...

DDL statement would be:

ALTER TABLE watson_searchentry ALTER COLUMN engine_slug SET DEFAULT 'default';

Etc.

Then you don't have to insert those values manually every time.
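With those defaults in place, an INSERT only needs to name the remaining columns and the rest are filled in automatically. A minimal sketch (the values are made up for illustration):

```sql
-- engine_slug, description, url and meta_encoded fall back to their defaults
INSERT INTO watson_searchentry
       (content_type_id, object_id, object_id_int, title, content)
VALUES (1, '42', 42, 'some member', 'mail@example.com some member Atlantis');
```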

Also: object_id text NOT NULL, object_id_int INTEGER? That's odd. I guess you have your reasons ...

I'll go with your updated requirement:

The main point is to update columns title and content in watson_searchentry

Of course, you must add a UNIQUE constraint to enforce your requirements:

ALTER TABLE watson_searchentry
ADD CONSTRAINT ws_uni UNIQUE (content_type_id, object_id_int)

The index that comes with the constraint will be put to use, by this query for starters.

BTW, I almost never use varchar(n) in Postgres. Just text. Here's one reason.

Query with data-modifying CTEs

This could be rewritten as a single SQL query with data-modifying common table expressions, also called "writeable" CTEs. Requires Postgres 9.1 or later.
Additionally, this query only deletes what has to be deleted, and updates what can be updated.

WITH  ctyp AS (
   SELECT id AS content_type_id
   FROM   django_content_type
   WHERE  app_label = 'web'
   AND    model = 'member'
   )
, sel AS (
   SELECT ctyp.content_type_id
         ,m.id       AS object_id_int
         ,m.id::text AS object_id       -- explicit cast!
         ,m.name     AS title
         ,concat_ws(' ', u.email,m.normalized_name,c.name) AS content
         -- other columns have column default now.
   FROM   web_user    u
   JOIN   web_member  m  ON m.user_id = u.id
   JOIN   web_country c  ON c.id = m.country_id
   CROSS  JOIN ctyp
   WHERE  u.is_active
   )
, del AS (     -- only if you want to del all other entries of same type
   DELETE FROM watson_searchentry w
   USING  ctyp
   WHERE  w.content_type_id = ctyp.content_type_id
   AND    NOT EXISTS (
      SELECT 1
      FROM   sel
      WHERE  sel.object_id_int = w.object_id_int
      )
   )
, up AS (      -- update existing rows
   UPDATE watson_searchentry w
   SET    object_id = s.object_id
         ,title     = s.title
         ,content   = s.content
   FROM   sel s
   WHERE  w.content_type_id = s.content_type_id
   AND    w.object_id_int   = s.object_id_int
   )
               -- insert new rows
INSERT  INTO watson_searchentry (
        content_type_id, object_id_int, object_id, title, content)
SELECT  sel.*  -- safe to use, because col list is defined accordingly above
FROM    sel
LEFT    JOIN watson_searchentry w1 USING (content_type_id, object_id_int)
WHERE   w1.content_type_id IS NULL;
  • The subquery on django_content_type always returns a single value? Otherwise, the CROSS JOIN might cause trouble.

  • The first CTE sel gathers the rows to be inserted. Note how I pick matching column names to simplify things.

  • In the CTE del I avoid deleting rows that can be updated.

  • In the CTE up those rows are updated instead.

  • Accordingly, I avoid inserting rows that were not deleted before in the final INSERT.

Can easily be wrapped into an SQL or PL/pgSQL function for repeated use.

Not secure for heavy concurrent use. Much better than the function you had, but still not 100% robust against concurrent writes. But that's not an issue according to your updated info.

Replacing the UPDATEs with DELETE and INSERT may or may not be a lot more expensive. Internally, every UPDATE results in a new row version anyway, due to the MVCC model.
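For completeness: in newer Postgres versions (9.5+), INSERT ... ON CONFLICT covers exactly the "insert ... on duplicate key update" case you asked about, and it works with INSERT INTO ... SELECT. A sketch against the unique constraint added above (note it does not remove stale rows, so the DELETE of vanished members is still needed separately):

```sql
-- Postgres 9.5+: upsert directly on the unique constraint
INSERT INTO watson_searchentry
       (content_type_id, object_id_int, object_id, title, content)
SELECT ctyp.content_type_id, m.id, m.id::text, m.name
     , concat_ws(' ', u.email, m.normalized_name, c.name)
FROM   web_member  m
JOIN   web_user    u ON u.id = m.user_id
JOIN   web_country c ON c.id = m.country_id
CROSS  JOIN (SELECT id AS content_type_id
             FROM   django_content_type
             WHERE  app_label = 'web' AND model = 'member') ctyp
WHERE  u.is_active
ON     CONFLICT (content_type_id, object_id_int) DO UPDATE
SET    object_id = EXCLUDED.object_id
     , title     = EXCLUDED.title
     , content   = EXCLUDED.content;
```

EXCLUDED refers to the row proposed for insertion, so the conflicting row is updated with the fresh values.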

Speed first

If you don't really care about preserving old rows, your simpler approach may be faster: Delete everything and insert new rows. Also, wrapping into a plpgsql function saves a bit of planning overhead. Your function basically, with a couple of minor simplifications and observing the defaults added above:

CREATE OR REPLACE FUNCTION update_member_search_index()
  RETURNS VOID AS
$func$
DECLARE
   _ctype_id int := (
      SELECT id
      FROM   django_content_type
      WHERE  app_label='web'
      AND    model = 'member'
      );  -- you can assign at declaration time. saves another statement
BEGIN
   DELETE FROM watson_searchentry
   WHERE content_type_id = _ctype_id;

   INSERT INTO watson_searchentry
         (content_type_id, object_id, object_id_int, title, content)
   SELECT _ctype_id, m.id::text, m.id, m.name   -- object_id is text, needs the cast
         ,u.email || ' ' || m.normalized_name || ' ' || c.name
   FROM   web_member  m
   JOIN   web_user    u ON u.id = m.user_id
   JOIN   web_country c ON c.id = m.country_id
   WHERE  u.is_active;
END
$func$ LANGUAGE plpgsql;
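The function is then invoked like any other, for instance from a daily cron job or right after an import:

```sql
SELECT update_member_search_index();
```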

I even refrain from using concat_ws() here: it is safe against NULL values and simplifies the code, but it is a bit slower than plain concatenation.

Also:

There is a trigger on the table that sets value of column search_tsv based on these columns.

It would be faster to incorporate the logic into this function - if this is the only time the trigger is needed. Else, it's probably not worth the fuss.
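A hedged sketch of what inlining could look like, assuming the trigger does something along the lines of to_tsvector('english', title || ' ' || content) - the actual watson trigger may use a different configuration or weigh columns differently, so check its definition first, and disable or drop the trigger before relying on this, or it will overwrite search_tsv anyway:

```sql
-- variant of the INSERT above that fills search_tsv directly
INSERT INTO watson_searchentry
       (content_type_id, object_id, object_id_int, title, content, search_tsv)
SELECT _ctype_id, m.id::text, m.id, m.name
     , u.email || ' ' || m.normalized_name || ' ' || c.name
     , to_tsvector('english', m.name || ' ' || u.email || ' '
                            || m.normalized_name || ' ' || c.name)
FROM   web_member  m
JOIN   web_user    u ON u.id = m.user_id
JOIN   web_country c ON c.id = m.country_id
WHERE  u.is_active;
```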

  • Hey, you avoided the `RETURNING` construct. Good job (and a CTE as well!) – wildplasser Oct 18 '13 at 00:34
  • BTW, the `del` CTE is never referenced. Is there a typo in there, or am I misreading somehow? – wildplasser Oct 18 '13 at 07:57
  • Never seen such a statement:). But I am getting error atm: `ERROR: failed to find conversion function from unknown to character varying` around line 3. I have added some more info into my question. – clime Oct 18 '13 at 09:56
  • 1
    Well, you did not show the table definitions, so Erwin had to do some guesswork. (not everyone knows the django data model) You'll probably have to add a to/from text/varchar somewhere. – joop Oct 18 '13 at 10:11
  • @wildplasser: Data-modifying CTEs can be useful without ever being referenced. In the updated answer neither of the CTEs `up` or `del` are referenced. They just tap into the same CTE `sel`. One has to remember that these CTEs are executed in parallel, so not rely on a certain sequence of operations. – Erwin Brandstetter Oct 18 '13 at 16:47
  • @clime: The error came from the missing type information in the first version. Would have needed explicit cast like `''::text`. That part is gone completely in the now updated version. – Erwin Brandstetter Oct 18 '13 at 16:49
  • @erwin Have been off during weekend. I really appreciate your answer and gonna test it out now. – clime Oct 21 '13 at 11:38
  • @ErwinBrandstetter: Except for some typos it works nicely (here is my version http://pastebin.com/YyDgm9Yv). The problem is that this query is actually a bit more resource-consuming than the one in the question. It takes about 1.5 times longer and CPU utilization is also higher. Disk writes are not really better either (http://postimg.org/image/atsidhwft/). – clime Oct 21 '13 at 20:19
  • @clime: Well, the new query does more work, separating updates from deletes. And CTEs are sometimes optimization barriers, and plpgsql functions save some planning overhead ... I added a version of your function that may be a bit faster and some more advice. And fixed some typos – Erwin Brandstetter Oct 22 '13 at 23:21
  • 1
    Thanks man, you were right about the trigger, removing it reduces the time by 5 secs (from 44s to 39s). Otherwise there is not really a difference, which probably means that the function cannot be significantly optimized. Thanks for your help. I learned a lot. – clime Oct 23 '13 at 10:59