0

I have the following tables:

SELECT * FROM labels LIMIT 5;
 user_id | session_id |    start_timestamp     |     end_timestamp      | travelmode 
---------+------------+------------------------+------------------------+------------
      11 |          0 | 2007-06-26 12:32:29+01 | 2007-06-26 12:40:29+01 | bus
      11 |          0 | 2008-03-31 17:00:08+01 | 2008-03-31 17:09:01+01 | taxi
      11 |          0 | 2008-04-01 01:48:32+01 | 2008-04-01 01:59:23+01 | taxi
      11 |          0 | 2008-04-01 02:00:22+01 | 2008-04-01 02:08:13+01 | walk
      11 |          0 | 2008-04-01 12:22:47+01 | 2008-04-01 12:28:39+01 | taxi
(5 rows)

SELECT * FROM trajectories LIMIT 5;
 user_id |    session_id     |       timestamp        |    lat    |    lon     | alt  
---------+-------------------+------------------------+-----------+------------+------
      11 | 10020080330004134 | 2008-03-30 00:41:34+00 | 36.032647 | 103.850612 | -777
      11 | 10020080330004134 | 2008-03-30 00:42:05+00 | 36.031563 | 103.851273 | -777
      11 | 10020080330004134 | 2008-03-30 00:43:04+00 | 36.028623 | 103.853238 | -777
      11 | 10020080330004134 | 2008-03-30 00:44:03+00 | 36.027323 | 103.854475 | -777
      11 | 10020080330004134 | 2008-03-30 00:45:02+00 | 36.025775 | 103.854993 | -777
(5 rows)

So I want to update the session_id column of the labels table(initially all zero):

UPDATE labels
SET session_id=trajectories.session_id
FROM trajectories
WHERE 
  trajectories.user_id = labels.user_id
  AND trajectories.timestamp >= labels.start_timestamp 
  AND trajectories.timestamp <= labels.end_timestamp;

UPDATE 4500

However, not all columns of the labels table were updated (about 30% only) after about 5-minutes of query execution:

 SELECT COUNT(*) FROM labels WHERE session_id=0;
 count 
-------
 10217
(1 row)

If it might helps, more about the tables:

\d labels
                             Table "akil.labels"
     Column      |           Type           | Collation | Nullable | Default 
-----------------+--------------------------+-----------+----------+---------
 user_id         | integer                  |           | not null | 
 session_id      | bigint                   |           |          | 
 start_timestamp | timestamp with time zone |           | not null | 
 end_timestamp   | timestamp with time zone |           | not null | 
 travelmode      | text                     |           |          | 
Indexes:
    "mode_pkey" PRIMARY KEY, btree (user_id, start_timestamp, end_timestamp)

 \d trajectories 
                       Table "akil.trajectories"
   Column   |           Type           | Collation | Nullable | Default 
------------+--------------------------+-----------+----------+---------
 user_id    | integer                  |           |          | 
 session_id | bigint                   |           | not null | 
 timestamp  | timestamp with time zone |           | not null | 
 lat        | double precision         |           | not null | 
 lon        | double precision         |           | not null | 
 alt        | double precision         |           |          | 
Indexes:
    "trajectory_pkey" PRIMARY KEY, btree (session_id, "timestamp", lat, lon)

EDIT

Adding index to trajectories table:

 CREATE INDEX idx_trecj ON trajectories (user_id, timestamp);
CREATE INDEX

UPDATE labels
SET session_id=trajectories.session_id
FROM trajectories
WHERE 
  trajectories.user_id = labels.user_id
  AND trajectories.timestamp >= labels.start_timestamp 
  AND trajectories.timestamp <= labels.end_timestamp;

UPDATE 4500

 SELECT COUNT(*) FROM labels WHERE session_id=0;
 count 
-------
 10217
(1 row

Yet, not all session_id in labels are updated (same as the initial operation).

EDIT

Result of queries from the answer by @Erwin (this shows many none):

SELECT count(*) AS no_match_in_trajectories
FROM   akil.labels l
WHERE  NOT EXISTS (
   SELECT FROM akil.trajectories t
   WHERE  t.user_id   =  l.user_id
   AND    t.timestamp >= l.start_timestamp 
   AND    t.timestamp <= l.end_timestamp
   );
no_match_in_trajectories 
--------------------------
                    10217
(1 row)

In the second query (suggested way to do the update, none was updated);

UPDATE akil.labels l
SET    session_id = t.session_id
FROM   akil.labels l1
CROSS  JOIN   LATERAL (
   SELECT t1.session_id
   FROM   akil.trajectories t1
   WHERE  t1.user_id   =  l1.user_id
   AND    t1.timestamp >= l1.start_timestamp 
   AND    t1.timestamp <= l1.end_timestamp
   ORDER  BY t1.timestamp  -- pick the earliest matching entry
   LIMIT  1
   ) t
WHERE ( l.user_id,  l.start_timestamp,  l.end_timestamp)  -- PK
   =  (l1.user_id, l1.start_timestamp, l1.end_timestamp)
AND    l.session_id IS DISTINCT FROM t.session_id;

UPDATE 0
arilwan
  • 3,374
  • 5
  • 26
  • 62

2 Answers2

1

search_path

Your table definition reveals:

Table "akil.labels"

So the table lives in schema akil. Is the search_path set correctly at all times, so that you can afford to omit schema qualification in your queries? Else, you might be updating the wrong table by accident / from the wrong table - which could explain your results. See:

But there are other possible explanations.

No match in akil.trajectories

What makes you think there is an applicable row in trajectories for every row in labels?

This many have none:

SELECT count(*) AS no_match_in_trajectories
FROM   akil.labels l
WHERE  NOT EXISTS (
   SELECT FROM akil.trajectories t
   WHERE  t.user_id   =  l.user_id
   AND    t.timestamp >= l.start_timestamp 
   AND    t.timestamp <= l.end_timestamp
   );

Multiple matches in akil.trajectories

On a related note: What makes you think there would be exactly one applicable row? Your UPDATE produces arbitrary results (in an expensive way) if there are more than one applicable rows in trajectories.

Given your table definitions, this would be a proper way to do it:

UPDATE akil.labels l
SET    session_id = t.session_id
FROM   akil.labels l1
CROSS  JOIN   LATERAL (
   SELECT t1.session_id
   FROM   akil.trajectories t1
   WHERE  t1.user_id   =  l1.user_id
   AND    t1.timestamp >= l1.start_timestamp 
   AND    t1.timestamp <= l1.end_timestamp
   ORDER  BY t.timestamp  -- pick the earliest matching entry
   LIMIT  1
   ) t
WHERE ( l.user_id,  l.start_timestamp,  l.end_timestamp)  -- PK
   =  (l1.user_id, l1.start_timestamp, l1.end_timestamp)
AND    l.session_id IS DISTINCT FROM t.session_id;        -- avoids empty updates

About avoidable empty updates:

The perfect index for this would be:

CREATE INDEX foo ON akil.trajectories (user_id, timestamp) INCLUDE (session_id)

INCLUDE requires Postgres 11 or later.

Asides

Your whole setup looks suspiciously messy.

  • mode_pkey is the name of labels PK column. Quite misleading, table seems to have been renamed? Or is something off there?

  • The PK of trajectories:

    "trajectory_pkey" PRIMARY KEY, btree (session_id, "timestamp", lat, lon)
    

    Again, non-default name. A PK spanning 4 columns is suspicious to begin with. And lat and lon are floating point numbers. That's inviting duplicates with rounding errors, leading to all kinds of confusion.

  • Like I commented: If session_id is "initially all zero", the column should probably be NOT NULL DEFAULT 0.

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
-1

For trajectories, you want an index on (user_id, timestamp).

That should help your performance.

You also have an issue that there might be multiple matches. If that is the case, your current update has to do much more work than necessary. But the index should be a good first step.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • Gordon Linoff, Whatever I have read is, Indexes on tables slows down the DML operations. Isn't it true always? – Ankit Bajpai Jun 11 '20 at 22:57
  • Created an index on `trajectories` table and repeated the `UPDATE` operation, same result. – arilwan Jun 11 '20 at 23:03
  • @AnkitBajpai . . . There is a small overhead in maintaining the index when the data changes. However, your `update` doesn't even affect the columns in the suggested index, so nothing needs to be done. – Gordon Linoff Jun 12 '20 at 00:25
  • @GordonLinoff In PostgreSQL *all* indexes get modified, because a new row version is created. That is, unless you get a HOT update. – Laurenz Albe Jun 12 '20 at 05:53
  • @LaurenzAlbe . . . I thought Postgres checked the version of the row when the index was used. Thank you for the clarification. – Gordon Linoff Jun 12 '20 at 10:38
  • @GordonLinoff Each row version has its own index entry. Different versions of the same row are not connected in any way. – Laurenz Albe Jun 12 '20 at 11:01