1

I am running the below function. Testing it with a much smaller table works as expected (18 rows - ~400ms). However, when pointed at my real data (315000 rows), it is running 48hrs and still going. This is much longer than I expected under a linear extrapolation.

  1. Is there a better way to do this?
  2. Is there a way to test if it is doing what it should be doing while still running?

Is there any way to optimise the below function?

DO 
$do$ 
DECLARE r public.tablex%rowtype; 
BEGIN 
FOR r IN SELECT id FROM public.tablex 
LOOP 
IF (select cast((select trunc(random() * 6 + 1)) as integer) = 5) THEN 
UPDATE public.tablex SET test='variable1' WHERE id = r.id; 
ELSIF (select cast((select trunc(random() * 6 + 1)) as integer) = 6) THEN 
UPDATE public.tablex SET test='variable2' WHERE id = r.id; 
END IF; 
END LOOP; 
RETURN; 
END 
$do$;
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228

2 Answers2

4

@kordirko made clear that your DO statement is needlessly expensive. But there's more.

Probability

The probability in your update is not evenly distributed:

  • 1/6 or all rows are updated to 'variable1'.
  • But only 5/36 ((1/6) * (5/6)) are updated to 'variable2'.

Judging from the rest of your code, I am going to assume this is an unintended error and you want an equal 1/6 share for each.

Sanitize

You can simplify to:

UPDATE tablex
SET    test = CASE trunc(random() * 6)
                 WHEN  float '4' THEN 'variable1'
                 WHEN  float '5' THEN 'variable2'
                 ELSE  test
              END;
  • It's pointless to add 1 to the result. Rather compare to 4 and 5 instead of 5 and 6 (or 3 and 1 - no difference here).

  • It's a bit cheaper to compare to double precision (= float) constants instead of casting the double precision result of the expression trunc(random() * 6) to integer for every row, like it would happen in your original code.

Better, but still very inefficient.

Superior

UPDATE tablex
SET    test = CASE WHEN random() >= float '0.5' THEN 'variable1'
                                                ELSE 'variable2' END
WHERE  random() >= float '0.6666667';

100 % equivalent (unless you have triggers ON UPDATE or some exotic setup) but much faster yet, because only rows that actually receive an update are touched. Two thirds of the rows are not touched at all.

  • Note that the two random() calls are completely independent.

  • Use 2 / float '3.0' instead of float '0.6666667' if you want to be as precise as possible about the probability of two thirds. But that's really an academic difference.

  • You might want to LOCK tablex IN ROW EXCLUSIVE MODE before you run this (in the same transaction) to rule out race conditions with concurrent writes. Details in the manual.

Alternative

If the column test can already hold one of the target values 'variable1' or 'variable2' (in more than a few cases) this is cheaper, yet:

UPDATE tablex t
SET    test = upd.val
FROM  (
   SELECT id, CASE WHEN random() >= float '0.5' THEN 'variable1'
                                                ELSE 'variable2' END AS val
   FROM   tablex
   WHERE  random() >= float '0.6666667'
   -- ORDER  BY id   -- the last two lines only to defend against ...
   -- FOR    UPDATE  -- ... concurrent writes and possible deadlocks
   ) upd
WHERE  t.id = upd.id
AND    t.test IS DISTINCT FROM upd.val;

Avoiding more empty updates.
If test is defined NOT NULL you can simplify to:

 AND    t.test <> upd.val;

Compare the last chapter of this related answer:

Community
  • 1
  • 1
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • Thank you. Very useful.It was indeed an unintended error – user2263513 Mar 29 '15 at 05:49
  • I Just wanted to successfully run the query - in the end it took me ~86000ms. – user2263513 Mar 29 '15 at 18:40
  • @user2263513: 86 seconds is a remarkable improvement over 48+ hours. But typically I would expect this to be faster on a halfway decent server, a few seconds at most for 300k rows - unless other things slow you down, like very wide rows or expensive indexes or triggers or concurrent load ... Which variant did you apply? And did you lock the table? – Erwin Brandstetter Mar 29 '15 at 21:46
  • UPDATE tablex SET test = CASE WHEN random() >= float '0.5' THEN 'variable1' ELSE 'variable2' END WHERE random() >= float '0.6666667'; You picked it in one, I also had some active triggers slowing things down – user2263513 Mar 30 '15 at 06:58
1

Try a simple update:

UPDATE tablex
SET  test =  CASE trunc(random() * 6) + 1
                 WHEN  5 THEN 'variable1'
                 WHEN  6 THEN 'variable2'
                 ELSE test
       END
;

I guess it will be at least 50~200 times faster.

krokodilko
  • 35,300
  • 7
  • 55
  • 79