
I need to write a query that returns all requests that were not finished or canceled, from the beginning of the records up to a specific date. The way I'm doing it right now takes too much time and returns an error: 'User query might have needed to see row versions that must be removed' (my guess is that it's due to a lack of RAM).
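
(For context, that error is PostgreSQL's hot-standby query-cancellation message rather than an out-of-memory condition; see the link in the comments below. If the query runs against a streaming replica, the sketch below only inspects the standard conflict-related settings involved; it is an illustration, not a fix specific to this system.)

    -- Sketch only, relevant if the query runs on a streaming replica:
    -- inspect the standby-conflict settings that govern this cancellation.
    SELECT name, setting
    FROM pg_settings
    WHERE name IN ('hot_standby_feedback',
                   'max_standby_streaming_delay',
                   'max_standby_archive_delay');
    -- Raising max_standby_streaming_delay on the standby (in postgresql.conf, then
    -- reloading) or enabling hot_standby_feedback is the usual way to avoid it.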

Below is the query I'm using, along with some information about the tables:

  • T1, where each new request is saved, with an ID, creation date, status (open, closed) and keys to several other tables.

  • T2, where each change made to a request is saved (in progress, waiting, rejected and closed), with the date of the change and keys to other tables.

    SELECT T1.id_request,
           T1.dt_created,
           T1.status
    FROM T1
    LEFT JOIN T2
        ON T1.id_request = T2.id_request
    WHERE (T1.dt_created >= '2012-01-01 00:00:00' AND T1.dt_created <= '2020-05-31 23:59:59')
        AND T1.id_request NOT IN (SELECT T2.id_request
                                  FROM T2
                                  WHERE ((T2.dt_change >= '2012-01-01 00:00:00' 
                                         AND T2.dt_change <= '2020-05-31 23:59:59')
                                         OR T2.dt_change IS NULL)
                                         AND T2.status IN ('Closed','Canceled','rejected'))
    

My idea was to get everything received in T1 between the beginning of the records and, let's say, the end of May (I can't just retrieve what is currently open, since that only works for today, not for a specific past date, which is what I want), and then use WHERE T1.ID NOT IN (T2.ID with STATUS 'Closed', in the same period). But as I said, it takes forever and returns that error.

I use this same code to get what was open during a specific month (1st to 30th), and it works perfectly fine.

Maybe this is not the best approach, but I couldn't think of any other way (I'm not an expert in SQL). If there isn't enough information to provide an answer, feel free to ask.

As requested by @MikeOrganek, here is the explain (analyze, buffers) output (run over a one-month interval):

  Nested Loop Left Join  (cost=27985.55..949402.48 rows=227455 width=20) (actual time=2486.433..54832.280 rows=47726 loops=1)
    Buffers: shared hit=293242 read=260670
    ->  Seq Scan on T1  (cost=27984.99..324236.82 rows=73753 width=20) (actual time=2467.499..6202.970 rows=16992 loops=1)
          Filter: ((dt_created >= '2020-05-01 00:00:00-03'::timestamp with time zone) AND (dt_created <= '2020-05-31 23:59:59-03'::timestamp with time zone) AND (NOT (hashed SubPlan 1)))
          Rows Removed by Filter: 6085779
          Buffers: shared hit=188489 read=250098
          SubPlan 1
            ->  Nested Loop  (cost=7845.36..27983.13 rows=745 width=4) (actual time=129.379..1856.518 rows=168690 loops=1)
                  Buffers: shared hit=60760
                  ->  Seq Scan on T3  (cost=0.00..5.21 rows=3 width=8) (actual time=0.057..0.104 rows=3 loops=1)
                        Filter: ((status_request)::text = ANY ('{Closed,Canceled,rejected}'::text[]))
                        Rows Removed by Filter: 125
                        Buffers: shared hit=7
                  ->  Bitmap Heap Scan on T2  (cost=7845.36..9321.70 rows=427 width=8) (actual time=477.324..607.171 rows=56230 loops=3)
                        Recheck Cond: ((dt_change >= '2020-05-01 00:00:00-03'::timestamp with time zone) AND (dt_change <= '2020-05-31 23:59:59-03'::timestamp with time zone) AND (T2.ID_status = T3.ID_status))
                        Rows Removed by Index Recheck: 87203
                        Heap Blocks: exact=36359
                        Buffers: shared hit=60753
                        ->  BitmapAnd  (cost=7845.36..7845.36 rows=427 width=0) (actual time=473.864..473.864 rows=0 loops=3)
                              Buffers: shared hit=24394
                              ->  Bitmap Index Scan on idx_ix_T2_dt_change  (cost=0.00..941.81 rows=30775 width=0) (actual time=47.380..47.380 rows=306903 loops=3)
                                    Index Cond: ((dt_change >= '2020-05-01 00:00:00-03'::timestamp with time zone) AND (dt_change <= '2020-05-31 23:59:59-03'::timestamp with time zone))
                                    Buffers: shared hit=2523
                              ->  Bitmap Index Scan on idx_T2_ID_status  (cost=0.00..6895.49 rows=262724 width=0) (actual time=418.942..418.942 rows=2105165 loops=3)
                                    Index Cond: (ID_status = T3.ID_status)
                                    Buffers: shared hit=21871
    ->  Index Only Scan using idx_ix_T2_id_request on T2  (cost=0.56..8.30 rows=18 width=4) (actual time=0.369..2.859 rows=3 loops=16992)
          Index Cond: (id_request = T1.id_request)
          Heap Fetches: 44807
          Buffers: shared hit=104753 read=10572
  Planning time: 23.424 ms
  Execution time: 54841.261 ms

And here is the main difference when the OR dt_change IS NULL condition is included:

  Planning time: 34.320 ms
  Execution time: 230683.865 ms

Thanks

Gustavo Rottgering
  • Please see this for an explanation of the error you are getting: https://stackoverflow.com/a/39450919/13808319 As for your questions, do your `T2` history records ever get updated (other than adding new change records)? That is, is the history fixed? – Mike Organek Aug 18 '20 at 14:43
  • @MikeOrganek Once a record has a change, that row is fixed. Each new change is a new row in T2 with a new date. That's my main problem: to find only what is still open at that specific date, I need to get the last change of the request before that date that is not closed yet (see the sketch after these comments). – Gustavo Rottgering Aug 18 '20 at 15:31
  • If so, then the error likely is due to changes happening on `T1` while your query is running. Seeing `OR T2.dt_change is NULL` in your subquery is troublesome from a performance standpoint. Is this strictly necessary? Can you please post an `explain (analyze, buffers) ` for your query run over a one-month interval? – Mike Organek Aug 18 '20 at 15:39
  • @MikeOrganek I used the `OR IS NULL` for another problem and just left it there. It can be removed. The `explain (analyze, buffers)` output will be posted as an answer. An extra question: can this explain be run through psycopg2? I usually use it to extract information and then process it with pandas. – Gustavo Rottgering Aug 18 '20 at 17:14
  • Seq Scan on T1 (cost=27984.99..324236.82 rows=73753 width=20) (actual time=2467.499..6202.970 rows=16992 loops=1) Buffers: shared hit=104753 read=10572 Planning time: 23.424 ms Execution time: 54841.261 ms @MikeOrganek is that enough information from the analyzer? – Gustavo Rottgering Aug 18 '20 at 17:22
  • Can you please edit your question and paste in the execution plan? Also, can you please run the `explain` once with the `OR T2.dt_change is NULL` and once without so you can see if there is any impact from including that? – Mike Organek Aug 18 '20 at 17:26
  • @MikeOrganek question edited with info – Gustavo Rottgering Aug 18 '20 at 17:53
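
Following up on the comment about needing the last change of each request before the cutoff date: a common PostgreSQL pattern for "latest row per group" is DISTINCT ON. The sketch below is only an illustration of that idea, using the simplified column names from the question (id_request, dt_change, status); it is not the approach taken in the answer below.

    -- Hedged sketch: take the latest T2 change per request up to the cutoff date,
    -- then keep only the requests whose latest status is not a closing one.
    SELECT last_chg.id_request,
           last_chg.dt_change,
           last_chg.status
      FROM (SELECT DISTINCT ON (T2.id_request)
                   T2.id_request,
                   T2.dt_change,
                   T2.status
              FROM T2
             WHERE T2.dt_change <= '2020-05-31 23:59:59'
             ORDER BY T2.id_request, T2.dt_change DESC) AS last_chg
     WHERE last_chg.status NOT IN ('Closed', 'Canceled', 'rejected');

Note that requests with no T2 rows at all would not appear here; those would still need the T1-based date filter from the question. An index on (id_request, dt_change) would normally help the DISTINCT ON step, but that is a general observation rather than advice specific to this schema.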

1 Answer


It looks like the `OR T2.dt_change IS NULL` is very costly, in that it increased the overall execution time roughly fourfold (from about 55 s to about 231 s).

The only option I can see is changing the `NOT IN` to a `NOT EXISTS`, as below.

SELECT T1.id_request,
       T1.dt_created,
       T1.status
  FROM T1
       LEFT JOIN T2
              ON T1.id_request = T2.id_request
 WHERE T1.dt_created >= '2012-01-01 00:00:00' 
   AND T1.dt_created <= '2020-05-31 23:59:59'
   AND NOT EXISTS (SELECT 1
                     FROM T2
                    WHERE id_request = T1.id_request
                      AND (   (    dt_change >= '2012-01-01 00:00:00' 
                               AND dt_change <= '2020-05-31 23:59:59')
                           OR dt_change IS NULL)
                      AND status IN ('Closed','Canceled','rejected'))

But I expect that to give you only a marginal improvement. Can you please see how much this change helps?

Mike Organek
  • This returned nothing, even for the monthly query. Why use SELECT 1 (I tried with both it and the id_request)? @MikeOrganek – Gustavo Rottgering Aug 18 '20 at 19:19
  • @GustavoRottgering `select 1` is just a convention I picked up a long time ago. That is very strange that it returned no rows. Are you sure that the subquery includes `id_request = T1.id_request`? What comes back if you run only the subquery? – Mike Organek Aug 18 '20 at 19:32
  • Sorry for taking so long (the VPN at work was really slow). Running only the subquery, it returns a column of rows; I couldn't check exactly how many, though. Any ideas why it's not returning anything in the entire query? – Gustavo Rottgering Aug 18 '20 at 21:43
  • @GustavoRottgering It will not return anything if the subquery returns a row for every `T1.id_request`. That's why I asked about whether you had the correlation in the subquery `where` clause (see the diagnostic sketch after these comments). How long did running just the subquery take? – Mike Organek Aug 18 '20 at 22:02
  • It took around 8m for monthly and entire period request. – Gustavo Rottgering Aug 19 '20 at 13:46
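
As a hedged diagnostic for the empty result discussed in the comments above (using the simplified table and column names from the question), the query below counts how many requests in the period have at least one closing change; if that count equals the total, the NOT EXISTS version is expected to return no rows.

    -- Sketch only: check whether the anti-join condition matches every request.
    SELECT COUNT(*) AS total_requests,
           SUM(CASE WHEN EXISTS (SELECT 1
                                   FROM T2
                                  WHERE T2.id_request = T1.id_request
                                    AND T2.dt_change >= '2012-01-01 00:00:00'
                                    AND T2.dt_change <= '2020-05-31 23:59:59'
                                    AND T2.status IN ('Closed', 'Canceled', 'rejected'))
                    THEN 1
                    ELSE 0
               END) AS requests_with_closing_change
      FROM T1
     WHERE T1.dt_created >= '2012-01-01 00:00:00'
       AND T1.dt_created <= '2020-05-31 23:59:59';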