
I need to count the total number of rows in a table with a WHERE clause. My application can tolerate some level of inaccuracy.

SELECT count(*) AS "count" FROM "Orders" AS "Order" WHERE "Order"."orderType" = 'online' AND "Order"."status" = 'paid';

But clearly, this is a very slow query. I came across this answer, but that returns the count of all rows in the table, without a WHERE clause.
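
For reference, the usual fast whole-table count (presumably what that answer describes) reads the planner's estimate from the system catalog instead of scanning the table. It cannot honor a WHERE clause, and its accuracy depends on how recently autovacuum or ANALYZE last ran:

    -- Planner's row estimate for the whole table, no scan needed.
    SELECT reltuples::bigint AS approximate_count
    FROM pg_class
    WHERE relname = 'Orders';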

What's a faster method of counting when I have a WHERE clause? I'm using the Sequelize ORM, so any relevant Sequelize method would also help.

Running EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) AS "count" FROM "Orders" AS "Order" WHERE "Order"."orderType" = 'online' AND "Order"."status" != 'paid'; returns the following:

Aggregate  (cost=47268.10..47268.11 rows=1 width=8) (actual time=719.722..719.723 rows=1 loops=1)
  Buffers: shared hit=32043
  ->  Seq Scan on "Orders" "Order"  (cost=0.00..47044.35 rows=89501 width=0) (actual time=0.011..674.316 rows=194239 loops=1)
        Filter: (((status)::text <> 'paid'::text) AND (("orderType")::text = 'online'::text))
        Rows Removed by Filter: 830133
        Buffers: shared hit=32043
Planning time: 0.069 ms
Execution time: 719.755 ms

farhan
  • @a_horse_with_no_name I have edited the question with EXPLAIN (ANALYZE, BUFFERS). Can you tell me how I can get the CREATE INDEX statements, so I can add those too? – farhan Apr 29 '20 at 12:48
  • Your execution plan doesn't match your query (the plan shows `<> 'paid'` but your query uses `= 'paid'`). –  Apr 29 '20 at 13:09
  • I would try an index on `order (ordertype, status)`, at least for the initial query; `<>` conditions can't really be indexed. –  Apr 29 '20 at 13:10
  • Are both conditions selective, i.e., reduce the result count significantly? – Laurenz Albe Apr 29 '20 at 13:26

1 Answer


My application can tolerate some level of inaccuracy.

This is pretty hard to capitalize on in PostgreSQL. It is fairly easy to get an approximate answer, but hard to put a limit on how approximate that answer is. There will always be cases where the estimate can be very wrong, and it is hard to know when that is occurring without doing the work needed to get an exact answer.

In your query plan, the row estimate is off by a factor of 2.17 (89,501 estimated vs. 194,239 actual). Is that good enough?

(cost=0.00..47044.35 rows=89501 width=0) (actual time=0.011..674.316 rows=194239 loops=1)

Or, can you put bounds on tolerable inaccuracy in some other dimension? Like "accurate as of some point in the last hour"? With that kind of tolerance, you could make a materialized view to partially summarize the data, like:

CREATE MATERIALIZED VIEW order_counts AS
    SELECT "orderType", "status", count(*) AS "count"
    FROM "Orders"
    GROUP BY 1, 2;

and then pull the counts out of that with your WHERE clause (re-summarizing them where necessary), as sketched below. The effectiveness of this depends on the number of combinations of "orderType" and "status" being much smaller than the total number of rows in the main table. You would have to set up a scheduled job to refresh the matview periodically. PostgreSQL will not rewrite your original query to use the matview automatically; you have to query it yourself.
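
A minimal sketch against the matview above (the unique index is an assumption, needed only for the concurrent refresh):

    -- Equality: read the precomputed count directly.
    SELECT "count" FROM order_counts
    WHERE "orderType" = 'online' AND "status" = 'paid';

    -- Inequality: re-summarize the handful of precomputed rows.
    SELECT sum("count") AS "count" FROM order_counts
    WHERE "orderType" = 'online' AND "status" <> 'paid';

    -- Run periodically from a scheduled job (cron, pg_cron, etc.).
    -- CONCURRENTLY avoids blocking readers but requires a unique index:
    --   CREATE UNIQUE INDEX ON order_counts ("orderType", "status");
    REFRESH MATERIALIZED VIEW CONCURRENTLY order_counts;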

You have shown us two different queries, status = 'paid' and status != 'paid'. Is one of those a mistake, or do they reflect variation in the queries you actually want to run? What other things might vary in this pool of similar queries? You should be able to get some speedup using indexes, but which index in particular will depend on your queries. For the equality query, you can include "status" in the index. For the inequality query, that wouldn't do much good, so instead you could use a partial index WHERE status <> 'paid'. (But then that index wouldn't be useful if 'paid' were changed to 'delinquent', for example.)
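
A sketch of both options (the index names are hypothetical):

    -- Equality queries: a composite index on both filter columns,
    -- which also permits an index-only scan for count(*).
    CREATE INDEX orders_type_status_idx ON "Orders" ("orderType", "status");

    -- Inequality queries: a partial index containing only the rows
    -- that match the fixed <> 'paid' condition.
    CREATE INDEX orders_type_not_paid_idx ON "Orders" ("orderType")
        WHERE "status" <> 'paid';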

jjanes
  • Thanks for your answer! 1. It can be accurate as of the last hour or even two hours. 2. I need to do this for both `status = 'paid'` and `status != 'paid'`. 3. And, off by a factor of 2.17 is not really acceptable; errors in the range of 10-15% are still acceptable. – farhan Apr 30 '20 at 08:46