10

I have the following query:

SELECT DISTINCT 
    e.id, 
    folder, 
    subject, 
    in_reply_to, 
    message_id, 
    "references", 
    e.updated_at,
    (
        select count(*)  
        from emails  
        where 
        (
            select "references"[1] 
            from emails 
            where message_id = e.message_id
        ) = ANY ("references") 
        or message_id = 
        (
            select "references"[1] 
            from emails 
            where message_id = e.message_id
        )
    )
FROM "emails" e
INNER JOIN "email_participants" 
    ON ("email_participants"."email_id" = e."id") 
WHERE (("user_id" = 220) 
AND ("folder" = 'INBOX')) 
ORDER BY e."updated_at" DESC 
LIMIT 10 OFFSET 0;

Here is the explain analyze output of the above query.

The query performed fine until I added the count subquery below:

(
    select count(*)  
    from emails  
    where 
    (
        select "references"[1] 
        from emails 
        where message_id = e.message_id
    ) = ANY ("references") 
    or message_id = 
    (
        select "references"[1] 
        from emails 
        where message_id = e.message_id
    )
)

In fact I have tried simpler subqueries and it seems to be the aggregate function itself that is taking the time.

Is there an alternative way that I could append the count onto each result? Should I update the results after the initial query has run, for example?

Here is a pastebin that will create the table and also run the badly performing query at the end to display what the output should be.
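
To illustrate the "update the results after the initial query has run" idea: run the page query above without the count column, then compute the counts only for the ten message_ids it returned. A minimal sketch, where the IN-list values are placeholders the application would fill in from the first query's results:

-- counts only for the ids returned by the first (count-free) page query
SELECT e.message_id,
    (
        select count(*)
        from emails
        where
        (
            select "references"[1]
            from emails
            where message_id = e.message_id
        ) = ANY ("references")
        or message_id =
        (
            select "references"[1]
            from emails
            where message_id = e.message_id
        )
    ) AS message_count
FROM emails e
WHERE e.message_id IN ('<message-id-1>', '<message-id-2>' /* ...the ten ids... */);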

dagda1
  • Your biggest problem is that as an array it's effectively a comma-separated column, and I doubt that the db is able to build an index to support this type of query. Anything it tried to build would be on the order of, wait for it, a regular cross-reference table. Your best option may be to extract this column to another table, potentially with a view to derive the current table if necessary. I also find the use of a specific item index slightly worrying - are you explicitly maintaining that element somehow? – Clockwork-Muse May 21 '14 at 13:46
  • I think you are spot on with regards to extracting this into a table. What do you mean by item index? – dagda1 May 21 '14 at 13:56
  • I agree with @Clockwork-Muse that the best design would be to move these values from an array to another table, with a row for each "slot" in the array (i.e. each array index/element). From the postgresql documentation on arrays (http://www.postgresql.org/docs/9.1/static/arrays.html ): Tip: Arrays are not sets; searching for specific array elements can be a sign of database misdesign. Consider using a separate table with a row for each item that would be an array element. This will be easier to search, and is likely to scale better for a large number of elements. – BateTech May 21 '14 at 21:23
  • That being said, if you do not have the option to migrate these values to another table, postgresql 9.2.1+ does support indexing arrays, so that could help some. Checkout http://stackoverflow.com/questions/4058731/can-postgresql-index-array-columns which the first answer shows how to index the entire array and the 2nd answer gives an example of indexing a single element in an array. – BateTech May 21 '14 at 21:25
  • "Specific item index" refers to behavior like `references[1] = `. The order of items in a set is meta information, and usually shouldn't be relied upon. If you need to keep a specific reference, you may want a specific column. – Clockwork-Muse May 21 '14 at 22:06
  • Indexing first element of references will not help here, as the planner shall iterate on all emails rows that satisfy the condition of the main query anyway. To save a full scan here, OP would need an index on user_id or folder or updated_at depending on the data. What OP really needs is to avoid the inner full scan on emails with an index on the whole references array. – Paul Guyot May 21 '14 at 22:39
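
For reference, a minimal sketch of the separate-table design suggested in the comments above, assuming emails.id is the primary key and message ids are stored as text (the table and column names here are made up, not part of the original schema):

-- one row per element of the "references" array
CREATE TABLE email_references (
    email_id      integer NOT NULL REFERENCES emails (id),
    position      integer NOT NULL,   -- preserves the original array order
    referenced_id text    NOT NULL,   -- one referenced message_id per row
    PRIMARY KEY (email_id, position)
);

-- a plain b-tree index then supports the lookups that
-- = ANY ("references") performs against the array today
CREATE INDEX email_references_referenced_id_idx
    ON email_references (referenced_id);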

4 Answers

5

Expanding on Paul Guyot's answer, you could move the subquery into a derived table, which should perform faster because it fetches the message counts in one scan (plus a join) instead of one scan per row.

SELECT DISTINCT 
    e.id, 
    e.folder, 
    e.subject, 
    in_reply_to, 
    e.message_id, 
    e."references", 
    e.updated_at,
    t1.message_count
FROM "emails" e
INNER JOIN "email_participants" 
    ON ("email_participants"."email_id" = e."id") 
INNER JOIN (
    SELECT COUNT(e2.id) message_count, e.message_id
    FROM emails e
    LEFT JOIN emails e2 ON (ARRAY[e."references"[1]] <@ e2."references"
    OR e2.message_id = e."references"[1])
    GROUP BY e.message_id
) t1 ON t1.message_id = e.message_id
WHERE (("user_id" = 220) 
AND ("folder" = 'INBOX')) 
ORDER BY e."updated_at" DESC 
LIMIT 10 OFFSET 0;

Fiddle using pastebin data - http://www.sqlfiddle.com/#!15/c6298/7

Below are the query plans postgres produces for getting count in a correlated subquery vs getting count by joining a derived table. I used one of my own tables but I think the results should be similar.

Correlated Subquery

"Limit  (cost=0.00..1123641.81 rows=1000 width=8) (actual time=11.237..5395.237 rows=1000 loops=1)"
"  ->  Seq Scan on visit v  (cost=0.00..44996236.24 rows=40045 width=8) (actual time=11.236..5395.014 rows=1000 loops=1)"
"        SubPlan 1"
"          ->  Aggregate  (cost=1123.61..1123.62 rows=1 width=0) (actual time=5.393..5.393 rows=1 loops=1000)"
"                ->  Seq Scan on visit v2  (cost=0.00..1073.56 rows=20018 width=0) (actual time=0.002..4.280 rows=21393 loops=1000)"
"                      Filter: (company_id = v.company_id)"
"                      Rows Removed by Filter: 18653"
"Total runtime: 5395.369 ms"

Joining a Derived Table

"Limit  (cost=1173.74..1211.81 rows=1000 width=12) (actual time=21.819..22.629 rows=1000 loops=1)"
"  ->  Hash Join  (cost=1173.74..2697.72 rows=40036 width=12) (actual time=21.817..22.465 rows=1000 loops=1)"
"        Hash Cond: (v.company_id = visit.company_id)"
"        ->  Seq Scan on visit v  (cost=0.00..973.45 rows=40045 width=8) (actual time=0.010..0.198 rows=1000 loops=1)"
"        ->  Hash  (cost=1173.71..1173.71 rows=2 width=12) (actual time=21.787..21.787 rows=2 loops=1)"
"              Buckets: 1024  Batches: 1  Memory Usage: 1kB"
"              ->  HashAggregate  (cost=1173.67..1173.69 rows=2 width=4) (actual time=21.783..21.784 rows=3 loops=1)"
"                    ->  Seq Scan on visit  (cost=0.00..973.45 rows=40045 width=4) (actual time=0.003..6.695 rows=40046 loops=1)"
"Total runtime: 22.806 ms"
FuzzyTree
  • There are a couple of typos in the query, but since OP didn't provide the actual schema it is difficult to come up with a full query. If I got it right in spite of these errors, this derived table solution seems to be much slower (4 times slower with 1000 dummy rows) as it requires *two* scans of the emails table (the index saves only one). – Paul Guyot May 27 '14 at 14:17
  • @PaulGuyot in my experience derived tables are usually faster than selecting a subquery for the same reason mentioned in Neto's answer i.e. 1 search per row of the result set vs getting the counts in one go and doing a hash join – FuzzyTree May 27 '14 at 15:54
3

From what I understand of the semantics of your query, you can simplify:

select count(*)  
from emails  
where 
(
    select "references"[1] 
    from emails 
    where message_id = e.message_id
) = ANY ("references") 
or message_id = 
(
    select "references"[1] 
    from emails 
    where message_id = e.message_id
)

to:

select count(*)  
from emails  
where 
e."references"[1] = ANY ("references") OR message_id = e."references"[1]

Granted, message_id is not necessarily unique, but if you did have several distinct rows for a given value of message_id, your original query would fail anyway (the scalar subquery would return more than one row).

This simplification does not, however, change the cost of the query significantly. Indeed, the issue here is that you need two full scans of table emails to perform the query (as well as an index scan on emails_message_id_index). You could save one full scan by using an index on the references array.

You would create such an index with:

CREATE INDEX emails_references_index ON emails USING GIN ("references");

The index alone already helps the initial query significantly: provided the statistics are up to date and there is a sufficiently large number of rows, PostgreSQL will perform an index scan. Still, you should alter the subquery as follows to help the planner perform a bitmap index scan on this array index:

select count(*)  
from emails
where 
ARRAY[e."references"[1]] <@ "references"
OR message_id = e."references"[1]

The final query would read:

SELECT DISTINCT 
    e.id, 
    folder, 
    subject, 
    in_reply_to, 
    message_id, 
    "references", 
    e.updated_at,
    (
        select count(*)  
        from emails
        where 
        ARRAY[e."references"[1]] <@ "references"
        OR message_id = e."references"[1]
    )
FROM "emails" e
INNER JOIN "email_participants" 
    ON ("email_participants"."email_id" = e."id") 
WHERE (("user_id" = 220) 
AND ("folder" = 'INBOX')) 
ORDER BY e."updated_at" DESC 
LIMIT 10 OFFSET 0;

To illustrate the expected gains, some tests were conducted in a dummy environment:

  • with 10,000 rows in table emails (and corresponding rows in table email_participants), the initial query runs in 787 ms; with the index scan this drops to 399 ms, and the proposed query runs in 12 ms;
  • with 100,000 rows, the initial query runs in 9,200 ms; with the index scan this drops to 4,251 ms, and the proposed query runs in 637 ms.
Paul Guyot
2

It is not easy to get this right without test data:

select
    e.id,
    folder,
    subject,
    in_reply_to,
    message_id,
    "references",
    e.updated_at,
    sum(the_count) as the_count
from
    (
        select *, (
                "references"[1] = any ("references")
                or
                message_id = "references"[1]
            )::integer as the_count
        from emails
    ) e
    inner join
    email_participants on email_participants.email_id = e.id
where
    user_id = 220
    and
    folder = 'INBOX'
group by 1, 2, 3, 4, 5, 6, 7
order by e.updated_at desc
limit 10 offset 0;

The reason your query is slow is that you do a table or an index search for each row of your result set. That is called a correlated subquery.

The group by 1, 2, ... is just shorthand for the columns in the select list, referenced by their position.

The cast from boolean to integer yields 1 or 0.
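
For illustration (these snippets are not part of the answer's query and only reuse the emails columns as examples):

-- group by 1, 2 refers to the select list by position,
-- so these two statements are equivalent:
SELECT folder, subject, count(*) FROM emails GROUP BY 1, 2;
SELECT folder, subject, count(*) FROM emails GROUP BY folder, subject;

-- casting a boolean to integer yields 1 for true and 0 for false:
SELECT (2 > 1)::integer AS t, (1 > 2)::integer AS f;   -- returns 1, 0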

Clodoaldo Neto
  • That is amazing. I wish I knew why this is so much faster. Could you please explain. Also could you please explain the grouping group 1, 2, 3, 4, 5, 6, 7. That is not something I have seen before. – dagda1 May 09 '14 at 13:46
  • Actually, it is only ever pulling back a maximum count of 1, which is not accurate. In my query, I am checking the subquery's message_id against the outer query's message_id, which is probably what is slowing things up. – dagda1 May 09 '14 at 14:10
  • @dagda1 Ok I have a couple more ideas but without data I'm in the dark. As an example, in your subquery as `message_id` is not a primary key (or is it?) it will throw the error `more than one row returned by a subquery used as an expression`. So show your table's schema, some data and desired output. – Clodoaldo Neto May 09 '14 at 14:49
  • I have updated the question and here is a pastebin which will create the table and runs my bad query at the end to show what the output should be http://pastebin.com/nzwCGxD2 – dagda1 May 09 '14 at 15:30
0

I used your query in the pastebin as a starting point. This differs from the one posted here in that it doesn't join the email_participants table.

I believe it can be as simple as this (or did I miss something?):

SELECT e.id, e.folder, e.subject, e.message_id, e."references", e.updated_at, COUNT(e1.message_id)
FROM emails e
LEFT OUTER JOIN emails e1
ON e1.message_id = e.message_id
AND (e1."references"[1] = ANY (e."references") OR e1."references"[1] = e.message_id)
GROUP BY e.id, e.folder, e.subject, e.message_id, e."references", e.updated_at;
wvdz
  • This is unlikely to perform as well as the OP would like - among other things, that `GROUP BY` is too wide to be part of a likely index. Assuming that the db isn't able to index the array column effectively, the query will likely still require full table scans for the join. You did move it out of a likely RBAR situation, which may be some help. – Clockwork-Muse May 21 '14 at 21:27