
I have a flat table that I want to organize. The table basically represents a tree structure:

Channel -> (n0) Partners -> (n1) CampaignGroups -> (n2) Campaigns -> ... (ni) Other levels

CREATE TABLE campaign_tree (
    campaign_id int,
    channel_name text,
    partner_name text,
    campaign_group_name text,
    campaign_name text,
    ad_name text
);

In order to sanitize the data, make names case-insensitive, and drop the redundant IDs, I first find the rows that need to be updated. I have three approaches to this problem:

Approach 1
First get the structure of the tree at the upper levels, then collapse the different IDs for the same names:

SELECT
    count(1),
    min(campaign_id) AS new_campaign_id,
    campaign_name,
    channel_name,
    partner_name,
    campaign_group_name
FROM
(SELECT DISTINCT
    campaign_id,
    upper(channel_name) AS channel_name,
    upper(partner_name) AS partner_name,
    upper(campaign_group_name) AS campaign_group_name,
    upper(campaign_name) AS campaign_name
FROM
    campaign_tree
) tmp
GROUP BY channel_name, partner_name, campaign_group_name, campaign_name
HAVING count(1)>1 --only need to get those that we need to sanitize

This query takes around 350ms to execute. The query plan is as follows:

HashAggregate  (cost=18008.63..18081.98 rows=5868 width=136) (actual time=391.868..404.130 rows=33 loops=1)
  Output: count(1), min(campaign_tree.campaign_id), (upper(campaign_tree.campaign_name)), (upper(campaign_tree.channel_name)), (upper(campaign_tree.partner_name)), (upper(campaign_tree.campaign_group_name))
  Group Key: (upper(campaign_tree.channel_name)), (upper(campaign_tree.partner_name)), (upper(campaign_tree.campaign_group_name)), (upper(campaign_tree.campaign_name))
  Filter: (count(1) > 1)
  Rows Removed by Filter: 64855
  ->  Unique  (cost=15324.20..16394.93 rows=58680 width=83) (actual time=282.253..338.041 rows=64998 loops=1)
        Output: campaign_tree.campaign_id, (upper(campaign_tree.channel_name)), (upper(campaign_tree.partner_name)), (upper(campaign_tree.campaign_group_name)), (upper(campaign_tree.campaign_name))
        ->  Sort  (cost=15324.20..15502.65 rows=71382 width=83) (actual time=282.251..305.340 rows=71382 loops=1)
              Output: campaign_tree.campaign_id, (upper(campaign_tree.channel_name)), (upper(campaign_tree.partner_name)), (upper(campaign_tree.campaign_group_name)), (upper(campaign_tree.campaign_name))
              Sort Key: campaign_tree.campaign_id, (upper(campaign_tree.channel_name)), (upper(campaign_tree.partner_name)), (upper(campaign_tree.campaign_group_name)), (upper(campaign_tree.campaign_name))
              Sort Method: external merge  Disk: 6608kB
              ->  Seq Scan on campaign_tree  (cost=0.00..6153.64 rows=71382 width=83) (actual time=0.015..146.611 rows=71382 loops=1)
                    Output: campaign_tree.campaign_id, upper(campaign_tree.channel_name), upper(campaign_tree.partner_name), upper(campaign_tree.campaign_group_name), upper(campaign_tree.campaign_name)
Planning time: 0.085 ms
Execution time: 407.383 ms

Approach 2
A direct approach: count the distinct IDs of items that share the same name, and also determine the minimum of those IDs.

SELECT
    count(distinct campaign_id) AS cnt,
    min(campaign_id) AS new_campaign_id,
    upper(campaign_name) AS campaign_name,
    upper(channel_name) AS channel_name,
    upper(partner_name) AS partner_name,
    upper(campaign_group_name) AS campaign_group_name
FROM campaign_tree
GROUP BY upper(channel_name), upper(partner_name), upper(campaign_group_name), upper(campaign_name)
HAVING count(distinct campaign_id)>1

Results are the same, just in a different order. Execution time is around 4 seconds each time. Query plan is as follows:

GroupAggregate  (cost=15324.20..17912.57 rows=51588 width=83) (actual time=3723.908..4004.447 rows=33 loops=1)
  Output: count(DISTINCT campaign_id), min(campaign_id), (upper(campaign_name)), (upper(channel_name)), (upper(partner_name)), (upper(campaign_group_name))
  Group Key: (upper(campaign_tree.channel_name)), (upper(campaign_tree.partner_name)), (upper(campaign_tree.campaign_group_name)), (upper(campaign_tree.campaign_name))
  Filter: (count(DISTINCT campaign_tree.campaign_id) > 1)
  Rows Removed by Filter: 64855
  ->  Sort  (cost=15324.20..15502.65 rows=71382 width=83) (actual time=3718.016..3934.400 rows=71382 loops=1)
        Output: (upper(campaign_name)), (upper(channel_name)), (upper(partner_name)), (upper(campaign_group_name)), campaign_id
        Sort Key: (upper(campaign_tree.channel_name)), (upper(campaign_tree.partner_name)), (upper(campaign_tree.campaign_group_name)), (upper(campaign_tree.campaign_name))
        Sort Method: external merge  Disk: 6880kB
        ->  Seq Scan on campaign_tree (cost=0.00..6153.64 rows=71382 width=83) (actual time=0.014..150.634 rows=71382 loops=1)
              Output: upper(campaign_name), upper(channel_name), upper(partner_name), upper(campaign_group_name), campaign_id
Planning time: 0.066 ms
Execution time: 4006.323 ms

Approach 3
After some discussion, I decided to change the second approach and refer to the expressions by position instead of writing them out explicitly in the GROUP BY clause:

SELECT
    count(distinct campaign_id) AS cnt,
    min(campaign_id) AS new_campaign_id,
    upper(campaign_name) AS campaign_name,
    upper(channel_name) AS channel_name,
    upper(partner_name) AS partner_name,
    upper(campaign_group_name) AS campaign_group_name
FROM campaign_tree
GROUP BY 3, 4, 5, 6
HAVING count(distinct campaign_id)>1

Query Plan:

GroupAggregate  (cost=15324.20..17912.57 rows=51588 width=83) (actual time=1148.957..1316.564 rows=33 loops=1)
  Output: count(DISTINCT campaign_id), min(campaign_id), (upper(campaign_name)), (upper(channel_name)), (upper(partner_name)), (upper(campaign_group_name))
  Group Key: (upper(campaign_tree.campaign_name)), (upper(campaign_tree.channel_name)), (upper(campaign_tree.partner_name)), (upper(campaign_tree.campaign_group_name))
  Filter: (count(DISTINCT campaign_tree.campaign_id) > 1)
  Rows Removed by Filter: 64855
  ->  Sort  (cost=15324.20..15502.65 rows=71382 width=83) (actual time=1148.849..1240.184 rows=71382 loops=1)
        Output: (upper(campaign_name)), (upper(channel_name)), (upper(partner_name)), (upper(campaign_group_name)), campaign_id
        Sort Key: (upper(campaign_tree.campaign_name)), (upper(campaign_tree.channel_name)), (upper(campaign_tree.partner_name)), (upper(campaign_tree.campaign_group_name))
        Sort Method: external merge  Disk: 6880kB
        ->  Seq Scan on campaign_tree  (cost=0.00..6153.64 rows=71382 width=83) (actual time=0.014..148.835 rows=71382 loops=1)
              Output: upper(campaign_name), upper(channel_name), upper(partner_name), upper(campaign_group_name), campaign_id
Planning time: 0.067 ms
Execution time: 1318.397 ms

And no, there are no indexes on this table. I know they would improve things; that's not the point of this question.

The question is: why is there such a big difference in execution time? The query plans don't shed any light on this for me.

Alex
  • Please show the output of `explain (analyze, verbose)` not plain analyze –  Oct 20 '16 at 08:37
  • You are grouping by an expression in the second case with the first ("most significant") column being one as well. In the first statement the first "sort" key is an integer - I guess that makes a difference. Might have to do with sorting optimization due to "abbreviated" keys (which might not be possible or as efficient with an expression involving strings - plus on some platforms this has been disabled due to a bug in collation handling) –  Oct 20 '16 at 08:51
  • Wouldn't Postgresql optimize it? Oh wait, it doesn't. I changed the `group by` clause like so: `GROUP BY 3, 4, 5, 6` and it now runs for 1.627 seconds. Still significantly slower than the first approach. Wow, I'm surprised at this. I'll update the question with the relevant info. Thanks for the suggestion. – Alex Oct 20 '16 at 08:59
  • The aggregation on the (three!) upper()s looks ugly to me. I'd try to first pre-aggregate without the upper(), and sum over its result afterwards. That will probably allow the aggregates to fit into hash tables. – wildplasser Oct 20 '16 at 09:45
  • The reason for upper() is to have a case-insensitive collation. It shouldn't be that costly. – Alex Oct 20 '16 at 11:06
  • The problem is that upper/lower are not sargable, so an index would not help and sorting will be needed. https://en.wikipedia.org/wiki/Sargable – wildplasser Oct 22 '16 at 11:59
  • In Postgresql, indexes can be created on expressions, which are then used when those expressions appear in queries. http://stackoverflow.com/questions/7005302/postgresql-how-to-make-case-insensitive-query check out the correct answer, and the most popular comment to it. – Alex Oct 24 '16 at 15:03
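
For reference, the kind of expression index the last two comments are discussing would look something like this (an untested sketch; the index name is illustrative, and no such index was in place for any of the timings above):

CREATE INDEX campaign_tree_upper_idx
    ON campaign_tree (upper(channel_name), upper(partner_name),
                      upper(campaign_group_name), upper(campaign_name));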

4 Answers


Reading the plans, it looks like they diverge where you do the Unique versus the GROUP BY with count(distinct campaign_id).

This suggests to me that the issue is that HAVING count(*) > 1 (the same as the count(1) you are doing) is much cheaper than HAVING count(distinct campaign_id) > 1.

Which makes sense: in the former the rows have already been grouped, while in the latter a secondary calculation has to be made on each grouped set.
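
To spell that out: if the distinct semantics are needed, the (group key, campaign_id) pairs can be deduplicated first, so that the outer aggregate only needs a plain count(*); this is essentially what Approach 1 in the question already does (a sketch, not a new suggestion):

SELECT count(*) AS cnt,
       min(campaign_id) AS new_campaign_id,
       channel_name, partner_name, campaign_group_name, campaign_name
FROM (SELECT DISTINCT campaign_id,
             upper(channel_name)        AS channel_name,
             upper(partner_name)        AS partner_name,
             upper(campaign_group_name) AS campaign_group_name,
             upper(campaign_name)       AS campaign_name
      FROM campaign_tree) t
GROUP BY channel_name, partner_name, campaign_group_name, campaign_name
HAVING count(*) > 1;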

Chris Travers
  • Not really the same. Changing `count(distinct campaign_id)`, both in the `having` clause and in the select list, to `count(*)` would just count all of the leaves down this tree, and the result would be much different in this case. And it's weird because in the first approach I am also selecting a `distinct` set. – Alex Oct 20 '16 at 08:56
  • You misunderstood. count(*) and count(1) are the same, performance-wise. Sorry if I was not clear. – Chris Travers Oct 20 '16 at 08:57
  • My point was that having count(distinct ...) looks like it is extremely expensive and prevents certain optimizations. – Chris Travers Oct 20 '16 at 08:58
  • Yes, I misunderstood. I'm just used to writing `count(1)`, that's all. As for distinct - I do need that. And I get that in the first approach, at a very low cost. Not in the second or third (like the second approach, but with optimized grouping, that would seem superfluous, but is definitely not). – Alex Oct 20 '16 at 09:07
  • @ChrisTravers see my answer for some simulated data – pietrop Oct 20 '16 at 23:59

Just a thought, but you might instead try:

having max(campaign_id) > min(campaign_id)

It ought to be easier for the executor to track the min and max values than to count the number of distinct ids.
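
Applied to the question's second query, the suggestion would look something like this (an untested sketch; the cnt column is dropped, since computing it would reintroduce count(distinct)):

SELECT
    min(campaign_id) AS new_campaign_id,
    upper(campaign_name) AS campaign_name,
    upper(channel_name) AS channel_name,
    upper(partner_name) AS partner_name,
    upper(campaign_group_name) AS campaign_group_name
FROM campaign_tree
GROUP BY upper(channel_name), upper(partner_name), upper(campaign_group_name), upper(campaign_name)
HAVING max(campaign_id) > min(campaign_id);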

David Aldridge
  • Great idea! But unfortunately it doesn't do anything to reduce the execution time. And just to make the experiment run well, I removed the `count(distinct campaign_id)` entirely from the query. Still the same time to run, which means that there's nothing wrong with `count(distinct campaign_id)` in this situation (as long as there are other aggregate functions on campaign_id anyway). – Alex Oct 20 '16 at 09:41

[elaborating on my comment about pre-aggregation] IMHO the cost is in the sorting, which is needed for aggregation over a set of function results and cannot be done any other way (at least not in the current version).

The real solution would of course be to constrain the four domains (maybe even enumerate them, and/or squeeze them into separate tables; a sketch of that layout follows the query below).

untested, because I don't have the table definitions:

SELECT
    channel_name
    , partner_name
    , campaign_group_name
    , campaign_name
    , min(campaign_id) AS new_campaign_id
    , sum(the_count) AS the_count
FROM (SELECT DISTINCT
        upper(channel_name) AS channel_name
        , upper(partner_name) AS partner_name
        , upper(campaign_group_name) AS campaign_group_name
        , upper(campaign_name) AS campaign_name
        , MIN(campaign_id) AS campaign_id
        , sum(the_count) AS the_count
        FROM (SELECT DISTINCT
            channel_name AS channel_name
            , partner_name AS partner_name
            , campaign_group_name AS campaign_group_name
            , campaign_name AS campaign_name
            , MIN(campaign_id) AS campaign_id
            , COUNT(1) AS the_count
            FROM campaign_tree
            GROUP BY 1,2,3,4
            ) pre
    GROUP BY 1,2,3,4
    ) agg
GROUP BY channel_name, partner_name, campaign_group_name, campaign_name
HAVING sum(the_count) > 1 --only need to get those that we need to sanitize
        ;
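
A hypothetical shape of the normalized layout alluded to above (all names are illustrative; this is a sketch, not a migration plan):

CREATE TABLE channel (
    channel_id   serial PRIMARY KEY,
    channel_name text NOT NULL UNIQUE   -- store one canonical casing
);

CREATE TABLE partner (
    partner_id   serial PRIMARY KEY,
    channel_id   int NOT NULL REFERENCES channel,
    partner_name text NOT NULL,
    UNIQUE (channel_id, partner_name)
);

-- ... and likewise campaign_group referencing partner,
-- and campaign referencing campaign_group.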
wildplasser
  • I had to modify this query. Replace distinct with group by for the innermost 2 queries, and I don't need to trickle up the count, since I needed a way to trim the tree to the top 4 levels (channel, partner, campaign group, campaign), and only THEN to count duplicates (same campaign name, with case-insensitive match, but different IDs). I ended up with something similar to my first approach, but with 3 levels - aggregation, transformation (upper), and then aggregation again. Even though it was more similar to my first approach, it was as slow as the second approach. – Alex Oct 20 '16 at 11:18
  • Yes, that is probably correct. It *could* have been without errors if you had added the table definition(s) to your question. BTW: why do you have four levels of hierarchy combined in one table? Looks like a lot of redundancy (in the upper levels), which has to be aggregated away afterwards ... – wildplasser Oct 20 '16 at 11:22
  • This query costs less but is slower. Tested with simulated data. – pietrop Oct 20 '16 at 23:29
  • @wildplasser see my answer for some testing data. – pietrop Oct 20 '16 at 23:57
  • I added a table definition. Not that it's much help, since the point is in the data. The reason it's flat is that it's imported as CSV, and it needs to be sanitized and structured afterwards, which is what this question is really about. – Alex Oct 21 '16 at 08:34
  • @pietrop The table is ok (thanks!), but the data is not ;-) The four text columns all stem from the same 21-word domain, and there are no upper<-->lowercase duplicates. – wildplasser Oct 22 '16 at 11:52

This is not a real answer, but simply an aid for others who would like to test on some simulated data. I hope it helps on understanding what's going on.

Here is the Python code to create a table named campaign_tree in your database and populate it with n=71382 rows of simulated data (I took this number from the plans):

import random

n = 71382
table_name = "campaign_tree"

# Single quotes inside the words are doubled ('') so they stay valid
# inside SQL string literals.
words = ["You", "may", "say", "I''m", "a", "dreamer", "But", "I''m", "not",
         "the", "only", "one", "I", "hope", "someday", "you''ll", "join",
         "us", "And", "the", "world", "will", "be", "as", "one"]
lwords = len(words) - 1

transaction = """
BEGIN;
CREATE TABLE """ + table_name + """
(
    campaign_id integer,
    campaign_name text,
    channel_name text,
    partner_name text,
    campaign_group_name text
);

INSERT INTO """ + table_name + """ (campaign_id, campaign_name, channel_name, partner_name, campaign_group_name)
VALUES """

# One row per campaign_id; the four name columns are drawn at random from
# the word list (randint(1, lwords) never picks index 0, so "You" never appears).
values = []
for i in range(1, n + 1):
    values.append("(" + str(i) + ", '" +
                  words[random.randint(1, lwords)] + "', '" +
                  words[random.randint(1, lwords)] + "', '" +
                  words[random.randint(1, lwords)] + "', '" +
                  words[random.randint(1, lwords)] + "')")

transaction = transaction + ",\n".join(values) + "; COMMIT;"

with open("test.sql", "w") as foutput:
    foutput.write(transaction)

Save it as test.py, then execute python3 test.py. It will generate a file named test.sql. Finally, execute psql -f test.sql and you are done. Happy testing :)
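
To reproduce the timings from the question against this data set, something along these lines should work (the database name is illustrative):

createdb sanitize_test
psql -d sanitize_test -f test.sql
psql -d sanitize_test
    => \timing on
    => -- then paste any of the three queries from the question,
    => -- or prefix them with EXPLAIN (ANALYZE, VERBOSE)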

pietrop