I have a flat table that I want to organize. The table basically represents a tree structure:
Channel -> (n0) Partners -> (n1) CampaignGroups -> (n2) Campaigns -> ... (ni) Other levels
CREATE TABLE campaign_tree (
channel_id int,
channel_name text,
partner_name text,
campaign_group_name text,
campaign_name text,
ad_name text
);
In order to sanitize the data, make names case-insensitive, and lose redundant IDs, I first find the data that needs to be updated. So I have 2 approaches to this problem:
Approach 1
First get the structure of the tree on the upper levels, then lose the different IDs for the same names:
SELECT
count(1),
min(campaign_id) AS new_campaign_id,
campaign_name,
channel_name,
partner_name,
campaign_group_name
FROM
(SELECT DISTINCT
campaign_id,
upper(channel_name) AS channel_name,
upper(partner_name) AS partner_name,
upper(campaign_group_name) AS campaign_group_name,
upper(campaign_name) AS campaign_name
FROM
campaign_tree
) tmp
GROUP BY channel_name, partner_name, campaign_group_name, campaign_name
HAVING count(1)>1 --only need to get those that we need to sanitize
This query takes around 350ms to execute. The query plan is as follows:
HashAggregate (cost=18008.63..18081.98 rows=5868 width=136) (actual time=391.868..404.130 rows=33 loops=1)
Output: count(1), min(campaign_tree.campaign_id), (upper(campaign_tree.campaign_name)), (upper(campaign_tree.channel_name)), (upper(campaign_tree_campaign_code.partner_name)), (upper(campaign_tree.campaign_group_name))
Group Key: (upper(campaign_tree.channel_name)), (upper(campaign_tree.partner_name)), (upper(campaign_tree.campaign_group_name)), (upper(campaign_tree.campaign_name))
Filter: (count(1) > 1)
Rows Removed by Filter: 64855
-> Unique (cost=15324.20..16394.93 rows=58680 width=83) (actual time=282.253..338.041 rows=64998 loops=1)
Output: campaign_tree_campaign_code.campaign_id, (upper(campaign_tree.channel_name)), (upper(campaign_tree.partner_name)), (upper(campaign_tree.campaign_group_name)), (upper(campaign_tree.campaign_name))
-> Sort (cost=15324.20..15502.65 rows=71382 width=83) (actual time=282.251..305.340 rows=71382 loops=1)
Output: campaign_tree_campaign_code.campaign_id, (upper(campaign_tree.channel_name)), (upper(campaign_tree.partner_name)), (upper(campaign_tree.campaign_group_name)), (upper(campaign_tree.campaign_name))
Sort Key: campaign_tree.campaign_id, (upper(campaign_tree.channel_name)), (upper(campaign_tree.partner_name)), (upper(campaign_tree.campaign_group_name)), (upper(campaign_tree.campaign_name))
Sort Method: external merge Disk: 6608kB
-> Seq Scan on campaign_tree (cost=0.00..6153.64 rows=71382 width=83) (actual time=0.015..146.611 rows=71382 loops=1)
Output: campaign_tree.campaign_id, upper(campaign_tree.channel_name), upper(campaign_tree.partner_name), upper(campaign_tree.campaign_group_name), upper(campaign_tree.campaign_name)
Planning time: 0.085 ms
Execution time: 407.383 ms
Approach 2
A direct approach: count the distinct ids of items with the same name. Also determine the minimum id of these distinct ids.
SELECT
count(distinct campaign_id) AS cnt,
min(campaign_id) AS new_campaign_id,
upper(campaign_name) AS campaign_name,
upper(channel_name) AS channel_name,
upper(partner_name) AS partner_name,
upper(campaign_group_name) AS campaign_group_name
FROM campaign_tree
GROUP BY upper(channel_name), upper(partner_name), upper(campaign_group_name), upper(campaign_name)
HAVING count(distinct campaign_id)>1
Results are the same, just in a different order. Execution time is around 4 seconds each time. Query plan is as follows:
GroupAggregate (cost=15324.20..17912.57 rows=51588 width=83) (actual time=3723.908..4004.447 rows=33 loops=1)
Output: count(DISTINCT campaign_id), min(campaign_id), (upper(campaign_name)), (upper(channel_name)), (upper(partner_name)), (upper(campaign_group_name))
Group Key: (upper(campaign_tree.channel_name)), (upper(campaign_tree.partner_name)), (upper(campaign_tree.campaign_group_name)), (upper(campaign_tree.campaign_name))
Filter: (count(DISTINCT campaign_tree.campaign_id) > 1)
Rows Removed by Filter: 64855
-> Sort (cost=15324.20..15502.65 rows=71382 width=83) (actual time=3718.016..3934.400 rows=71382 loops=1)
Output: (upper(campaign_name)), (upper(channel_name)), (upper(partner_name)), (upper(campaign_group_name)), campaign_id
Sort Key: (upper(campaign_tree.channel_name)), (upper(campaign_tree.partner_name)), (upper(campaign_tree.campaign_group_name)), (upper(campaign_tree.campaign_name))
Sort Method: external merge Disk: 6880kB
-> Seq Scan on campaign_tree (cost=0.00..6153.64 rows=71382 width=83) (actual time=0.014..150.634 rows=71382 loops=1)
Output: upper(campaign_name), upper(channel_name), upper(partner_name), upper(campaign_group_name), campaign_id
Planning time: 0.066 ms
Execution time: 4006.323 ms
Approach 3
After some discussion, I decided to try and change the second approach, and refer to expressions instead of explicitly write them in the GROUP BY
clause:
SELECT
count(distinct campaign_id) AS cnt,
min(campaign_id) AS new_campaign_id,
upper(campaign_name) AS campaign_name,
upper(channel_name) AS channel_name,
upper(partner_name) AS partner_name,
upper(campaign_group_name) AS campaign_group_name
FROM campaign_tree
GROUP BY 3, 4, 5, 6
HAVING count(distinct campaign_id)>1
Query Plan:
GroupAggregate (cost=15324.20..17912.57 rows=51588 width=83) (actual time=1148.957..1316.564 rows=33 loops=1)
Output: count(DISTINCT campaign_id), min(campaign_id), (upper(campaign_name)), (upper(channel_name)), (upper(partner_name)), (upper(campaign_group_name))
Group Key: (upper(campaign_tree.campaign_name)), (upper(campaign_tree.channel_name)), (upper(campaign_tree.partner_name)), (upper(campaign_tree.campaign_group_name))
Filter: (count(DISTINCT campaign_tree.campaign_id) > 1)
Rows Removed by Filter: 64855
-> Sort (cost=15324.20..15502.65 rows=71382 width=83) (actual time=1148.849..1240.184 rows=71382 loops=1)
Output: (upper(campaign_name)), (upper(channel_name)), (upper(partner_name)), (upper(campaign_group_name)), campaign_id
Sort Key: (upper(campaign_tree.campaign_name)), (upper(campaign_tree.channel_name)), (upper(campaign_tree.partner_name)), (upper(campaign_tree.campaign_group_name))
Sort Method: external merge Disk: 6880kB
-> Seq Scan on campaign_tree (cost=0.00..6153.64 rows=71382 width=83) (actual time=0.014..148.835 rows=71382 loops=1)
Output: upper(campaign_name), upper(channel_name), upper(partner_name), upper(campaign_group_name), campaign_id
Planning time: 0.067 ms
Execution time: 1318.397 ms
And no, there are no indexes created on this table. I know they will improve things. That's not the point of this question.
The question is: why is there such a big difference in execution time? The query plan doesn't shed any light for me.