Speeding up an expensive join operation - Huge table with a small one

Question

I'm using Postgres to store a large number of transactions and trying to keep the read times for a specific Select statement in tens of milliseconds.

Schema of TableA (> 100mm rows): (userID int, itemID int). Indexed by userID

Schema of TableB (1mm rows): (categoryID int, itemID int). Indexed by categoryID. Number of categories = 500 and each itemID only belongs to one category.

The query I want to optimize for which currently takes me ~100 ms to execute is:

select * from TableA 
where userID = x and itemID in 
(select itemID from TableB
where categoryID = y)

A simple way to solve this would be to create a denormalised table with userID, itemID and categoryID as columns and index on (userID, categoryID). However, the categoryID -> itemID mapping can change so I wanted to avoid doing a full scan of the table and update the rows each time this happens.

Are there any other techniques/indexing method to speed up this JOIN operation? Any alternative ways to arrange the data would also be appreciated. Thanks!

Edit: Adding a sample query plan.

[('  ->  Hash Semi Join  (cost=159.50..382.67 rows=164 width=50)'),
 ('        Hash Cond: (tableA.itemId = tableB.itemId)'),
 ('        ->  Index Scan using userId on tableA  (cost=0.57..208.31 rows=5185 width=50)'),
 ('              Index Cond: (userId = 4000)'),
 ('        ->  Hash  (cost=117.05..117.05 rows=3350 width=4)'),
 ('              Buckets: 4096  Batches: 1  Memory Usage: 161kB',),
 ('              ->  Index Scan using categoryId on tableB (cost=0.42..117.05 rows=3350 width=4)'),
 ('                    Index Cond: (categoryId = 1002)',), ('Planning time: 0.149 ms',)]

Do you have an explain plan for that query? Have you tired EXISTS instead of IN ? — Paul Maxwell, Jan 04 '19 at 08:47
ty for noting. I added a sample query plan. But not sure if Exists is the right method for the purpose when I want to pull in rows? I'll give it a try! — ananis, Jan 04 '19 at 08:58
Thanks for the explain plan. Index scans seem reasonable here and I see that exists is same/similar. Nothing much I can suggest. — Paul Maxwell, Jan 04 '19 at 09:43
right, looks difficult to improve on the query. Do you see an alternative way to arrange the data - mix/match columns? — ananis, Jan 04 '19 at 10:29
It is a simple "lookup" arrangement, I don't see anything obvious, sorry. — Paul Maxwell, Jan 04 '19 at 10:39

score 1 · Answer 1 · answered Jan 04 '19 at 08:49

1

Maybe Exists will help here: Difference between EXISTS and IN

For your query:

Select * from TableA a
Where userID = x
and exists (Select itemId from TableB b where categoryID = y  and a.itemId = b.itemId)

answered Jan 04 '19 at 08:49

M. Schreiber

96
5

thanks for your suggestion. Postgres yields an identical query plan and similar execution times for this as the _in_ equivalent I was using earlier. – ananis Jan 04 '19 at 09:29
If this query is in frequent use then you should apply index on both tables on UserID, ItemID and CategoryID, ItemID respectively as there's index scan in your query plan. – Mr. K Jan 04 '19 at 11:44

score 1 · Accepted Answer · answered Jan 05 '19 at 06:22

I found a neat way to solve this by making tableA denormalized and using Postgres foreign keys.

Schema of TableA (> 100mm rows): (userID int, itemID int, categoryID int)
Index - (userID, categoryID)
FK - (itemID, categoryID) references tableB (itemID, categoryID)
update cascade
delete cascade

Schema of TableB (1mm rows): (categoryID int, itemID int)
PK - (itemID, categoryID)

All user-item pairs for a category can now be fetched by doing a select on tableA. The foreign key constraint makes sure that rows in tableA get updated if the categoryID for any item changes in tableB.

select userid, itemid from tableA where userid = x and categoryid = y

Thanks for your suggestions!

score 0 · Answer 3 · answered Jan 04 '19 at 09:18

0

Another approach would be to create array of valid itemID and filter by it. Then you will avoid JOIN operation. It can however be slower, depending on your data.

select * from TableA 
where userID = x
  and itemID = any((select array_agg(/*DISTINCT */itemID)
                      from TableB
                     where categoryID = y)::int4[])

answered Jan 04 '19 at 09:18

Łukasz Kamiński

5,630
1
19
32

This feels like the same as the original _in_ query? The hash join doesn't seem to be the expensive part in the query plan. – ananis Jan 04 '19 at 12:58
@ananis I can't really tell how it feels to you. I imagine you can just run it and see for yourself... – Łukasz Kamiński Jan 04 '19 at 17:50
Thx for your suggestion Lukasz, indeed wrong choice of word by me :). I meant I could't see why would this be any faster than the original query. I tried and it comes out to be slower. – ananis Jan 05 '19 at 06:14
@ananis Yeah, it depends on how many items will be inside array, if JOIN requires sorting, if you have it indexed, if you use partitions. I would still keep it in mind when you get stuck in future, since I have plenty of cases when this is faster. – Łukasz Kamiński Jan 07 '19 at 08:31

Speeding up an expensive join operation - Huge table with a small one

3 Answers3