I'm using Postgres to store a large number of transactions and am trying to keep the read time for one specific SELECT statement in the tens of milliseconds.
Schema of TableA (> 100 million rows): (userID int, itemID int), indexed on userID.
Schema of TableB (1 million rows): (categoryID int, itemID int), indexed on categoryID. There are 500 categories, and each itemID belongs to exactly one category.
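For reference, the tables look roughly like this (simplified DDL; the index names are just illustrative):

create table TableA (
    userID int,
    itemID int
);
create index tableA_userID_idx on TableA (userID);

create table TableB (
    categoryID int,
    itemID int
);
create index tableB_categoryID_idx on TableB (categoryID);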
The query I want to optimize, which currently takes ~100 ms to execute, is:
select * from TableA
where userID = x and itemID in
    (select itemID from TableB
     where categoryID = y)
A simple way to solve this would be to create a denormalised table with userID, itemID, and categoryID as columns, indexed on (userID, categoryID). However, the categoryID -> itemID mapping can change, so I wanted to avoid having to scan the full table and update rows each time that happens.
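To make that concrete, this is roughly the denormalised layout I have in mind (table, index, and parameter names are illustrative):

create table TableA_denorm (
    userID int,
    itemID int,
    categoryID int
);
create index tableA_denorm_user_cat_idx on TableA_denorm (userID, categoryID);

-- The original query then collapses to a single index lookup:
select * from TableA_denorm
where userID = x and categoryID = y;

-- But whenever an item moves to a different category, every copy of that
-- itemID in the large table has to be rewritten, e.g.:
update TableA_denorm
set categoryID = :newCategoryID
where itemID = :movedItemID;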
Are there any other techniques or indexing methods that could speed up this join? Alternative ways to arrange the data would also be appreciated. Thanks!
Edit: Adding a sample query plan.
Hash Semi Join  (cost=159.50..382.67 rows=164 width=50)
  Hash Cond: (tableA.itemId = tableB.itemId)
  ->  Index Scan using userId on tableA  (cost=0.57..208.31 rows=5185 width=50)
        Index Cond: (userId = 4000)
  ->  Hash  (cost=117.05..117.05 rows=3350 width=4)
        Buckets: 4096  Batches: 1  Memory Usage: 161kB
        ->  Index Scan using categoryId on tableB  (cost=0.42..117.05 rows=3350 width=4)
              Index Cond: (categoryId = 1002)
Planning time: 0.149 ms