I have 1 million buckets of balls. Every ball has a number of attributes (and not all attributes make sense for every ball – e.g. a golf ball cannot be inflated/deflated).
Users look for buckets and balls based on specific ball requirements. For example, a user may ask: which buckets contain a golf ball, a red ball, and a red soccer ball? In this example, a bucket with a golf ball and a single red soccer ball but no other red balls does not satisfy the query: we need at least one red ball in addition to the red soccer ball (two red soccer balls would do it). In other words, one ball can't simultaneously satisfy multiple parts of the query. So the query would look something like this:
[{
    ball_type: "golf"
}, {
    color: "red"
}, {
    color: "red",
    ball_type: "soccer"
}]
The application returns both the bucket ids and the ball ids that satisfy the requirements within each bucket, preferably in a way that associates each matching ball id with the part of the query it matched. For example, a possible return would look like this:
[{
    bucket_id: 1,
    ball_ids: [{
        query: "golf ball",
        ball_id: 1
    }, {
        query: "red ball",
        ball_id: 6
    }, {
        query: "red soccer ball",
        ball_id: 14
    }]
},
... other matches ...
]
But this would do just as well:
{
    bucket_id: 1,
    ball_ids: {
        query0_ball_id: 1,
        query1_ball_id: 6,
        query2_ball_id: 14
    }
}
To be clear, I have tried a number of solutions to this problem and have been benchmarking different options. I’m asking this question because I’m hoping that there are more solutions or optimizations that I’m not aware of (I have already indexed the table columns but I’m open to ideas for better indices).
My current solution is to use a table that looks like this:
ball_id | ball_type | color | weight | is_inflated | bucket_id
--------+-----------+-------+--------+-------------+----------
      0 | cricket   | red   | 157    | NULL        | 1
      1 | golf      | green | 45.93  | NULL        | 1
[etc.]
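For concreteness, here is a sketch of the schema with the kind of per-column b-tree indexes I currently have (simplified; the exact types and index names are illustrative, not my literal definitions):
CREATE TABLE balls (
    ball_id     bigint PRIMARY KEY,
    bucket_id   bigint NOT NULL,
    ball_type   text,
    color       text,
    weight      numeric,
    is_inflated boolean  -- NULL where inflation makes no sense (e.g. golf)
);

-- roughly one b-tree index per searchable column
CREATE INDEX balls_bucket_id_idx ON balls (bucket_id);
CREATE INDEX balls_ball_type_idx ON balls (ball_type);
CREATE INDEX balls_color_idx     ON balls (color);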
Queries then rely on self inner joins (FROM balls AS b0 INNER JOIN balls AS b1) where b0.bucket_id = b1.bucket_id. Something like:
SELECT
    b0.bucket_id AS bucket_id,
    b0.ball_id AS ball_0,
    b1.ball_id AS ball_1,
    b2.ball_id AS ball_2
FROM balls AS b0
INNER JOIN balls AS b1
    ON b0.bucket_id = b1.bucket_id
INNER JOIN balls AS b2
    ON b0.bucket_id = b2.bucket_id
WHERE
    -- each ball must be unique
    b0.ball_id != b1.ball_id
    AND b0.ball_id != b2.ball_id
    AND b1.ball_id != b2.ball_id
    -- match balls to requirements
    AND b0.ball_type = 'golf'
    AND b1.color = 'red'
    AND b2.color = 'red' AND b2.ball_type = 'soccer'
This is pretty performant when I’m only looking for buckets that contain two kinds of balls. It’s also fine when I look for multiple balls that are all rare overall.
The problem is that when I search for three balls that each occur about 10,000 times, the joins become incredibly expensive even if the result is only 30 buckets. Queries that I want running under 100 ms take more than 5,000 ms, and adding just one more common ball to the query multiplies the problem. Note: the most expensive part of my (Postgres) query plan appears to be a “Parallel Bitmap Heap Scan”, the meaning of which I do not yet adequately understand.
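For reference, I’m reading that plan from profiling the query with EXPLAIN ANALYZE, along these lines (the BUFFERS option is optional but shows I/O):
EXPLAIN (ANALYZE, BUFFERS)
SELECT ...  -- the full three-way self-join from above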
How can I optimize my current solution? Or is there something else I should switch to that is designed for this kind of problem?
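(To give a concrete example of the kind of answer I’m hoping for: would a multicolumn index along these lines change the plan, and does the column order matter here? This is a guess on my part, not something I’ve verified helps:)
CREATE INDEX balls_type_color_bucket_idx ON balls (ball_type, color, bucket_id);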
I’m open to alternative data structures, but I think that would make this question too open-ended. I want to stick with my Postgres db and the structure described above, unless it’s a terrible idea for some reason, because: (1) it’s conceptually easier for other devs to enter the project, and (2) it requires less time from me to learn the underlying tech (this stuff is only a hobby).