I have 1 million buckets of balls. Every ball has a number of attributes (and not all attributes make sense for every ball – e.g. a golf ball cannot be inflated/deflated).
Users look for buckets and balls based on specific ball requirements. For example, a user may ask: which buckets contain a golf ball, a red ball, and a red soccer ball? In this example, a bucket with a golf ball and a single red soccer ball but no other red balls does not satisfy the query: we need at least one red ball in addition to the red soccer ball (two red soccer balls would do it). In other words, one ball can't simultaneously satisfy multiple parts of the query. So the query would look something like this:
[{
    ball_type: "golf"
}, {
    color: "red"
}, {
    color: "red",
    ball_type: "soccer"
}]
The application returns both the bucket ids and the ball ids that satisfy the requirements within each bucket, preferably in a way that associates each matching ball id with the part of the query it matched. For example, a possible return would look like this:
[{
    bucket_id: 1,
    ball_ids: [{
        query: "golf ball",
        ball_id: 1
    }, {
        query: "red ball",
        ball_id: 6
    }, {
        query: "red soccer ball",
        ball_id: 14
    }]
},
... other matches ...
]
But this would do just as well:
{
    bucket_id: 1,
    ball_ids: {
        query0_ball_id: 1,
        query1_ball_id: 6,
        query2_ball_id: 14
    }
}
To be clear, I have tried a number of solutions to this problem and have been benchmarking different options. I’m asking this question because I’m hoping that there are more solutions or optimizations that I’m not aware of (I have already indexed the table columns but I’m open to ideas for better indices).
My current solution is to use a table that looks like this:
ball_id | ball_type | color | weight | is_inflated | bucket_id
--------+-----------+-------+--------+-------------+----------
      0 | cricket   | red   | 157    | NULL        | 1
      1 | golf      | green | 45.93  | NULL        | 1
[etc.]
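For concreteness, here is a sketch of the schema with the kind of per-column b-tree indexes I currently have (simplified; the exact types and index names are illustrative, not my literal definitions):
CREATE TABLE balls (
    ball_id     bigint PRIMARY KEY,
    bucket_id   bigint NOT NULL,
    ball_type   text,
    color       text,
    weight      numeric,
    is_inflated boolean  -- NULL where inflation makes no sense (e.g. golf)
);

-- roughly one b-tree index per searchable column
CREATE INDEX balls_bucket_id_idx ON balls (bucket_id);
CREATE INDEX balls_ball_type_idx ON balls (ball_type);
CREATE INDEX balls_color_idx     ON balls (color);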
Queries then rely on self inner joins (FROM balls AS b0 INNER JOIN balls AS b1) where b0.bucket_id = b1.bucket_id. Something like:
SELECT
    b0.bucket_id AS bucket_id,
    b0.ball_id AS ball_0,
    b1.ball_id AS ball_1,
    b2.ball_id AS ball_2
FROM balls AS b0
INNER JOIN balls AS b1
    ON b0.bucket_id = b1.bucket_id
INNER JOIN balls AS b2
    ON b0.bucket_id = b2.bucket_id
WHERE
    -- each ball must be unique
    b0.ball_id != b1.ball_id
    AND b0.ball_id != b2.ball_id
    AND b1.ball_id != b2.ball_id
    -- match balls to requirements
    AND b0.ball_type = 'golf'
    AND b1.color = 'red'
    AND b2.color = 'red' AND b2.ball_type = 'soccer'
This is pretty performant when I’m only looking for buckets that contain two kinds of balls. It’s also fine when I look for multiple balls that are all rare overall.
The problem is that when I search for three balls that each occur about 10,000 times, the joins become incredibly expensive even if the result is only 30 buckets. Queries that I want running under 100 ms take more than 5,000 ms, and adding just one more common ball to the query multiplies the problem. Note: the most expensive part of my (Postgres) query plan appears to be a “Parallel Bitmap Heap Scan”, the meaning of which I do not yet adequately understand.
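For reference, I’m reading that plan from profiling the query with EXPLAIN ANALYZE, along these lines (the BUFFERS option is optional but shows I/O):
EXPLAIN (ANALYZE, BUFFERS)
SELECT ...  -- the full three-way self-join from above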
How can I optimize my current solution? Or is there something else I should switch to that is designed for this kind of problem?
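(To give a concrete example of the kind of answer I’m hoping for: would a multicolumn index along these lines change the plan, and does the column order matter here? This is a guess on my part, not something I’ve verified helps:)
CREATE INDEX balls_type_color_bucket_idx ON balls (ball_type, color, bucket_id);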
I’m open to alternative data structures, but I think that would make this question too open-ended. I want to stick with my Postgres db and the structure described above, unless it’s a terrible idea for some reason, because: (1) it’s conceptually easier for other devs to enter the project, and (2) it requires less time from me to learn the underlying tech (this stuff is only a hobby).