How does a B+tree handle a combination of AND, OR, IN and equals?

Question

How do these 4 types of queries take advantage of indexes? What does the scan look like?

WHERE status = "foo"

WHERE id IN (1, 2, 3)

WHERE id IN (1, 2, 3) AND status = "foo"

WHERE id IN (1, 2, 3) OR status = "foo"

In the first case, I think this is a B+tree with the key being the status. Easy enough. But wait, it needs to store multiple items per status, so maybe it has an array (generally speaking) of the records for each status.

But for the second query, it seems you would just have the index be for id and just fetch from the B+tree each key one id at a time, so it would do tree.get(id) for each id. But that is already seeming less than ideal. How is it actually done?

Then take it further and combine the two query types: you can only use one of the indexes now (say the id index, not the status index). You then get the subset of records matching these IDs, iterate through them, and check the status.

Now we are starting to seem really inefficient.

Same with the OR query.

How are these typically implemented in a database, generally or ideally speaking?

I am asking because I would like to implement a basic version of this in JavaScript for the browser. Basically, I want to know the best way to have multiple (potentially multi-column) indexes on a table: I store a record in this "table", it gets stored in every index, and then a query fetches from the "best" index. I am not really sure how this works at a high level (high level yet very deep in terms of data-structure/algorithm implementation), so I don't know where to start.

This is the template I am basically starting with:

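class Index {
  constructor(fields = ['id']) {
    this.fields = fields
    // Tree is the B+tree implementation mentioned below (a key/value store).
    this.tree = new Tree()
  }

  insert(record) {
    // Store the record itself under its key (the original passed an undefined `block`).
    this.tree.insert(this.getKey(record), record)
  }

  remove(record) {
    this.tree.remove(this.getKey(record))
  }

  check(record) {
    return this.tree.check(this.getKey(record))
  }

  getKey(record) {
    // Caution: joining with '' is ambiguous, e.g. ['ab', 'c'] and ['a', 'bc']
    // both produce 'abc'.
    return this.fields.map(field => record[field]).join('')
  }
}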

class Table {
  constructor() {
    this.index = []
  }

  insert(record) {
    this.index.forEach(index => index.insert(record))
  }

  select(query) {
    // query processing
  }

  remove(id) {
    
  }
}

So basically, for each table you create several indexes. When you insert a record, it gets the key for each index and inserts it into a Tree (the B+tree that acts like a key/value store). From there I don't know how to properly use the indexes, or if I'm even on the right track. I would ask how an ideal relational database would implement this, but that would likely get downvoted as being too general :/ but that's what I'm actually trying to build.

I have this B+tree as an example to work with.

Honestly, it might be more educational for you if you just run `EXPLAIN` on all 4 queries. Generally, when you have `OR` logic in the `WHERE` clause, then each condition might require a separate walk down the B-tree. Two conditions ANDed, on the other hand, can be covered by a single index. But `WHERE id IN (1, 2, 3)` is the same as saying `WHERE id = 1 OR id = 2 OR id = 3`. — Tim Biegeleisen, Apr 30 '21 at 04:19
Database engines will do a "quick" analysis of the estimated cost of doing the query, often based on statistics it maintains about the involved tables. So it might be that for `IN (1, 2, 3)` it will do three searches in the index, while for `IN (1, 2, 3, 4, 5)` it might decide to run a table scan and not use the index at all. — trincot, Apr 30 '21 at 08:03
Also realise that for the second query, there could be the use of a single index on `id` and `status`. — trincot, Apr 30 '21 at 08:17

Matt Timmermans · Accepted Answer · 2021-04-30T13:44:04.120

You don't seem to be restricted in the indexes you can have, so let's assume you have an index on (id) and an index on (status, id). I'm also going to assume that id is a primary key or has a uniqueness constraint, as IDs usually do:

WHERE status = "foo"

The range of items that match the status is efficiently read out of the (status,id) index.

WHERE id IN (1, 2, 3)

Assuming id is an integral type, the range of items with id >= 1 and id <= 3 is read out of the (id) index. The index is ordered, so finding a range of consecutive values is no more difficult than finding a single value.

WHERE id IN (1, 2, 3) AND status = "foo"

This matches a consecutive range in the (status, id) index.

WHERE id IN (1, 2, 3) OR status = "foo"

The (1,2,3) range is selected from the (id) index and the "foo" range is selected from the (status, id) index. The results are then merged. Since both ranges have distinct rows in the same order, they can be merged efficiently like the merge operation in merge sort.
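
The merge step can be sketched roughly as follows. This is only an illustration, assuming both inputs are arrays of records already sorted by a unique `id`; `mergeUnion` is a hypothetical helper name, not part of any library.

```javascript
// Hypothetical helper: union of two record lists, each sorted by unique `id`.
// Works like the merge step of merge sort, but emits a row only once
// when it appears in both inputs.
function mergeUnion(a, b) {
  const out = []
  let i = 0
  let j = 0
  while (i < a.length && j < b.length) {
    if (a[i].id < b[j].id) {
      out.push(a[i++])
    } else if (a[i].id > b[j].id) {
      out.push(b[j++])
    } else {
      // Same row matched both conditions: keep one copy, advance both.
      out.push(a[i++])
      j++
    }
  }
  // At most one of these loops runs; it copies the leftover tail.
  while (i < a.length) out.push(a[i++])
  while (j < b.length) out.push(b[j++])
  return out
}

// Rows matching id IN (1, 2, 3), as read from the (id) index.
const byId = [
  { id: 1, status: 'bar' },
  { id: 2, status: 'foo' },
  { id: 3, status: 'bar' },
]
// Rows matching status = "foo", as read from the (status, id) index.
const byStatus = [
  { id: 2, status: 'foo' },
  { id: 7, status: 'foo' },
]

const merged = mergeUnion(byId, byStatus)
console.log(merged.map(r => r.id)) // [ 1, 2, 3, 7 ]
```

Because neither input has to be buffered or sorted again, the same loop could consume two index iterators lazily instead of two arrays.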

If you want to be able to do the same sorts of things with your own index class, you need to support indexes on multiple columns, and you need to be able to get an iterator for the rows in the index, starting at a given key.
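
As a rough illustration of those two requirements, here is a minimal sketch. It assumes a sorted array with binary search as a stand-in for the B+tree (a real tree would expose the same ordered iteration over its leaves); `SortedIndex`, `compareKeys`, `iterateFrom` and `scanPrefix` are hypothetical names chosen for this example.

```javascript
// Compare two composite keys (arrays) column by column.
// A shorter key that is a prefix of a longer one sorts first.
function compareKeys(a, b) {
  const n = Math.min(a.length, b.length)
  for (let i = 0; i < n; i++) {
    if (a[i] < b[i]) return -1
    if (a[i] > b[i]) return 1
  }
  return a.length - b.length
}

// Hypothetical stand-in for a B+tree: a sorted array of { key, record } entries.
class SortedIndex {
  constructor(fields) {
    this.fields = fields
    this.entries = []
  }

  getKey(record) {
    return this.fields.map(field => record[field])
  }

  // Binary search for the first position whose key is >= the given key.
  lowerBound(key) {
    let lo = 0
    let hi = this.entries.length
    while (lo < hi) {
      const mid = (lo + hi) >> 1
      if (compareKeys(this.entries[mid].key, key) < 0) lo = mid + 1
      else hi = mid
    }
    return lo
  }

  insert(record) {
    const key = this.getKey(record)
    this.entries.splice(this.lowerBound(key), 0, { key, record })
  }

  // Iterate records in key order, starting at the first key >= `start`.
  *iterateFrom(start) {
    for (let i = this.lowerBound(start); i < this.entries.length; i++) {
      yield this.entries[i].record
    }
  }

  // Yield every record whose key begins with `prefix`, e.g. ['foo'].
  *scanPrefix(prefix) {
    for (const record of this.iterateFrom(prefix)) {
      const head = this.getKey(record).slice(0, prefix.length)
      if (compareKeys(head, prefix) !== 0) return
      yield record
    }
  }
}

const statusId = new SortedIndex(['status', 'id'])
const rows = [
  { id: 5, status: 'foo' },
  { id: 2, status: 'bar' },
  { id: 3, status: 'foo' },
  { id: 1, status: 'foo' },
]
rows.forEach(r => statusId.insert(r))

// WHERE status = "foo": one contiguous range of the (status, id) index.
const fooIds = [...statusId.scanPrefix(['foo'])].map(r => r.id)
console.log(fooIds) // [ 1, 3, 5 ]

// WHERE id IN (1, 2, 3) AND status = "foo": start at ('foo', 1)
// and stop once the key passes ('foo', 3).
const hits = []
for (const r of statusId.iterateFrom(['foo', 1])) {
  if (r.status !== 'foo' || r.id > 3) break
  hits.push(r.id)
}
console.log(hits) // [ 1, 3 ]
```

The second query treats (1, 2, 3) as the range 1..3, which relies on the values being consecutive integers as assumed above; for arbitrary IN lists you would call `iterateFrom` once per value instead.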

Ah for some reason I was missing the last part when trying to imagine. Thanks! — Lance, Apr 30 '21 at 21:48

score 0 · Answer 2 · answered Dec 24 '21 at 17:44

I'll address this specifically for MySQL/MariaDB. The specifics may vary with other vendors. I have changed away from "1,2,3" to avoid the temptation to assume the values are consecutive. I am also changing away from "id" because id is the PRIMARY KEY.

MySQL will use a B+Tree.

WHERE status = "foo"
    INDEX(status)       -- best
    INDEX(status, ...)  -- nearly as good
    If a large fraction of the rows have "foo", the optimizer will likely skip the index and scan the table instead!

WHERE bar IN (123, 456, 789)
    INDEX(bar)  -- It will do multiple BTree lookups.

WHERE bar IN (123, 456, 789) AND status = "foo"
    INDEX(status, bar)   -- In this order

WHERE bar IN (123, 456, 789) OR status = "foo"
    No index is likely to be beneficial; it will do a table scan.
    It would probably run faster to use two SELECTs and a UNION

If you need to do all 4 queries, then I recommend having these two indexes:

    INDEX(status, bar)  -- helps 1st and 3rd
    INDEX(bar)          -- helps 2nd

Think of concatenating the columns, then using that as a single key into the BTree. (This will keep you from getting distracted by "cardinality" or "selectivity" of the individual columns.)
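
To make the "concatenated key" idea concrete for the JavaScript version in the question, here is a small sketch. It is not how MySQL stores keys internally; it assumes the key is kept as a tuple and compared column by column, which behaves like concatenation without the ambiguity of a plain string join. `compareTuples` and `find` are hypothetical helper names.

```javascript
// Joining columns into one string, as getKey() with join('') does, is ambiguous:
const naive = cols => cols.join('')
console.log(naive(['ab', 'c']) === naive(['a', 'bc'])) // true

// Hypothetical alternative: keep the key as a tuple and compare column by column.
function compareTuples(a, b) {
  for (let i = 0; i < Math.min(a.length, b.length); i++) {
    if (a[i] < b[i]) return -1
    if (a[i] > b[i]) return 1
  }
  return a.length - b.length
}

// Keys of INDEX(status, bar), kept sorted the way B-tree leaves would be.
const keys = [
  ['foo', 789], ['bar', 456], ['foo', 123], ['baz', 123], ['foo', 500],
].sort(compareTuples)

// Binary search for an exact key; returns its position or -1.
function find(sortedKeys, key) {
  let lo = 0
  let hi = sortedKeys.length - 1
  while (lo <= hi) {
    const mid = (lo + hi) >> 1
    const c = compareTuples(sortedKeys[mid], key)
    if (c === 0) return mid
    if (c < 0) lo = mid + 1
    else hi = mid - 1
  }
  return -1
}

// WHERE bar IN (123, 456, 789) AND status = "foo":
// one lookup per IN value, each on the full (status, bar) key.
const found = [123, 456, 789]
  .map(bar => find(keys, ['foo', bar]))
  .filter(pos => pos !== -1)
  .map(pos => keys[pos])
console.log(found) // [ [ 'foo', 123 ], [ 'foo', 789 ] ]
```

Each IN value becomes its own lookup on the full key, which is why the (status, bar) column order suits this query: the equality column fixes the prefix and the IN column picks points within it.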

This does not get into "clustering" and "index merge" and many other topics.
