Why does multi-columns indexing in SQLite slow down the query's performance, unless indexing all columns?

Question

I am trying to optimize the performance of a simple query to a SQLite database by using indexing. As an example, the table has 5M rows, 5 columns; the SELECT statement is to pick up all columns and the WHERE statement checks for only 2 columns. However, unless I have all columns in the multi-column index, the performance of the query is worse than without any index.

Did I index the column incorrectly, or when selecting all columns, am I supposed to include all of them in the index in order to improve performance?

Below each case # is the result I got when creating the SQLite database in hard-disk. However, for some reason using the ':memory:' mode made all the indexing cases faster than without index.

import sqlite3
import datetime
import pandas as pd
import numpy as np
import os
import time

# Simulate the data
size = 5000000
apps = [f'{i:010}' for i in range(size)]
dates = np.random.choice(pd.date_range('2016-01-01', '2019-01-01').to_pydatetime().tolist(), size)
prod_cd = np.random.choice([f'PROD_{i}' for i in range(30)], size)
models = np.random.choice([f'MODEL{i}' for i in range(15)], size)
categories = np.random.choice([f'GROUP{i}' for i in range(10)], size)

# create a db in memory
conn = sqlite3.connect(':memory:', detect_types=sqlite3.PARSE_DECLTYPES)
c = conn.cursor()
# Create table and insert data
c.execute("DROP TABLE IF EXISTS experiment")
c.execute("CREATE TABLE experiment (appId TEXT, dtenter TIMESTAMP, prod_cd TEXT, model TEXT, category TEXT)")
c.executemany("INSERT INTO experiment VALUES (?, ?, ?, ?, ?)", zip(apps, dates, prod_cd, models, categories))

# helper functions
def time_it(func):
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        print("time for {} function is {}".format(func.__name__, time.time() - start))
        return result
    return wrapper

@time_it
def read_db(query):
    df = pd.read_sql_query(query, conn)
    return df

@time_it
def run_query(query):
    output = c.execute(query).fetchall()
    print(output)

# The main query
query = "SELECT * FROM experiment WHERE prod_cd IN ('PROD_1', 'PROD_5', 'PROD_10') AND dtenter >= '2018-01-01'"

# CASE #1: WITHOUT ANY INDEX
run_query("EXPLAIN QUERY PLAN " + query)
df = read_db(query)
>>> time for read_db function is 2.4783718585968018

# CASE #2: WITH INDEX FOR COLUMNS IN WHERE STATEMENT
run_query("DROP INDEX IF EXISTs idx")
run_query("CREATE INDEX idx ON experiment(prod_cd, dtenter)")
run_query("EXPLAIN QUERY PLAN " + query)
df = read_db(query)
>>> time for read_db function is 3.221407890319824
# CASE #3: WITH INDEX FOR MORE THEN WHAT IN WHERE STATEMENT, BUT NOT ALL COLUMNS 
run_query("DROP INDEX IF EXISTs idx")
run_query("CREATE INDEX idx ON experiment(prod_cd, dtenter, appId, category)")
run_query("EXPLAIN QUERY PLAN " + query)
df = read_db(query)
>>>time for read_db function is 3.176532745361328

# CASE #4: WITH INDEX FOR ALL COLUMNS 
run_query("DROP INDEX IF EXISTs idx")
run_query("CREATE INDEX idx ON experiment(prod_cd, dtenter, appId, category, model)")
run_query("EXPLAIN QUERY PLAN " + query)
df = read_db(query)
>>> time for read_db function is 0.8257918357849121

score 3 · Answer 1 · answered Aug 05 '19 at 06:30

The SQLite Query Optimizer Overview says:

When doing an indexed lookup of a row, the usual procedure is to do a binary search on the index to find the index entry, then extract the rowid from the index and use that rowid to do a binary search on the original table. Thus a typical indexed lookup involves two binary searches.

Index entries are not in the same order as the table entries, so if a query returns data from most of the table's pages, all those random-access lookups are slower than just scanning all table rows.

Index lookups are more efficient than a table scan only if your WHERE condition filters out much more rows than are returned.

SQLite assumes that lookups on indexed columns have a high selectivity. You can get better estimates by running ANALYZE after filling the table.
But if all your queries are in a form where an index does not help, it wold be a better idea to not use an index at all.

When you create an index over all columns used in the query, the additional table accesses are no longer necessary:

If, however, all columns that were to be fetched from the table are already available in the index itself, SQLite will use the values contained in the index and will never look up the original table row. This saves one binary search for each row and can make many queries run twice as fast.

When an index contains all of the data needed for a query and when the original table never needs to be consulted, we call that index a "covering index".

Thank you for explaining! It became clearer now. But: 1. When you mentioned index lookups are more efficient when used to filter out much more rows than are returned, is there any approximated percent for this? My example above filters out >95% rows already. 2. The full table has 5M rows, so I would imagine 2 binary searches using Log(n) will still be much faster than a full scan of 5M rows. Is my understanding correct? — Hieu Tran, Aug 05 '19 at 12:09
The exact threshold depends on the data, the software, and the hardware. When doing a table scane, the individual pages are likely to be stored consecutively on disk. — CL., Aug 05 '19 at 14:59

Why does multi-columns indexing in SQLite slow down the query's performance, unless indexing all columns?

1 Answers1