
While reading the book [The internals of PostgreSQL], I got to chapter 3 (chapter link: https://www.interdb.jp/pg/pgsql03.html),
section 3.2.2. Index Scan,
specifically 3.2.2.1. Start-Up Cost.

The following query is being explained, and the cost of the index scan is estimated here:

SELECT id, data FROM tbl WHERE data < 240;

Before estimating the cost, the numbers of index pages and index tuples, Nindex,page and Nindex,tuple, are obtained:

testdb=# SELECT relpages, reltuples FROM pg_class WHERE relname = 'tbl_data_idx';
 relpages | reltuples 
----------+-----------
       30 |     10000
(1 row)

Nindex,tuple = 10000
Nindex,page = 30

It is mentioned that

The start-up cost of the index scan is the cost to read the index pages to access to the first tuple in the target table, and it is defined by the following equation:

start_up cost’ = {ceil(log2(Nindex,tuple)) + (Hindex + 1) × 50} × cpu_operator_cost

where Hindex is the height of the index tree.

Nindex,tuple is 10000; Hindex is 1; cpu_operator_cost is 0.0025 (by default). Thus,

start_up cost’ = {ceil(log2(10000)) + (1 + 1) × 50} × 0.0025 = 0.285

My question is about that equation: what is that 50? Is it a heuristic number, or does it have a statistical meaning?

Link to the costsize.c source code (which does not mention that 50 at all, from my quick check): https://github.com/postgres/postgres/blob/master/src/backend/optimizer/path/costsize.c#L502-L792

I tested that on a Postgres server to verify the start-up cost estimation and found:

test=# EXPLAIN SELECT id, data FROM tbl WHERE data < 240;
                                QUERY PLAN                                 
---------------------------------------------------------------------------
 Index Scan using tbl_data_idx on tbl  (cost=0.29..13.47 rows=239 width=8)
   Index Cond: (data < 240)
(2 rows)

The start-up cost is 0.29, matching (math.ceil(math.log2(10000)) + (1 + 1) * 50) * 0.0025 = 0.285 (rounded), which is exactly what the book shows.
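
To spell that check out, here is a minimal Python sketch of the book's formula (the function name and default value are mine, for illustration only):

    import math

    def index_scan_startup_cost(n_index_tuples, h_index, cpu_operator_cost=0.0025):
        # Book formula: {ceil(log2(Nindex,tuple)) + (Hindex + 1) * 50} * cpu_operator_cost
        descent_comparisons = math.ceil(math.log2(n_index_tuples))
        page_charge = (h_index + 1) * 50
        return (descent_comparisons + page_charge) * cpu_operator_cost

    print(index_scan_startup_cost(10000, 1))  # ~0.285, shown as 0.29 in the EXPLAIN output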

    Good question. I would guess that 50 is a magic number derived from empirical analysis. But this is just a guess. – The Impaler Mar 14 '23 at 18:54

1 Answer


Use the power of the source, Luke!

That formula is taken straight from the source. See btcostestimate in src/backend/utils/adt/selfuncs.c:

    /*
     * Add a CPU-cost component to represent the costs of initial btree
     * descent.  We don't charge any I/O cost for touching upper btree levels,
     * since they tend to stay in cache, but we still have to do about log2(N)
     * comparisons to descend a btree of N leaf tuples.  We charge one
     * cpu_operator_cost per comparison.
     *
     * If there are ScalarArrayOpExprs, charge this once per SA scan.  The
     * ones after the first one are not startup cost so far as the overall
     * plan is concerned, so add them only to "total" cost.
     */
    if (index->tuples > 1)      /* avoid computing log(0) */
    {
        descentCost = ceil(log(index->tuples) / log(2.0)) * cpu_operator_cost;
        costs.indexStartupCost += descentCost;
        costs.indexTotalCost += costs.num_sa_scans * descentCost;
    }

    /*
     * Even though we're not charging I/O cost for touching upper btree pages,
     * it's still reasonable to charge some CPU cost per page descended
     * through.  Moreover, if we had no such charge at all, bloated indexes
     * would appear to have the same search cost as unbloated ones, at least
     * in cases where only a single leaf page is expected to be visited.  This
     * cost is somewhat arbitrarily set at 50x cpu_operator_cost per page
     * touched.  The number of such pages is btree tree height plus one (ie,
     * we charge for the leaf page too).  As above, charge once per SA scan.
     */
    descentCost = (index->tree_height + 1) * DEFAULT_PAGE_CPU_MULTIPLIER * cpu_operator_cost;
    costs.indexStartupCost += descentCost;

So the idea is that there is some CPU cost associated with each index page descended through (the leaf page included), and that cost is “arbitrarily” pegged at 50 comparison operator executions per page.
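
If it helps to see the two charges side by side, here is a rough Python transcription of those two additions (single index scan, no ScalarArrayOpExprs; the names are mine):

    import math

    def btree_descent_startup_cost(index_tuples, tree_height, cpu_operator_cost=0.0025):
        startup_cost = 0.0
        # ~log2(N) comparisons to descend a btree of N leaf tuples,
        # one cpu_operator_cost per comparison (no I/O charged for upper levels)
        if index_tuples > 1:  # avoid computing log(0)
            startup_cost += math.ceil(math.log(index_tuples) / math.log(2.0)) * cpu_operator_cost
        # per-page CPU charge: 50 x cpu_operator_cost for tree_height + 1 pages
        # (the leaf page is charged too)
        startup_cost += (tree_height + 1) * 50.0 * cpu_operator_cost
        return startup_cost

    print(btree_descent_startup_cost(10000, 1))  # ~0.285 -> cost=0.29 in EXPLAIN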

Laurenz Albe