How was this records/fanout logic derived for the "no statistics" case in MySQL's Query Planner?

Question

In the MySQL Server 5.7 source code, the formula records = (x * (b-a) + a*c-b)/(c-1) is used in the query planner to calculate the number of records when key distribution statistics are not available.

Where is this formula coming from, how was it derived, or why is this specific formula the formula that's being used? Does it have an established theoretical underpinning, and if so, what is its basis?

https://github.com/mysql/mysql-server/blob/5.7/sql/sql_planner.cc#L529

          Assume that the first key part matches 1% of the file
          and that the whole key matches 10 (duplicates) or 1
          (unique) records.
          Assume also that more key matches proportionally more
          records
          This gives the formula:
          records = (x * (b-a) + a*c-b)/(c-1)
          b = records matched by whole key
          a = records matched by first key part (1% of all records?)
          c = number of key parts in key
          x = used key parts (1 <= x <= c)

Solarflare · Accepted Answer · 2017-08-08T17:20:38.247

If you have absolutely no data about your problem, you are forced to do an estimate.

The general form of that formula is explained in the comments:

if we use just one keycolumn (x) of a multicolumn index (with c columns), we get a rows (1% of total rows). So for x=1, the result is a by definition.
if we know the value for every keycolumn of a multicolumn index, we get the number of rows per whole key (b); so for x=c, we get b rows (which is 1 or 10) by definition.
in between (if we use keyvalues for more than 1 key column, but not all), for each additional known keyvalue, we can exclude some additional rows: we have a-b rows that will not belong to the case where we know our full key (which would have b rows), and by definition they shall be excluded proportionally to the ratio of usable keycolumns ((x-1)/(c-1)).
The -1 in (x-1)/(c-1) is just a shift (you could just use different variable names), since we only need to count the additional columns, but c and x is the count including the first column. (In a time series, you would call the parameter for the first column t=0, and the -1 does exactly that).

So in conclusion we get a - (a-b) * (x-1)/(c-1) (a for the first key column minus the rows we proportionally exclude). This is (if you transform that expression a bit) exactly the formula given. A quick sanity check: For x=1 (x-1=0), the second term is 0 and we get a, as defined by the first condition; for x=c, we get a-(a-b)=b as defined by the second condition.

It is not unreasonable to make this ansatz using these assumptions, but you can probably find a different formula that makes as much sense. To argue that it is better would be a harder task though.

Then there is the matter of choosing the values (b=10 and 1% in this case). You can obviously choose any value. For doing this without any reliable data except a gut feeling, there is a concept called the Fermi estimate:

The estimation technique is named after physicist Enrico Fermi as he was known for his ability to make good approximate calculations with little or no actual data.

You basically choose just the order of magnite (1, 1000000, 1/100) for your input parameters, and you get a reasonable order of magnitude for your result.

So how many rows do you expect a non-unique key to cover? It's more than 1, otherwise you would make it a unique key, but is it more like 2, 10 or 100? 10 is probably a good guess (it covers value from about 3 to 30 in that estimate). So although this numbers could have come from a 2 year worldwide survey about key distribution, estimated values in powers of 10 are usually derived in a fashion like that. If you want to be absolutely certain, ask the developer.

And the obligatory xkcd for this kind of topics: What-if? Paint the Earth

“The general form of that formula is explained in the comments” , do you mean the formula "(x * (b-a) + a*c-b)/(c-1)" is the implementation for the comments "Assume that the first key part matches 1% of the file....key matches proportionally more records"? Can you kindly give more detail explanation for relationship between the formula and comments?As I cannot see there two have any relationship. — YuFeng Shen, Aug 08 '17 at 14:29
@Jason Yes, I meant that comment. I added an explanation for the formula in the answer. — Solarflare, Aug 08 '17 at 17:17
Thank you so much ,I need some time to study from your answer and response to you if I have any questions. — YuFeng Shen, Aug 09 '17 at 15:43

How was this records/fanout logic derived for the "no statistics" case in MySQL's Query Planner?

1 Answers1

Linked