If you have absolutely no data about your problem, you are forced to make an estimate.
The general form of that formula is explained in the comments:
- if we use just one key column (`x=1`) of a multicolumn index (with `c` columns), we get `a` rows (1% of the total rows); so for `x=1`, the result is `a` by definition.
- if we know the value for every key column of a multicolumn index, we get the number of rows per whole key (`b`); so for `x=c`, we get `b` rows (which is 1 or 10) by definition.
- in between (if we use key values for more than one key column, but not all of them), each additional known key value lets us exclude some additional rows: we have `a-b` rows that do not belong to the case where we know the full key (which would leave `b` rows), and by definition they are excluded proportionally to the ratio of usable key columns, `(x-1)/(c-1)`.
- the `-1` in `(x-1)/(c-1)` is just a shift (you could use different variable names), since we only need to count the additional columns, while `c` and `x` count all columns including the first one. (In a time series you would call the parameter for the first column `t=0`; the `-1` does exactly that.)
So in conclusion we get `a - (a-b) * (x-1)/(c-1)` (`a` for the first key column, minus the rows we proportionally exclude). This is (if you transform that expression a bit) exactly the formula given. A quick sanity check: for `x=1` (so `x-1=0`), the second term is 0 and we get `a`, as defined by the first condition; for `x=c`, we get `a-(a-b)=b`, as defined by the second condition.
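To make the interpolation concrete, here is a minimal sketch in Python. The function name, the `total_rows` parameter and the example table size are illustrative assumptions for this answer, not the actual optimizer code; only the 1% and `b=10` defaults come from the discussion above.

```python
def estimated_rows(x, c, total_rows, b=10):
    """Interpolate between the one-column estimate (a = 1% of all rows)
    and the full-key estimate (b rows), as described above."""
    a = total_rows * 0.01              # estimate when only the first key column is usable
    if c == 1:
        return b                       # single-column index: the whole key is known
    return a - (a - b) * (x - 1) / (c - 1)

# Sanity check against the two boundary conditions (for a 100,000-row table):
print(estimated_rows(1, 4, total_rows=100_000))  # 1000.0 -> a, first condition
print(estimated_rows(4, 4, total_rows=100_000))  # 10.0   -> b, second condition
print(estimated_rows(2, 4, total_rows=100_000))  # 670.0  -> in between
```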
It is not unreasonable to make this ansatz under these assumptions, but you could probably find a different formula that makes just as much sense. Arguing that it is actually better would be a harder task, though.
Then there is the matter of choosing the values (`b=10` and `1%` in this case). You can obviously choose any value. For doing this without any reliable data except a gut feeling, there is a concept called the Fermi estimate:
> The estimation technique is named after physicist Enrico Fermi as he was known for his ability to make good approximate calculations with little or no actual data.
You basically choose just the order of magnitude (1, 1000000, 1/100) for your input parameters, and you get a reasonable order of magnitude for your result.
So how many rows do you expect a non-unique key to cover? It is more than 1, otherwise you would have made it a unique key, but is it more like 2, 10 or 100? 10 is probably a good guess (it covers values from about 3 to 30 in that estimate). So although these numbers could have come from a 2-year worldwide survey about key distributions, estimated values in powers of 10 are usually derived in a fashion like that. If you want to be absolutely certain, ask the developer.
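For illustration, here is a tiny sketch of that rounding on a log scale; the "about 3 to 30" boundaries are 10^0.5 ≈ 3.16 and 10^1.5 ≈ 31.6. The helper function is made up for this example, not something the optimizer does.

```python
import math

def nearest_power_of_ten(value):
    """Round a positive value to the nearest power of ten on a log scale,
    which is roughly what a Fermi estimate does with its inputs."""
    return 10 ** round(math.log10(value))

# Everything between 10**0.5 (~3.16) and 10**1.5 (~31.6) ends up as 10:
for v in (2, 4, 10, 30, 40):
    print(v, "->", nearest_power_of_ten(v))   # 2 -> 1, 4/10/30 -> 10, 40 -> 100
```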
And the obligatory xkcd for this kind of topic: What-if? Paint the Earth