5

Context

I have an application that selects a weighted random entry from a table for which prefix summation (of weights) is a crucial part. The simplified table definition looks like this:

CREATE TABLE entries (
    id INT NOT NULL PRIMARY KEY AUTO_INCREMENT,
    weight DECIMAL(9, 3),
    fenwick DECIMAL(9, 3)
) ENGINE=MEMORY;

where `fenwick` stores the values of the Fenwick tree representation of `weight`.
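For reference, each Fenwick cell covers a half-open range of ids: fenwick[i] holds the sum of weight over (i - lowbit(i), i], where lowbit(i) = i & -i. A quick sanity check of a single cell (a sketch only; @i := 12 is an arbitrary example id):

SET @i := 12;
SELECT f.fenwick AS stored_value,
       (SELECT SUM(w.weight)
          FROM entries w
         WHERE w.id > @i - (@i & -@i)   -- exclusive start of the cell's range
           AND w.id <= @i) AS recomputed
  FROM entries f
 WHERE f.id = @i;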

Let the "range" of each entry spans between its prefix sum and its prefix sum + its weight. The application must generate a random number @r between 0 and SUM(weight) and finds the entry whose range encompasses @r, like this:

[Diagram: weighted random entry selection]

The Fenwick tree, combined with the MEMORY engine and a binary search, should allow me to find the appropriate entry in O(lg^2(n)) time, as opposed to O(n) time with the naive query:

SELECT a.id-1 FROM (SELECT *, (@x:=@x+weight) AS counter FROM entries 
    CROSS JOIN (SELECT @x:=0) a
    HAVING counter>@r LIMIT 1) a;

Research

I have been trying to condense the prefix sum operation into a single query (as opposed to the several array accesses you would see in a scripting language) due to the overhead of multiple queries. In the process, I realized that the traditional method of summation, which accesses elements in descending key order, would only sum the first element. This made me suspect that MySQL scans the table linearly when variables are present in the WHERE clause. Here's the query:

SELECT
SUM(1) INTO @garbage
FROM entries 
CROSS JOIN (
    SELECT @sum:=0,
        @n:=@entryid
) a
WHERE id=@n AND @n>0 AND (@n:=@n-(@n&(-@n))) AND (@sum:=@sum+entries.fenwick);
/*SELECT @sum*/

where @entryid is the ID of the entry whose prefix sum we are computing. I did manage to create a query that works (alongside a function lft that returns the leftmost set bit of an integer):

SET @n:=lft(@entryid);
SET @sum:=0;
SELECT
    SUM(1) INTO @garbage
    FROM entries
    WHERE id=@n 
      AND @n<=@entryid 
      AND (@n:=@n+lft(@entryid^@n)) 
      AND (@sum:=@sum+entries.fenwick);
/*SELECT @sum*/

but it only confirmed my suspicion of a linear search. So does the EXPLAIN output:

+------+-------------+---------+------+---------------+------+---------+------+--------+-------------+
| id   | select_type | table   | type | possible_keys | key  | key_len | ref  | rows   | Extra       |
+------+-------------+---------+------+---------------+------+---------+------+--------+-------------+
|    1 | SIMPLE      | entries | ALL  | NULL          | NULL | NULL    | NULL | 752544 | Using where |
+------+-------------+---------+------+---------------+------+---------+------+--------+-------------+
1 row in set (0.00 sec)

The indexes:

SHOW INDEXES FROM entries;
+---------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table   | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+---------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| entries |          0 | PRIMARY  |            1 | id          | NULL      |       752544 |     NULL | NULL   |      | HASH       |         |               |
+---------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
1 row in set (0.00 sec)

Now, I have seen many a question asking how to eliminate variables from the WHERE clause so that the optimizer can work on the query. However, I can't see how this query could do without id=@n. I've contemplated putting the key values of the entries I want to sum into a table and using joins, but I expect undesirable effects: either a plethora of tables, or a linear search from evaluating against @entryid anyway.

Question

Is there any way to force MySQL to use the indices for this query? I would even try a different DBMS if it offers this functionality.

concat
  • Looked at https://dev.mysql.com/doc/refman/5.1/en/index-hints.html already? – Norbert Apr 06 '15 at 17:17
  • @NorbertvanNobelen Yes I have; unfortunately even `FORCE INDEX (PRIMARY)` has no effect. – concat Apr 06 '15 at 17:50
  • Can you add the show index from table entries? – Norbert Apr 06 '15 at 22:13
  • @variables seem to disable optimizations in some (or all?) situations. What version are you using? I tried something like your first SELECT against 5.6.12 and it seemed to "do the right thing". – Rick James Apr 07 '15 at 00:33
  • Without an ORDER BY, MySQL will not guarantee that the rows are scanned in any particular order, especially for MEMORY. – Rick James Apr 07 '15 at 00:34
  • @RickJames True: "full-table scan" is more the term I'm looking for, and you have a point. I've tested on 5.6.20 and MariaDB 10.0 with `SELECT * FROM entries` returning ids in ascending order: hence the failure of the first query. Admittedly, the query as it is is risky at best, which reinforces the need for an alternative. – concat Apr 07 '15 at 01:13
  • Try adding a unique index on id,fenwick. It might just be the case that the optimizer decides if it needs fenwick, it better just reads the whole table instead of first index and then table row. See if explain then shows possible keys – Norbert Apr 07 '15 at 01:28
  • @NorbertvanNobelen It now says that it's using the PRIMARY index, but it continues to reference all of the rows. [Full output](http://pastebin.com/DSTr2Uej) (Inserting has slowed considerably, so I've decreased the number of rows to 2000.) – concat Apr 07 '15 at 01:47
  • Strange. Can you create an SQL Fiddle so I can see all code running (100 records perhaps). – Norbert Apr 07 '15 at 01:55
  • @NorbertvanNobelen My bad: I forgot to specify `ENGINE=MEMORY` when rebuilding the table for testing! I reran with a `MEMORY` table and `EXPLAIN` came back with the same results as the initial test. Nevertheless, [here's an SQLFiddle](http://sqlfiddle.com/#!9/be1f2/1). (`lft` calls have been replaced with the function's raw code) – concat Apr 07 '15 at 02:26
  • Note that the `PRIMARY KEY` is `HASH`. Hash keys are _not_ ordered, hence useless for this goal. Please try it with MyISAM or InnoDB. `MEMORY` prefers `Hash`; other Engines have only `BTree`, which is useful. – Rick James Apr 07 '15 at 04:45
  • @RickJames I'm not sure I follow. How would ordered keys like a `BTREE` help? – concat Apr 07 '15 at 05:57
  • If I understand the computation, you want to do a "cumulative sum" and stop when it reaches some value. Such a summation is not meaningful without spelling out what order to add the items in. – Rick James Apr 07 '15 at 06:52

4 Answers

3

Disclaimer

Fenwick trees are new to me; I only discovered them while finding this post. The results presented here are based on my understanding and some research, but I am by no means a Fenwick tree expert, and I might have missed things.

Reference materials

Explanation of how fenwick tree works

https://stackoverflow.com/a/15444954/1157540 reproduced from https://cs.stackexchange.com/a/10541/38148

https://cs.stackexchange.com/a/42816/38148

Usage of fenwick trees

https://en.wikipedia.org/wiki/Fenwick_tree

https://en.wikipedia.org/wiki/Prefix_sum

Problem 1, finding the cumulative weight of a given entry

Given the following table

CREATE TABLE `entries` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `weight` decimal(9,3) DEFAULT NULL,
  `fenwick` decimal(9,3) NOT NULL DEFAULT '0.000',
  PRIMARY KEY (`id`)
) ENGINE=INNODB;

and given data already populated (see the http://sqlfiddle.com/#!9/be1f2/1 provided by concat), how do we compute the cumulative weight for a given entry @entryid?

The key concept to understand here, is that the structure of the fenwick index is based on math and bitwise operations on the id values themselves.

Queries should typically use primary key lookups only (WHERE ID = value).

Any query using sorting (ORDER BY) or ranges (WHERE (value1 < ID) AND (ID < value2)) misses the point, and does not traverse the tree in the intended order.

For example, with the key 60:

SET @entryid := 60;

Let's decompose the value 60 in binary

mysql> SELECT (@entryid & 0x0080) as b8,
    ->        (@entryid & 0x0040) as b7,
    ->        (@entryid & 0x0020) as b6,
    ->        (@entryid & 0x0010) as b5,
    ->        (@entryid & 0x0008) as b4,
    ->        (@entryid & 0x0004) as b3,
    ->        (@entryid & 0x0002) as b2,
    ->        (@entryid & 0x0001) as b1;
+------+------+------+------+------+------+------+------+
| b8   | b7   | b6   | b5   | b4   | b3   | b2   | b1   |
+------+------+------+------+------+------+------+------+
|    0 |    0 |   32 |   16 |    8 |    4 |    0 |    0 |
+------+------+------+------+------+------+------+------+
1 row in set (0.00 sec)

In other words, keeping only the bits set, we have

32 + 16 + 8 + 4 = 60

Now, remove the lowest set bit, one at a time, to navigate the tree:

32 + 16 + 8 + 4 = 60
32 + 16 + 8 = 56
32 + 16 = 48
32

This gives the path (32, 48, 56, 60) to access element 60.

Note that transforming 60 to (32, 48, 56, 60) only requires bit math on the ID value itself: no access to the table or the database is needed, and this computation can be done in the client issuing the query.
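Alternatively, if the ID list is not generated client-side, the same descending walk can run entirely in the server as a stored procedure. A minimal sketch (the name fenwick_prefix_sum and its parameters are mine, not part of the original post):

DELIMITER //
CREATE PROCEDURE fenwick_prefix_sum(IN p_id INT, OUT p_sum DECIMAL(12,3))
BEGIN
  DECLARE n INT;
  SET n = p_id;
  SET p_sum = 0;
  -- strip the lowest set bit on every pass: 60 -> 56 -> 48 -> 32 -> stop
  WHILE n > 0 DO
    SET p_sum = p_sum + (SELECT fenwick FROM entries WHERE id = n);
    SET n = n - (n & -n);
  END WHILE;
END //
DELIMITER ;

-- CALL fenwick_prefix_sum(60, @s); SELECT @s;  -- 32.434 on the sample data

Each iteration is a single primary-key lookup, so the procedure performs O(lg n) point reads.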

The Fenwick weight of element 60 is then

mysql> select sum(fenwick) from entries where id in (32, 48, 56, 60);
+--------------+
| sum(fenwick) |
+--------------+
|       32.434 |
+--------------+
1 row in set (0.00 sec)

Verification

mysql> select sum(weight) from entries where id <= @entryid;
+-------------+
| sum(weight) |
+-------------+
|      32.434 |
+-------------+
1 row in set (0.00 sec)

Now, let's compare the efficiency of these queries.

mysql> explain select sum(fenwick) from entries where id in (32, 48, 56, 60);
+----+-------------+---------+------------+-------+---------------+---------+---------+------+------+----------+-------------+
| id | select_type | table   | partitions | type  | possible_keys | key     | key_len | ref  | rows | filtered | Extra       |
+----+-------------+---------+------------+-------+---------------+---------+---------+------+------+----------+-------------+
|  1 | SIMPLE      | entries | NULL       | range | PRIMARY       | PRIMARY | 4       | NULL |    4 |   100.00 | Using where |
+----+-------------+---------+------------+-------+---------------+---------+---------+------+------+----------+-------------+

or, presented differently

explain format=json select sum(fenwick) from entries where id in (32, 48, 56, 60);
{
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "5.61"
    },
    "table": {
      "table_name": "entries",
      "access_type": "range",
      "possible_keys": [
        "PRIMARY"
      ],
      "key": "PRIMARY",
      "used_key_parts": [
        "id"
      ],
      "key_length": "4",
      "rows_examined_per_scan": 4,
      "rows_produced_per_join": 4,
      "filtered": "100.00",
      "cost_info": {
        "read_cost": "4.81",
        "eval_cost": "0.80",
        "prefix_cost": "5.61",
        "data_read_per_join": "64"
      },
      "used_columns": [
        "id",
        "fenwick"
      ],
      "attached_condition": "(`test`.`entries`.`id` in (32,48,56,60))"
    }
  }
}

So, the optimizer fetched 4 rows by primary key (there are 4 values in the IN clause).

When not using the fenwick index, we have

mysql> explain select sum(weight) from entries where id <= @entryid;
+----+-------------+---------+------------+-------+---------------+---------+---------+------+------+----------+-------------+
| id | select_type | table   | partitions | type  | possible_keys | key     | key_len | ref  | rows | filtered | Extra       |
+----+-------------+---------+------------+-------+---------------+---------+---------+------+------+----------+-------------+
|  1 | SIMPLE      | entries | NULL       | range | PRIMARY       | PRIMARY | 4       | NULL |   60 |   100.00 | Using where |
+----+-------------+---------+------------+-------+---------------+---------+---------+------+------+----------+-------------+

Or, presented differently

explain format=json select sum(weight) from entries where id <= @entryid;
{
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "25.07"
    },
    "table": {
      "table_name": "entries",
      "access_type": "range",
      "possible_keys": [
        "PRIMARY"
      ],
      "key": "PRIMARY",
      "used_key_parts": [
        "id"
      ],
      "key_length": "4",
      "rows_examined_per_scan": 60,
      "rows_produced_per_join": 60,
      "filtered": "100.00",
      "cost_info": {
        "read_cost": "13.07",
        "eval_cost": "12.00",
        "prefix_cost": "25.07",
        "data_read_per_join": "960"
      },
      "used_columns": [
        "id",
        "weight"
      ],
      "attached_condition": "(`test`.`entries`.`id` <= (@`entryid`))"
    }
  }
}

Here the optimizer performed an index scan, reading 60 rows.

With ID=60, the benefit of the Fenwick tree is 4 fetches compared to 60.

Now, consider how this scales, with values up to 64K for example.

With a Fenwick tree, a 16-bit value has at most 16 bits set, so at most 16 elements need to be looked up.

Without it, a scan can read up to 64K entries (32K on average).

Problem 2, finding the entry for a given weight

The OP's problem was to find the entry for a given weight.

For example

SET @search_weight := 35.123;

To illustrate the algorithm, this post details how the lookups are done (sorry if this is too verbose).

SET @found_id := 0;

First, find how many entries there are.

SET @max_id := (select id from entries order by id desc limit 1);

In the test data, max_id is 156.

Because 128 <= max_id < 256, the highest bit to start the search is 128.
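If you prefer not to hard-code that starting value, it can be derived from @max_id (a sketch; LOG2 is a MySQL math function, and the floating-point rounding is fine for ids of this size):

SET @search_bit := 1 << FLOOR(LOG2(@max_id));
-- FLOOR(LOG2(156)) = 7, so @search_bit = 128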

mysql> set @search_id := @found_id + 128;
mysql> select id, fenwick, @search_weight,
    ->    if (fenwick <= @search_weight, "keep", "discard") as action
    ->    from entries where id = @search_id;
+-----+---------+----------------+---------+
| id  | fenwick | @search_weight | action  |
+-----+---------+----------------+---------+
| 128 |  66.540 |         35.123 | discard |
+-----+---------+----------------+---------+

Weight 66.540 is greater than our search weight, so 128 is discarded; move on to the next bit.

mysql> set @search_id := @found_id + 64;
mysql> select id, fenwick, @search_weight,
    ->    if (fenwick <= @search_weight, "keep", "discard") as action
    ->    from entries where id = @search_id;
+----+---------+----------------+--------+
| id | fenwick | @search_weight | action |
+----+---------+----------------+--------+
| 64 |  33.950 |         35.123 | keep   |
+----+---------+----------------+--------+

Here we need to keep this bit (64) and subtract the weight found from the search weight:

set @found_id := @search_id, @search_weight := @search_weight - 33.950;

Then continue to the next bits:

mysql> set @search_id := @found_id + 32;
mysql> select id, fenwick, @search_weight,
    ->    if (fenwick <= @search_weight, "keep", "discard") as action
    ->    from entries where id = @search_id;
+----+---------+----------------+---------+
| id | fenwick | @search_weight | action  |
+----+---------+----------------+---------+
| 96 |  16.260 |          1.173 | discard |
+----+---------+----------------+---------+

mysql> set @search_id := @found_id + 16;
mysql> select id, fenwick, @search_weight,
    ->    if (fenwick <= @search_weight, "keep", "discard") as action
    ->    from entries where id = @search_id;
+----+---------+----------------+---------+
| id | fenwick | @search_weight | action  |
+----+---------+----------------+---------+
| 80 |   7.394 |          1.173 | discard |
+----+---------+----------------+---------+

mysql> set @search_id := @found_id + 8;
mysql> select id, fenwick, @search_weight,
    ->    if (fenwick <= @search_weight, "keep", "discard") as action
    ->    from entries where id = @search_id;
+----+---------+----------------+---------+
| id | fenwick | @search_weight | action  |
+----+---------+----------------+---------+
| 72 |   3.995 |          1.173 | discard |
+----+---------+----------------+---------+

mysql> set @search_id := @found_id + 4;
mysql> select id, fenwick, @search_weight,
    ->    if (fenwick <= @search_weight, "keep", "discard") as action
    ->    from entries where id = @search_id;
+----+---------+----------------+---------+
| id | fenwick | @search_weight | action  |
+----+---------+----------------+---------+
| 68 |   1.915 |          1.173 | discard |
+----+---------+----------------+---------+

mysql> set @search_id := @found_id + 2;
mysql> select id, fenwick, @search_weight,
    ->    if (fenwick <= @search_weight, "keep", "discard") as action
    ->    from entries where id = @search_id;
+----+---------+----------------+--------+
| id | fenwick | @search_weight | action |
+----+---------+----------------+--------+
| 66 |   1.146 |          1.173 | keep   |
+----+---------+----------------+--------+

We found another bit here

set @found_id := @search_id, @search_weight := @search_weight - 1.146;

mysql> set @search_id := @found_id + 1;
mysql> select id, fenwick, @search_weight,
    ->    if (fenwick <= @search_weight, "keep", "discard") as action
    ->    from entries where id = @search_id;
+----+---------+----------------+--------+
| id | fenwick | @search_weight | action |
+----+---------+----------------+--------+
| 67 |   0.010 |          0.027 | keep   |
+----+---------+----------------+--------+

And one more

set @found_id := @search_id, @search_weight := @search_weight - 0.010;

The final search result is:

mysql> select @found_id, @search_weight;
+-----------+----------------+
| @found_id | @search_weight |
+-----------+----------------+
|        67 |          0.017 |
+-----------+----------------+

Verification

mysql> select sum(weight) from entries where id <= 67;        
+-------------+                                               
| sum(weight) |                                               
+-------------+                                               
|      35.106 |                                               
+-------------+                                               

mysql> select sum(weight) from entries where id <= 68;
+-------------+
| sum(weight) |
+-------------+
|      35.865 |
+-------------+

And indeed,

35.106 (prefix sum at id 67) <= 35.123 (search weight) <= 35.865 (prefix sum at id 68)

The search looks up values to resolve one bit at a time, and each lookup result decides the next ID to search.

The queries given here are for illustration. In a real application, the code should just be a loop that contains:

SELECT fenwick from entries where id = ?;

with the application code (or a stored procedure) implementing the logic related to @found_id, @search_id and @search_weight.
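A minimal sketch of that loop as a stored procedure (the name fenwick_find and its local variables are mine; it simply replays the keep/discard decisions shown above):

DELIMITER //
CREATE PROCEDURE fenwick_find(IN p_weight DECIMAL(12,3), OUT p_id INT)
BEGIN
  DECLARE v_max INT;
  DECLARE v_bit BIGINT;
  DECLARE v_fenwick DECIMAL(12,3);
  SELECT MAX(id) INTO v_max FROM entries;
  -- highest power of two not exceeding the largest id
  SET v_bit = 1;
  WHILE v_bit * 2 <= v_max DO
    SET v_bit = v_bit * 2;
  END WHILE;
  -- resolve one bit per primary-key lookup, keeping it when fenwick <= remaining weight
  SET p_id = 0;
  WHILE v_bit > 0 DO
    SET v_fenwick = (SELECT fenwick FROM entries WHERE id = p_id + v_bit);
    IF v_fenwick IS NOT NULL AND v_fenwick <= p_weight THEN
      SET p_id = p_id + v_bit;
      SET p_weight = p_weight - v_fenwick;
    END IF;
    SET v_bit = v_bit DIV 2;
  END WHILE;
END //
DELIMITER ;

-- CALL fenwick_find(35.123, @found); SELECT @found;  -- 67 on the sample data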

General comments

  • Fenwick trees are used for prefix computations. It only makes sense to use these trees if the problem to solve involves prefixes in the first place. Wikipedia has some pointers for applications. The OP did not describe what the Fenwick tree was used for, unfortunately.
  • Fenwick trees are a tradeoff between lookup complexity and update complexity, so they only make sense for data that is not static. For static data, computing a full prefix sum once will be more efficient. (A point-update sketch follows this list.)
  • The tests performed used an INNODB table, for which the primary key index is sorted, so computing max_id is a simple O(1) operation. If max_id is already available elsewhere, I think even using a MEMORY table with a HASH index on ID will work, as only primary key lookups are used.
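To illustrate that tradeoff: changing one weight only touches the O(lg n) Fenwick cells above it, walking upward by adding the lowest set bit each time. A sketch of such a point update (the procedure name is mine, not from the original post):

DELIMITER //
CREATE PROCEDURE fenwick_update(IN p_id INT, IN p_new_weight DECIMAL(9,3))
BEGIN
  DECLARE v_delta DECIMAL(12,3);
  DECLARE v_max INT;
  DECLARE v_i INT;
  SELECT MAX(id) INTO v_max FROM entries;
  SELECT p_new_weight - weight INTO v_delta FROM entries WHERE id = p_id;
  UPDATE entries SET weight = p_new_weight WHERE id = p_id;
  -- propagate the delta upward: i, i + lowbit(i), ... while i <= max id
  SET v_i = p_id;
  WHILE v_i <= v_max DO
    UPDATE entries SET fenwick = fenwick + v_delta WHERE id = v_i;
    SET v_i = v_i + (v_i & -v_i);
  END WHILE;
END //
DELIMITER ;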

P.S.

sqlfiddle is down today, so here is the raw data used (originally provided by concat) so that people interested can re-run the tests.

INSERT INTO `entries` VALUES (1,0.480,0.480),(2,0.542,1.022),(3,0.269,0.269),(4,0.721,2.012),(5,0.798,0.798),(6,0.825,1.623),(7,0.731,0.731),(8,0.181,4.547),(9,0.711,0.711),(10,0.013,0.724),(11,0.930,0.930),(12,0.613,2.267),(13,0.276,0.276),(14,0.539,0.815),(15,0.867,0.867),(16,0.718,9.214),(17,0.991,0.991),(18,0.801,1.792),(19,0.033,0.033),(20,0.759,2.584),(21,0.698,0.698),(22,0.212,0.910),(23,0.965,0.965),(24,0.189,4.648),(25,0.049,0.049),(26,0.678,0.727),(27,0.245,0.245),(28,0.190,1.162),(29,0.214,0.214),(30,0.502,0.716),(31,0.868,0.868),(32,0.834,17.442),(33,0.566,0.566),(34,0.327,0.893),(35,0.939,0.939),(36,0.713,2.545),(37,0.747,0.747),(38,0.595,1.342),(39,0.733,0.733),(40,0.884,5.504),(41,0.218,0.218),(42,0.437,0.655),(43,0.532,0.532),(44,0.350,1.537),(45,0.154,0.154),(46,0.721,0.875),(47,0.140,0.140),(48,0.538,8.594),(49,0.271,0.271),(50,0.739,1.010),(51,0.884,0.884),(52,0.203,2.097),(53,0.361,0.361),(54,0.197,0.558),(55,0.903,0.903),(56,0.923,4.481),(57,0.906,0.906),(58,0.761,1.667),(59,0.089,0.089),(60,0.161,1.917),(61,0.537,0.537),(62,0.201,0.738),(63,0.397,0.397),(64,0.381,33.950),(65,0.715,0.715),(66,0.431,1.146),(67,0.010,0.010),(68,0.759,1.915),(69,0.763,0.763),(70,0.537,1.300),(71,0.399,0.399),(72,0.381,3.995),(73,0.709,0.709),(74,0.401,1.110),(75,0.880,0.880),(76,0.198,2.188),(77,0.348,0.348),(78,0.148,0.496),(79,0.693,0.693),(80,0.022,7.394),(81,0.031,0.031),(82,0.089,0.120),(83,0.353,0.353),(84,0.498,0.971),(85,0.428,0.428),(86,0.650,1.078),(87,0.963,0.963),(88,0.866,3.878),(89,0.442,0.442),(90,0.610,1.052),(91,0.725,0.725),(92,0.797,2.574),(93,0.808,0.808),(94,0.648,1.456),(95,0.817,0.817),(96,0.141,16.260),(97,0.256,0.256),(98,0.855,1.111),(99,0.508,0.508),(100,0.976,2.595),(101,0.353,0.353),(102,0.840,1.193),(103,0.139,0.139),(104,0.178,4.105),(105,0.469,0.469),(106,0.814,1.283),(107,0.664,0.664),(108,0.876,2.823),(109,0.390,0.390),(110,0.323,0.713),(111,0.442,0.442),(112,0.241,8.324),(113,0.881,0.881),(114,0.681,1.562),(115,0.760,0.760),(116,0.760,3.082),(117,0.518,0.518),(118,0.313,0.831),(119,0.008,0.008),(120,0.103,4.024),(121,0.488,0.488),(122,0.135,0.623),(123,0.207,0.207),(124,0.633,1.463),(125,0.542,0.542),(126,0.812,1.354),(127,0.433,0.433),(128,0.732,66.540),(129,0.358,0.358),(130,0.594,0.952),(131,0.897,0.897),(132,0.701,2.550),(133,0.815,0.815),(134,0.973,1.788),(135,0.419,0.419),(136,0.175,4.932),(137,0.620,0.620),(138,0.573,1.193),(139,0.004,0.004),(140,0.304,1.501),(141,0.508,0.508),(142,0.629,1.137),(143,0.618,0.618),(144,0.206,8.394),(145,0.175,0.175),(146,0.255,0.430),(147,0.750,0.750),(148,0.987,2.167),(149,0.683,0.683),(150,0.453,1.136),(151,0.219,0.219),(152,0.734,4.256),(153,0.016,0.016),(154,0.874,0.891),(155,0.325,0.325),(156,0.002,1.217);

P.S. 2

Now with a full sqlfiddle:

http://sqlfiddle.com/#!9/d2c82/1

Marc Alff
  • My goodness, that is one brilliant binary search. Awesome answer! One thing to add: since `id` *must* be sequential, `@max_id` is just the size of the table, which is calculated in O(1). Amazingly, this means we can make `entries` a `MEMORY` table and have true O(lg n) searching! – concat Sep 03 '15 at 21:11
  • @concat, True, max_id is the table size, which simplifies things. Don't forget to accept the answer if the problem is resolved. – Marc Alff Sep 04 '15 at 07:12
  • Of course! I just thought I would do performance tests of some query variants first to build on your results. I've designed a query to generate the summation indexes using pure SQL to allow for use within stored procedures. I've included it in a separate answer for formatting purposes. – concat Sep 04 '15 at 16:45
0

(Answer box used because it has the option of formatting).

The ENGINE is the main issue in this case, as Rick indicated. You can influence the type of index created by specifying USING BTREE in the index creation (BTREE vs. HASH does not really seem to matter a lot in this case: if you iterate through a range, BTREE is optimal; if you retrieve single values, HASH is optimal, and your query has both behaviours).

When you switch to INNODB, the caches will probably make the query just as quick as a MEMORY table, and you then have the benefit of an index. To guarantee BTREE indexing, I would create the schema as follows:

CREATE TABLE `entries` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `weight` decimal(9,3) DEFAULT NULL,
  `fenwick` decimal(9,3) NOT NULL DEFAULT '0.000',
  PRIMARY KEY (`id`)
) ENGINE=INNODB DEFAULT CHARSET=latin1;

CREATE UNIQUE INDEX idx_nn_1 ON entries (id,fenwick) USING BTREE;

This uses the idx_nn_1 index for the main calculation (and only the index: the table itself is not touched at all, since all the data is in the index). The sample size of 100 records is, however, too small to give any definitive answer regarding performance. The time the index takes to build, compared to just accessing the table directly, may be such that you see no performance gain at all. The final answer will come from your own tests.

Other database engines (SQL Server, Oracle, Postgres) will show similar behaviour, so switching to any of them will not make a huge difference, except perhaps for better processing in general (there is no way of predicting this).

SQL Server might be a bit better (= faster) at building the index, since it would just use a unique index on id and include the fenwick value, thus not having to actually index that value.

Oracle can really force indexes, but this is not advised: in Oracle, assuming ordered data in the table, reading the table is quicker than reading the index and then the table for a lookup. Again, in this scenario you can just add the id,fenwick index and never access the table. With index creation time taken into account, Oracle would have to read the full table once anyway, and in that time (or less, depending on how many records it needs to reach your exit condition) it would also have completed your calculation.

Norbert
  • `CREATE UNIQUE INDEX idx_nn_1 ON entries (id,fenwick) USING BTREE;` does not add much -- The `PRIMARY KEY(id)` is a `UNIQUE KEY`. The index will be only slightly smaller than the data. – Rick James Apr 07 '15 at 15:05
  • There is still the problem of getting the query to scan the data according to the desired order -- so that the @variables will add the right pieces. – Rick James Apr 07 '15 at 15:08
  • @RickJames : Right, the index is only slightly smaller, so I do not expect much of a difference except that MySQL can be really slow in full table scans compared to index scans (again: Case dependent as with many performance issues). – Norbert Apr 07 '15 at 16:37
  • @RickJames Assuming the Fenwick tree is built correctly, and we can prevent the query from performing a table scan, the order doesn't matter. The prefix sum of any given element is calculated from elements from a fixed set of keys. In the context of random entry selection, order doesn't matter for the naive linear query either. Is your concern the order of the entries in the table scan? – concat Apr 07 '15 at 16:44
  • The first query looked like it was summing up values until some threshold (@r), then it wanted to abort the table scan. I _assumed_ those needed to be ordered by smallest first (or something). (I am ignorant of Fenwick.) If it is ok to `SUM` a random subset of the rows, then much of what I have said is irrelevant. – Rick James Apr 07 '15 at 18:58
0

Is the Fenwick tree static enough to precompute some stuff? If so, I can give you a virtually O(1) solution:

  1. build a 2-column table (n, cumulative_sum)
  2. prepopulate the table: (1, 0.851), (2, 2.574), (3, 2.916), (4, 3.817), ...
  3. Create an index on the cumulative sum column (a sketch of all three steps follows)
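A sketch of those three steps (table and column names match the lookup query below; DECIMAL is used to match entries, and the running total relies on the same @variable accumulation trick as the question's queries, which MySQL does not strictly guarantee):

CREATE TABLE tbl (
  n INT NOT NULL PRIMARY KEY,
  cumulative_sum DECIMAL(12,3) NOT NULL,
  KEY idx_cumulative_sum (cumulative_sum)
);

INSERT INTO tbl (n, cumulative_sum)
SELECT e.id, (@cs := @cs + e.weight)
FROM entries e
CROSS JOIN (SELECT @cs := 0) init
ORDER BY e.id;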

Then to lookup:

SELECT n FROM tbl WHERE cumulative_sum > 3.325
         ORDER BY cumulative_sum LIMIT 1;

If there is trouble with @variables, then have the stored procedure construct the SQL via CONCAT, PREPARE, and EXECUTE.

Addenda

Given that it is a periodic total replacement, compute the cumulative sums as you rebuild the table. My SELECT looks at only one row, so it is O(1) (ignoring the BTree lookup).

For "total replacement", I recommend:

CREATE TABLE new LIKE real;
load the data into `new`                -- this is the slowest step
RENAME TABLE real TO old, new TO real;  -- atomic and fast, so no "downtime"
DROP TABLE old;
Rick James
  • The data is unfortunately not static, so the tradeoff for the slightly faster `O(lg n)` select with pre-computed tables is `Θ(n lg n)` time BTREE construction and worst-case `O(n lg n)` time update (vs. Fenwick's amortized `O(n)` and `O(lg n)` respectively) which is bound to happen often. – concat Aug 30 '15 at 04:06
  • How is the data updated? One random weight at a time? Replace the entire set periodically? Other? – Rick James Aug 30 '15 at 04:38
  • Regular, periodic total replacement. Entries are sometimes deleted (for which, with a Fenwick tree implementation, I would swap with the terminal entry, recalculate in `O(lg n)`, and pop) – concat Aug 30 '15 at 16:05
  • If there are lot more `SELECTs` than "replacements" and deletes, then it should be cost-effective to precompute the sums. (See Addenda) – Rick James Aug 31 '15 at 14:54
0

Just to add on to Marc's answer: for stored procedures or functions, where the list of indexes to sum over cannot be passed directly through the function arguments, we can generate the indexes within a query and JOIN it to the summation query:

SELECT SUM(fenwick) FROM entries 
    CROSS JOIN (SELECT @n:=60) a          -- @n starts at the id whose prefix sum we want
    INNER JOIN (
          SELECT @n AS id, 0              -- emit the starting id itself
          UNION 
          SELECT 
              -- scan bits from lowest to highest; whenever a set bit is found,
              -- clear it and emit the resulting ancestor id
              IF(@n&@bit<>0, @n:=@n-(@n&@bit), NULL) AS id,
              (@bit:=@bit<<1) 
          FROM entries 
          CROSS JOIN (SELECT @bit:=1) a 
          LIMIT 32                        -- 32 rows cover every bit of an INT id
    ) dt ON dt.id=entries.id;             -- NULL and 0 ids never match an entry

I expect the performance to be similar, and there is no longer a need for the client to generate the indexes.

concat