3

My question is regarding the handling of MySQL index on VARCHAR combined with an int COLUMN when using prefix matching. e.g. if I have such a query:

SELECT * FROM tbl WHERE name LIKE 'query%' ORDER BY weight DESC LIMIT 5

Considering I have one index one name->weight, does that index need to find all apperances of the prefix query and then ORDER BY, or does he keeps the cross calculation indexed even with the use of prefix matching (%). I'm troubled by it, because for popular names (e.g. query=john) I might find myself searching for a long time for all appearances of john, and that will make the limit useless and the query to become slow as I'm dealing with a large dataset.

Noam
  • 3,341
  • 4
  • 35
  • 64

2 Answers2

1

You asked another question "Creating an Index that is best for wildcard search through 40Million names". Okay, you have 40 million records.

Now consider following formula:

x = COUNT(DISTINCT values in a column) / COUNT(values in a column)

An index on a column is much the better, the nearer x is to 1. If it is 1, all values are distinct, there are no duplicates and an index is therefore quite fast.

Now you are looking for 'john%'. That's 4 letters and an open end. Which letters is not important, your DB has to deal with 26*26*26*26=456976 distinct values. Put that in above formula and your 40 million records. You get an x of 0,0114244.

I don't know what's the threshold again, but IIRC it's 0,1 or something. So, if you're x is above 0,1 the index is used, if it's lower, it's not.

Why is that so? Using an index can even slow things down, cause your DB has to look at the index, see in that index on which position on your physical hard drive the appropriate record is and then get that record. Therefore, when x is below 10% it's faster just to do a whole table scan.

To summarize: Filtering 40 million records with only one weak index like yours is simply useless.

fancyPants
  • 50,732
  • 33
  • 89
  • 96
  • [How MySQL Optimizes `WHERE` Clauses](http://dev.mysql.com/doc/en/where-optimizations.html): "*Each table index is queried, and the best index is used unless the optimizer believes that it is more efficient to use a table scan. At one time, a scan was used based on whether the best index spanned more than 30% of the table, but a fixed percentage no longer determines the choice between using an index or a scan. The optimizer now is more complex and bases its estimate on additional factors such as table size, number of rows, and I/O block size.*" – eggyal Sep 06 '12 at 09:24
  • Thanks eggyal, so it was 0,3 and the information is a bit outdated. Anyway I think my point is still valid. – fancyPants Sep 06 '12 at 09:28
  • Agreed, I was just adding for clarification. :) – eggyal Sep 06 '12 at 09:31
  • @tombom what approach would you recommend? – Noam Sep 06 '12 at 10:04
  • Really hard to tell without having very much information about your schema, use cases and so on. Maybe lookup tables are an option with hash index. Easiest would of course be to filter the query as much as possible with an appropriate compound index, but all that is so much to ask and consider that you better ask a different question for that. – fancyPants Sep 06 '12 at 10:24
1

Provided that 'query' is of equal or shorter length than the indexed prefix of name:

  1. A composite BTREE index on (name, weight) will be ordered by name then weight. Conceptually:

    +---------+--------+---------+
    | name(7) | weight | address |
    +---------+--------+---------+
    | queryaa |    500 | 0x1.... |
    | queryaa |    500 | 0xe.... |
    | queryaa |    498 | 0x8.... |
    | queryaa |    491 | 0xb.... |
    | queryaa |    486 | 0xc.... |
    | queryaa |    430 | 0x3.... |
    | queryab |    600 | 0x2.... |
    | queryab |    592 | 0x7.... |
    | queryab |    550 | 0x4.... |
    | queryab |    321 | 0xa.... |
    | queryab |    321 | 0x6.... |
    | queryab |    304 | 0x9.... |
    | queryab |    297 | 0x5.... |
    | querybc |    800 | 0xd.... |
    :         :        :         :
    
  2. MySQL can very quickly traverse such an index to find the top 5 weights for each indexed prefix within the range defined by the filter name LIKE 'query%' (I'm not certain that it does this step, but I'd be surprised if it did not):

    +---------+--------+---------+
    | name(7) | weight | address |
    +---------+--------+---------+
    | queryaa |    500 | 0x1.... |
    | queryaa |    500 | 0xe.... |
    | queryaa |    498 | 0x8.... |
    | queryaa |    491 | 0xb.... |
    | queryaa |    486 | 0xc.... |
    | queryab |    600 | 0x2.... |
    | queryab |    592 | 0x7.... |
    | queryab |    550 | 0x4.... |
    | queryab |    321 | 0xa.... |
    | queryab |    321 | 0x6.... |
    | querybc |    800 | 0xd.... |
    :         :        :         :
    
  3. At this point, MySQL must perform a filesort on the results:

    +---------+--------+---------+
    | name(7) | weight | address |
    +---------+--------+---------+
    | querybc |    800 | 0xd.... |
    | queryab |    600 | 0x2.... |
    | queryab |    592 | 0x7.... |
    | queryab |    550 | 0x4.... |
    | queryaa |    500 | 0x1.... |
    | queryaa |    500 | 0xe.... |
    | queryaa |    498 | 0x8.... |
    | queryaa |    491 | 0xb.... |
    | queryaa |    486 | 0xc.... |
    | queryab |    321 | 0xa.... |
    | queryab |    321 | 0x6.... |
    :         :        :         :
    
  4. And only then can it use the top 5 results to fetch the associated records from the table:

    +---------+--------+---------+
    | name(7) | weight | address |
    +---------+--------+---------+
    | querybc |    800 | 0xd.... |  --> fetch from table
    | queryab |    600 | 0x2.... |  --> fetch from table
    | queryab |    592 | 0x7.... |  --> fetch from table
    | queryab |    550 | 0x4.... |  --> fetch from table
    | queryaa |    500 | 0x1.... |  --> fetch from table
    +---------+--------+---------+
    

If 'query' is longer than the indexed prefix of name, then MySQL must perform lookups into the table in step 1 above in order to adequately filter the records which are subsequently ordered.

eggyal
  • 122,705
  • 18
  • 212
  • 237
  • So you're actually saying it'll need to find all apperances of words matching the query, making queries like 'john%' ORDER BY weight LIMIT 5 really hard for him? – Noam Sep 06 '12 at 16:41
  • @Noam: I'm saying it depends on the length of the indexed prefix. – eggyal Sep 07 '12 at 11:10