154

I'm just about to write a query that includes a WHERE isok=1. As the name implies, isok is a boolean field (actually a TINYINT(1) UNSIGNED that is set to 0 or 1 as needed).

Is there any performance gain in indexing this field? Would the engine (InnoDB in this case) perform better or worse looking up the index?

Niet the Dark Absol
  • 320,036
  • 81
  • 464
  • 592

9 Answers9

180

Just to put a finer point on several other answers here, since in my experience, those looking at questions like this are in the same boat we were, we've all heard that indexing Boolean fields is pointless, and yet...

We have a table with about 4 million rows, only about 1000 or so at a time will have a Boolean switch flagged and that's what we search against. Adding an index on our Boolean field sped up queries by orders of magnitude, it went from about 9+ seconds to a fraction of a second.

oucil
  • 4,211
  • 2
  • 37
  • 53
  • Yes, while you should definitively try to understand the 'why' of things, always measure alongside and try out different things on your actual dataset to see whether your theory matches up with the db engine's actual behavior (you'd be surprised...) – Eelco Mar 09 '15 at 14:10
  • 11
    @Eelco You are right, but in this case, the result actually matches up with the basic theory well. The basic idea that it should be negligible only makes sense if you are about 50% likely to come across items matching your search. Then, to find 100 matches, the DB needs to iterate 200 items. But if the items only match 1% of the time, it would need to iterate 10,000 items. – mahemoff Sep 28 '15 at 11:10
  • `WHERE my_col > 0 ` instead of `my_col = 1` also seems to help speed – Aaron Sep 22 '19 at 03:26
  • @Aaron how, why? – Iulius Curt Feb 09 '23 at 12:32
111

Not really. You should think about it like a book. If there were only 3 kinds of words in a book and you index all of them, you would have the same number of index pages as normal pages.

There would be a performance gain if there are relatively few records of one value. For example, if you have 1000 records and 10 of them are TRUE, then it would be useful if you searching with isok = 1

As Michael Durrant mentioned, it also makes writes slower.

EDIT: Possible duplication: Indexing boolean fields

Here it explains that even if you have an index, if you have too many records it doesn't use the index anyways. MySQL not using index when checking = 1 , but using it with = 0

Community
  • 1
  • 1
Michael Koper
  • 9,586
  • 7
  • 45
  • 59
  • 5
    This is not entirely correct, without an index mySql needs to scan the whole table to find the relevant rows. – ilanco May 09 '12 at 22:04
  • 5
    otherwise it would scan the whole index. (which is just as long in most cases) – Michael Koper May 09 '12 at 22:06
  • If the boolean is indexed, it'll be stored in memory, whereas if it's not indexed you have to go to disk which is slower. – Ed Massey Nov 18 '14 at 17:35
  • 2
    It can make a difference. Just cut the execution time in half of a query just by adding an index, and writes are rare and cheap enough that we don't really care about the penalty. As with everything, don't assume, measure (also because databases don't always actually behave like you'd logically expect them to) – Eelco Mar 09 '15 at 14:08
  • 6
    This assumes equal distribution between TRUE and FALSE. As mentioned by @oucil below, if you are looking for a boolean value which is fairly rare, it could still take a while. Not saying you should always index, but I would assume the nature of your data and your queries also matters under most database engines. – mahemoff Sep 28 '15 at 11:12
  • 2
    @EdMassey - No, the location in RAM vs Disk is not that simple. All blocks (either data or index) are "cached" as needed in the buffer_pool. So any one block may, or may not, be in memory. – Rick James Dec 13 '18 at 04:41
  • I thinks it's really depends on what you are tring to do. and how db store the data and How the db are going to excute the sql (which varies between dbs and the config choose.). Can't say it's totally unuseful. –  Jan 21 '20 at 06:55
41

It depends on the actual queries and the selectivity of the index/query combination.

Case A: condition WHERE isok = 1 and nothing else there:

SELECT *
FROM tableX
WHERE isok = 1
  • If the index is selective enough (say you have 1M rows and only 1k have isok = 1), then the SQL engine will probably use the index and be faster than without it.

  • If the index is not selective enough (say you have 1M rows and more than 100k have isok = 1), then the SQL engine will probably not use the index and do a table scan.

Case B: condition WHERE isok = 1 and more stuff:

SELECT *
FROM tableX
WHERE isok = 1
  AND another_column = 17

Then, it depends on what other indexes you have. An index on another_column would probably be more selective than the index on isok which has only two possible values. An index on (another_column, isok) or (isok, another_column) would be even better.

ypercubeᵀᴹ
  • 113,259
  • 19
  • 174
  • 235
19

It depends on the distribution of the data.

Imagine I had a book with 1000 closely typed pages, and the only words in my book were 'yes' and 'no' repeated over and over and distributed randomly. If I was asked to circle all the instances of 'yes', would an index in the back of the book help? It depends.

If there was a half-and-half random distribution of yes's and no's, then looking up in the index wouldn't help. The index would make the book a lot bigger, and anyway I'd be quicker just to start from the front and work my way through each page looking for all the instances of 'yes' and circling them, rather than looking up each item in the index and then taking the reference from the index entry to the page that it refers to.

But if there were, say, just ten instances of 'yes' in my thousand-page book and everything else was just millions of no's, then an index would save me loads of time in finding those ten instances of 'yes' and circling them.

It's the same in databases. If it's a 50:50 distribution, then an index isn't going to help - the database engine is better off just ploughing through the data from start to finish (full table scan), and the index would just make the database bigger, and slower to write and update. But if it is something like a 4000:1 distribution (as per oucil in this thread), then an index seek can speed it up hugely, if it is the 1 in 4000 items that you are looking for.

Jinlye
  • 1,774
  • 18
  • 19
  • -1 because I'm not persuaded. If I were given a book with milion pages, each page would contain a single word, yes or no 50/50 distribution, I might still benefit from comparing it with a sorted list of pages that I can keep skiping. But generally I'd say it's hard to predict your distribution anyway, so unless the index is a mistake I'd feel like putting it in before getting into trouble. The main question is whether we filter by the column or not, I'd leave the distribution type of logic on the DB engine if it decides to use the index or not. – Daniel Katz Jun 24 '22 at 15:30
  • Most real-world data have predictable distribution. Coin tosses are 50:50; aircraft landings resulting in total hull loss are one in a million. Indices are not free - they have computational and storage expenses, which may or may not be important in a particular application. Is an index right for my boolean field in my data on my database hosted on my platform? It depends... – Jinlye Jun 25 '22 at 20:03
  • @DanielKatz - You might consider reading it again. It is very persuasive. You are right that a 50:50 distribution won't matter. But in a 1000000:1 distribution, finding that 1 in a million is where the index helps. – OCDev Nov 25 '22 at 14:25
5

No, usually not.

You usually index fields for searching when they have high selectivity/cardinality. A boolean field's cardinality is very low in most tables. It would also make your writes fractionally slower.

Michael Durrant
  • 93,410
  • 97
  • 333
  • 497
3

Actually this depends on queries you run. But, generally yes, as well as indexing a field of any other type.

Maksym Polshcha
  • 18,030
  • 8
  • 52
  • 77
1

Yes, For two values let's say cardinality is 1:100 i.e for 1 true there are 100 false columns then gain would be significant. In one of my table, the ratio was 1:10000 and table was bloated i.e lot of large jsons and HTML(64KB) was present in column, The select * query with index brought query from 10sec to 1 ms for data with cardinality 1.

NIshank
  • 97
  • 6
-1

Yes an index will improve performance, check the output of EXPLAIN with and without the index.

From the docs:

Indexes are used to find rows with specific column values quickly. Without an index, MySQL must begin with the first row and then read through the entire table to find the relevant rows. The larger the table, the more this costs. If the table has an index for the columns in question, MySQL can quickly determine the position to seek to in the middle of the data file without having to look at all the data.

I think it's also safe to say an index will not DECREASE performance in this case, so you have only to gain from it.

ilanco
  • 9,581
  • 4
  • 32
  • 37
  • 2
    An index gives a lot of data on the harddisk and it makes writes slower so you dont only gain from it. – Michael Koper May 09 '12 at 22:10
  • 1
    True, but in this case, a `TINYINT(1) UNSIGNED` column, the size of the data will be small. – ilanco May 09 '12 at 22:13
  • And the added write overhead probably pretty low – Eelco Mar 09 '15 at 14:11
  • 1
    Isn't the size of the index going to grow with the number of rows it points to, not just the size of the indexed field? – poolie May 08 '16 at 21:07
  • This is wrong and poster did not check. MySQL does not use index for booleans (unless only few rows are queried, I thought I read somewhere < 30%, but in my test case I had 27% queried rows, but it did not use index). Example, I have a index on a boolean field and this is the output from "explain": id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | | 1 | SIMPLE | part_numbers | ALL | NULL | NULL | NULL | NULL | 586275 | Using where (You see no index was used) – Fallenhero Oct 20 '22 at 08:23
-3

Index is just a map. It is O(1) to retrieve all rows having is_xxx set to true. W/o it you would need to scan the entire table and check this predicate against every single row in it which is O(n)