1

The following question is framed with a particular lean towards MySQL and PostgreSQL, but I'd also be interested in answers regarding other database systems.

I'm designing a database and the SET column type appears to fit the bill in a few cases. One such example could be expressed as a boolean column for each day of the week, and I'm thinking of instead using MySQL's SET, SET('Sun','Mon','Tue','Wed','Thu','Fri','Sat').

Is an index on such a SET column useful? Would it speed up searches for rows matching individual days of the week? Particular combinations of days of the week? Or would it only speed up searches for full exact binary values of the field (such as 0101010 for Mon/Wed/Fri)?

Evan Carroll
  • 78,363
  • 46
  • 261
  • 468
tremby
  • 9,541
  • 4
  • 55
  • 74
  • 2
    It's better to ask about one database than several as often they have quirks that steer you towards a particular solution. Indexes are useful. `SET` values not so much. Why not just a `TINYINT` that represents the day rather than some string expression? – tadman Apr 10 '17 at 01:05
  • Whether or not an index is useful depends on the types of queries you're doing and on the distribution of values in the column. For example, if you column is 90% `'Mon'` then an index probably won't be that useful (unless all you're queries are `c <> 'Mon'`). Many "is an index useful?" questions tend to be answered by looking at query plans and your actual data. – mu is too short Apr 10 '17 at 01:24
  • @tadman, a `TINYINT` representing a day would not be useful. Seven `TINYINT` columns, perhaps, and I specifically mentioned this in my question. But isn't this exactly what `SET` is for -- to provide a way to represent a set of related booleans? – tremby Apr 10 '17 at 01:45
  • You can always use bitmasks for that, but Postgres `ARRAY` columns work very well for the same thing and are nicely indexed. MySQL isn't as smart at optimizing. – tadman Apr 10 '17 at 01:48
  • @muistooshort, I gave three examples. 1: queries for rows where (at least) 'Mon' is true. 2: queries for rows where (at least) 'Mon' AND 'Tue' are true. 3: queries for an exact set of values of all seven days such as 'Mon','Wed','Fri' but not 'Sun', 'Tue', 'Thu', or 'Sat'. As for distribution of values in the column, I expect there to be a lot with a single day true (mostly either 'Sat' or 'Sun' or 'Mon'), and a lot of rows with one of those plus another day late in the week, and a scattering of rows with all seven days true. – tremby Apr 10 '17 at 01:49
  • @tadman, I see. A Postgres `ARRAY` wouldn't work like a real set, though, would it? Wouldn't `['Mon','Mon']` be a valid value there, at the database level? – tremby Apr 10 '17 at 01:50
  • You need to walk the line between being overly paranoid and being performant. Evan's answer is what I'd suggest. Short identifiers are longer than things like `Mon`. `SUMTWHF` is an example of single-letter identifiers, but `0123456` also works so long as you have consistent conventions. – tadman Apr 10 '17 at 01:52
  • @tadman, I don't know what you're getting at when you talk about identifiers. I wasn't asking about those. – tremby Apr 10 '17 at 02:41

2 Answers2

2

Using PostgreSQL

Logically, if you wanted to only test for = the binary solution is the fastest. But, that's not to useful.

If not, you're probably better storing them as

  1. an array of enum,
  2. just simply as individual boolean fields. You can even use a bloom index.

In PostgreSQL you can create an enum type and then have an array of enum types. An index will speed this up.

CREATE TYPE dow AS ENUM ('M', 'Tu', 'W', 'Th', 'F', 'Sa', 'Su' );
CREATE TABLE foo ( days dow[] );

This would permit you to find all available Mondays with

SELECT * FROM foo WHERE days @> ARRAY['M']::dow[];

Or, all Monday, Wednesday, and Friday

SELECT * FROM foo WHERE days @> ARRAY['M','W','F']::dow[];

Or you could make them bools, index them, and then do

SELECT * FROM foo WHERE has_monday AND has_wednesday AND has_friday; 
Evan Carroll
  • 78,363
  • 46
  • 261
  • 468
  • Very interesting. I hadn't heard of a [Bloom index](https://www.postgresql.org/docs/9.6/static/bloom.html) before, and read about it. Would you go with the array or with the seven booleans? I have a mild aversion to the array approach: a distaste for the idea of it being possible to have `['Mon', 'Mon']` in here. Finally, it frustrates me that though this is exactly what `SET` is intended for (to my eyes), it doesn't appear to actually be a good solution! – tremby Apr 10 '17 at 02:06
  • I'm not even sure what `SET()` is. It's not in the SQL spec at all. Seven booleans sounds good though to me. Bloom apparently doesn't support booleans. So you may want to either do 7 btrees, or do one compound btree on all seven and be sure to include all days individually when you query for them. – Evan Carroll Apr 10 '17 at 02:12
  • Can you explain what you mean by "include all days individually when you query them"? `SET` is a bitmap, with each bit representing one of the booleans, and in [MySQL](https://dev.mysql.com/doc/refman/5.7/en/set.html) it is represented when queried on and retrieved as a comma-separated list of strings. I didn't realize until now that PostgreSQL doesn't actually have such a type. – tremby Apr 10 '17 at 02:31
  • Could you drop the link to the docs to MySQL. I have no idea what you're talking about. – Evan Carroll Apr 10 '17 at 02:44
  • I did! I just linked the word "MySQL"; perhaps I should have made it more obvious. But my question in the first part of my comment was not to do with that, only to do with your previous comment. – tremby Apr 10 '17 at 02:46
0

MySQL A SET is implemented as an INT UNSIGNED of some length up to 8 bytes (64 items). The comments on the reference manual have lots of examples. Included are examples of how to treat the SET as the bits it is composed of.

Just as you cannot index "parts" of a number, you cannot really index parts of a SET.

SET('Sun','Mon','Tue','Wed','Thu','Fri','Sat') is notational convenience for a 7-bit number. And using 'Mon,Wed,Fri' to set 3 of the bits is debatably also a notitional convenience. Turning off a bit is really messy unless you think in bits and INTs and powers of 2.

If you don't already understand how binary numbers are composed of bits, then you will probably find SETs to be very hard to use.

There is one case where an INDEX might be worth having -- "covering". That is, an index that contains all the columns mentioned in a SELECT will probably run that SELECT faster. Example:

SELECT item FROM tbl WHERE FIND_IN_SET('Mon', my_set);
-- together with
INDEX(my_set, item)

That index will probably speed up finding the items that include Mondays. Scanning the "covering" index is likely to be faster than scanning the table.

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • I don't find the concepts of bitmaps and bitmasks hard to understand at all. That's not what the question was about. "Turning off a bit is really messy unless you think in bits and INTs and powers of 2" -- this really isn't hard. `update workers set workdays=workdays&~pow(2, 3) where id=1;` will switch off wednesday (where wednesday is index 3, as when starting from sunday). – tremby Apr 10 '17 at 21:37
  • It would be nice, however, if there were an easy way to get from 'Wed' to 3. (`update workers set workdays=workdays&~pow(2, find_in_set('Wed', 'Sun,Mon,Tue,Wed,Thu,Fri,Sat')-1) where id=1;` is not nice!) – tremby Apr 10 '17 at 21:44
  • 1
    Instead of `pow()`, use `<<` operator. – Rick James Apr 11 '17 at 05:20
  • `<<` wouldn't be a drop-in replacement. Sunday would be `pow(2, 0)` which is 1, but `2<<0` is 2, and in MySQL `2<<-1` is 0. What would be your suggested logic here? – tremby Apr 11 '17 at 21:09
  • Oh -- `1<<0`, `1<<1`, etc. Silly me. – tremby Apr 11 '17 at 21:11