1

Let's say I have a table with 10 billion rows. Each row has a datetime column to signify when the row was added to the table. When querying against this table, 99% of the time the query will be based on a date range. Now my question:

Would it be faster to query based on the date time (e.g. where date between), or to query based on a range of the primary key (rowid)? In the second case I could have a table containing ID range values for each date so it would be simple to know what ID range to query for. Regardless of the implementation, would a primary key query like this be faster than the date range query? And if so, would it be significantly faster/take significantly less resources?

Thanks for any help. I'll be running some tests myself as well, just want to get some outside feedback on the matter to avoid confirmation bias.

Branden
  • 237
  • 4
  • 13
  • 1
    Being a 'date' is irrelevant. Being not in an index is very relevant. A primary key (usually) is implicitly an index. See http://stackoverflow.com/questions/1108/how-does-database-indexing-work and https://en.wikipedia.org/wiki/Sargable – user2864740 Aug 02 '15 at 07:23
  • Also, "In the second case I could have a table containing ID range values for each date so it would be simple to know what ID range to query for" is pretty much how a [non-covering] index works - only the database manages all the details including ensuring a good query plan is generated. – user2864740 Aug 02 '15 at 07:28
  • Thank you for your feedback! I had a look at the material you linked and think I get the gist of the use of indexes. Would implementing them be as simple as indexing the datetime field? Or are there further steps I would need to implement to really take advantage of indexes? Also your second comment, is there a way for the database to take care of that kind of index on its own? I planned on basically running a cron job daily to get each day's range of IDs. (which sounds like bad practice to me, but I'm honestly not sure) – Branden Aug 02 '15 at 07:44
  • 1
    Just add an index to the appropriate (ie. 'date') column. Sometimes it is useful to make *covering indexes* (that span multiple columns), but having *any* index is usually a good start. When running a query you can then look at the query plan (see EXPLAIN) to see which indexes, if any, the database decided to use. – user2864740 Aug 02 '15 at 07:47
  • Thanks again! I added an index to a few columns that I query against and then ran an explain on a common query I run. I see the columns I added an index to under 'possible_keys' and null for 'key'. Does that mean it is using the indexes properly? Also, could you throw out an example of when a covering index would be used? I appreciate all the help, moving forward with this large database I'd like to keep everything as efficient as possible so definitely want to learn all the ins and outs of things like indexing. – Branden Aug 02 '15 at 07:58
  • I don't know how to read the MySQL output - I use SQL Server :} A covering index would be one like `(the_date, some_service)`. This effectively adds two things: one the index can cover by multiple axis and, two *if* the index is used for `the_date` then the `some_service` data 'comes for free' without needing to look into the primary record (that is, the data is copied as part of the index itself). – user2864740 Aug 02 '15 at 08:01
  • I see.. so say, for example, I have 2 columns that are often queried against in the table - date and userid. I need to query by both, for a date range that belongs to userid. Would it be a good idea to add a covering index for these two columns? Furthermore, if I have say 3-4 columns that are often queried against, would it be possible to index more than just 2 columns together? – Branden Aug 02 '15 at 08:05
  • Yes, indexes can contain many columns (there is additional associated cost with maintaining them); as long as the 'leftmost columns' of an index are used the columns to the right are available. If/when indices are created with multiple columns (and what columns should be covered or included) depends a good bit on the actual queries used - and if the query planner is capable of using the given indexes efficiently. – user2864740 Aug 02 '15 at 08:17
  • Perfect, I think I'm grasping the idea. Another question, say I add an index on the 2 columns date and userid, would it be fine to have indexes for the individual columns as well? (e.g. the covering index, then a separate index for date, a separate index for userid) Or would that be entirely redundant? – Branden Aug 02 '15 at 08:34
  • If there is an index `(x, y)` then an index `(x)` is redundant (because it is a part of the other index, as read from the left). However an index `(y)` is not redundant (because it has no dependence on `x`), and it may be potential useful (eg. the query *only* searches on `y`). Since updating/maintaining (and even choosing) indices takes some work indices should only be created as they are used/useful. – user2864740 Aug 02 '15 at 08:43
  • Thanks so much for your help. I'll be researching further into this in order to decide how to best use indexing for my needs. Thanks again! – Branden Aug 02 '15 at 08:59

0 Answers0