52

If I'm using soft deletes in Laravel 4.2 (database is MySQL), every Eloquent query has `WHERE deleted_at IS NULL`. There are no indexes on deleted_at.

  • Will it be slow on big tables? (or maybe IS NULL is optimized without needing an index)
  • Should I add an index on the deleted_at column?

So, does Laravel's "soft_delete" deleted_at column need an index in MySQL?

Clarification: Laravel stores a timestamp in the deleted_at column to denote when a record has been soft deleted, as opposed to a boolean value.
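For context, the generated query and the index under discussion look roughly like this (a sketch; the table name `posts` is a placeholder, not from the original question):

```sql
-- Laravel 4.2's SoftDeletingTrait appends this constraint
-- to every query on a soft-deleting model:
SELECT * FROM posts WHERE deleted_at IS NULL;

-- The index being considered:
ALTER TABLE posts ADD INDEX idx_deleted_at (deleted_at);
```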

kjones
  • 1,339
  • 1
  • 13
  • 28
rap-2-h
  • 30,204
  • 37
  • 167
  • 263
  • 2
    It doesn't need an index. If it's null, it isn't deleted. If it has a value, it is deleted. That means it has two possible values in order to work. Columns with two usable values are not good index candidates - their cardinality is converging to 0. Therefore, the deleted_at doesn't need an index. – N.B. Oct 23 '14 at 09:53
  • Ok, thanks. I thought almost every `where` clause needs to be indexed. You can post your comment as an answer if you want. – rap-2-h Oct 23 '14 at 13:29

3 Answers

85

The column deleted_at is not a good index candidate. I'll try to explain this better than in my comment: indexes are useful only when their cardinality is relatively high. Cardinality is a number that describes how unique the indexed values are in the data set: the number of distinct values divided by the total number of records.

For example, the cardinality of a primary key is 1: every record contains a unique value. 1 is also the highest possible value; you can think of it as "100%".

But a column such as deleted_at doesn't have that property. All Laravel does with deleted_at is check whether it is or isn't null, so it effectively has two possible values. Columns with only two usable values have extremely low cardinality, and it decreases further as the number of records grows.

You can index such a column, but it won't be of any help. At best it does nothing; at worst it slows down writes and takes up space.

TL;DR: no, you don't have to index that column; an index there will have no beneficial impact on performance.
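One way to check this claim against your own data is to look at the cardinality MySQL has recorded and at what the optimizer actually chooses (a sketch; `posts` is a placeholder table name):

```sql
-- The Cardinality column shows the distinct-value estimate per index:
SHOW INDEX FROM posts;

-- If the "key" column in the plan is NULL, the optimizer
-- decided the index on deleted_at was not worth using:
EXPLAIN SELECT * FROM posts WHERE deleted_at IS NULL;
```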

N.B.
  • 13,688
  • 3
  • 45
  • 55
  • 4
    Does this mean there is no reason to index a boolean column, or am I misinterpreting? – Zane Mar 16 '16 at 02:37
  • 5
    @Cabloo - you're not misinterpreting, any column that has a small range of unique values is a bad index candidate, even if you have a `tinyint` whose values are `0` and `1`. – N.B. Mar 16 '16 at 12:25
  • 6
    @N.B. From [what I have been reading](http://stackoverflow.com/a/20504263/1190975), it is not about the number of unique values but rather it is about the distribution of values. So if 50% of records are true, it will not affect the query time. But if 5% of records are true, then it will decrease the query time. In other words, it's about "selectivity". In order for an index to be helpful, the search on that index must be relatively rare in the dataset. – Zane Mar 16 '16 at 17:34
  • 1
    @Cabloo we're talking about the same thing really. If the column has a small possible range of values, it means that its "distribution" as you called it - will suck. It doesn't matter whether it's 50% or 5% or 90%. If you have two possible values in data set with 50 million records, index doesn't help you reduce amount of records you have to look at. It's not even about the speed as much it's about wasted memory / storage space. In the long run, an index on low cardinality column isn't going to do much and it won't pay off to index such a column. – N.B. Mar 16 '16 at 19:02
  • @N.B. The link I mentioned in the last post says otherwise (distribution does matter), claiming they specifically tested results before and after adding an index on a boolean column and noticed a drastic increase in speed. – Zane Mar 17 '16 at 03:49
  • 3
    @Cabloo - the person in question mentions table that has only 4 million rows. I've observed 50 million row tables perform well without an index. That one case where they got performance out of indexing a boolean column isn't a sufficient proof. Imagine a table with 100 million records. Only one record contains a `true` value. So yes, doing a query such as `WHERE x = true` will be fast and you will deduce that indexing a boolean column is great. What about `WHERE x = false`? 2 divided by x, where `x > 0 and < infinity` tells you that you will waste space as your data grows. – N.B. Mar 17 '16 at 21:56
  • To conclude this discussion - yes, do index your boolean columns. Measure what happens. It *might* work for low count data sets, but you will trade a lot of memory for insignificant performance gains. At certain points, those indexes will prove as limiting factor. In the end, to each his own. I wish you all the best in your IT professional life :) – N.B. Mar 17 '16 at 21:59
  • @N.B. I decided not to index my columns since my use case was certainly not as specific as the post I linked (my use case is the exact same as the original question here). For the distribution discussion, I was imagining a scenario in which most of the rows in the table are deleted, and a few have `deleted_at = null`. I soon realized that if the most common queries are for `deleted_at = null` columns, AND very few results have `deleted_at = null`, an index would help but it would be much more beneficial to separate the structure into two tables. – Zane Mar 17 '16 at 22:33
  • 1
    It depends on the use-case, in my case, many of my tables check a range of deleted_at for certain reporting, in that case, indexing is necessary. – Szabi Zsoldos Nov 26 '18 at 15:13
  • 1
    @SzabiZsoldos you didn't read carefully. A **soft delete** is when Eloquent, by default, asks the database whether the field is or isn't null for a query such as *fetch all records that aren't deleted*. That's all. If you perform a query that requires a date range, that's an ENTIRELY different scenario and has nothing to do with this question or answer. – Furicane Nov 26 '18 at 22:01
  • 3
    @N.B. - If `deleted_at` is a datetime with lots of different values and rarely NULL, indexing it would be good. – Rick James Dec 13 '18 at 04:13
  • 1
    @ZaneHooper - The distribution _should_ be useful. But MySQL's optimizer has (historically, at least) been rather dumb -- just a single number for cardinality, not a histogram. – Rick James Dec 13 '18 at 04:14
  • @N.B. - MySQL won't use a single-column index on a boolean _unless_ you `FORCE INDEX`. That would greatly help for your example with `true`, and greatly _hurt_ for `false`. Don't `FORCE`. – Rick James Dec 13 '18 at 04:17
10

I don't know why the answer above by @N.B. has so many upvotes; in my context I find it completely incorrect.

I added indexes to the deleted_at timestamps on some key tables and enjoyed some queries dropping from 32 s to less than 5.4 ms. It really depends on the nature of your app.

In my scenario, I have 3 tables with soft deletes and some simple joins (all with indexes), but my queries suffered due to the default way Laravel handles soft deletes.

I highly recommend indexing these columns so your app doesn't choke when record numbers elevate.

Before the migration: 32 s query time. [Screenshots of the three index definitions.] After the migration: 5.4 ms.

danrichards
  • 203
  • 3
  • 4
  • 1
    Both examples are suspect -- You have not shown what indexes there are, especially the ones that would be used for `JOINing` -- which would not be on `deleted_at`. And the `GROUP BY` factors in heavily. Please show both `SHOW CREATE TABLE` and `EXPLAIN SELECT ...` – Rick James Dec 13 '18 at 04:22
  • 1
  • Dan, have you read the answer and tried to understand it? It seems that neither you nor Rick James, for that matter, knows how a b-tree works. Also, have you cleared all the caches before running the query? Did you make sure that `innodb_buffer_pool` contains the data? You made a completely false assumption here - the first query, the 32 sec one, hasn't got any data in the buffer pool. Once executed, it fills it up. Your second query uses the in-memory data now. You falsely assumed it's because of indexing. – Mjh Dec 17 '18 at 13:45
  • Those are dubious queries -- They test for _any_ variant being NULL. That can be done much more efficiently with `EXISTS` and getting rid of the `GROUP BY`. – Rick James Dec 17 '18 at 15:37
  • @Mjh - The core point is whether the Optimizer will even use the BTree-organized index. Secondarily, if it must load the entire index (or table), then 5 seconds may be reasonable for the millions of cached records to look at. – Rick James Dec 17 '18 at 15:40
  • @RickJames the cardinality is included in the screenshot, 27006, 35449, 34037 for products, variants and skus respectively. My bad for not including the header specifying it. All the indexes are PRIMARY KEY or KEY for the primary keys and foreign keys under concern. I had tried the queries multiple times, which I think would have allowed for the db to do any intelligent caching under the hood. I know what a binary tree is but I'm not an expert on innodb internals. There isn't any external caching / factors affecting the times. This is production data I imported into my local environment. – danrichards Dec 18 '18 at 22:49
  • 2
    I dropped the indexes, which had been running the query at 2.7ms and re-ran the query multiple times. I got 167s, 142s, and 151s. The `innodb_buffer_pool` is not empty, and I assume there was some optimizing as there was a slight increase? There is a a `deleted_at` IS NULL check on 3 tables in this query, if I remove the where clause for the IS NULL check, the query fires in 27.5ms without deleted_at indexes. So clearly those indexes bring a massive benefit when in the criteria. I don't know the consequence of adding too many deleted_at indexes, but I don't think I care after this experience. – danrichards Dec 18 '18 at 23:06
  • @danrichards - It gets complex. (1) `deleted_at` is rarely null -- index will be useful. (2) it is often NULL and the Optimizer _does not_ use the index -- potentially lengthy table scan. (3) often NULL and the Optimizer _does_ use the index -- could have costly bouncing back and forth between the BTree for the secondary index and the data BTree. `EXPLAIN SELECT ...` will _partially_ explain what will happen. – Rick James Dec 19 '18 at 02:50
  • So if I got this right, what you did is this: you ran 2 queries, both of which affect 15 rows ultimately, on a dataset of unknown size, on a hard disk of unknown specs. You see, if you make these assumptions blindly without taking care of what you run the test on (hardware), you'll conclude that indexing a 2-valued column is wise and you'll produce false positives, ultimately teaching other people **wrong**. After the DBA whose life you made worse beats you up to a pulp when the DB grows to billions of records, you'll return to what I wrote and you'll read it carefully :) – N.B. Dec 26 '19 at 16:16
  • @N.B. I'm not familiar with Laravel 4.2 but in newer versions `deleted_at` is a timestamp, not a boolean as your post states. So if `deleted_at` is a timestamp, (but even so the most common queries on it are still whether it is NULL or NOT NULL) what is the best practice for indexing? – chiliNUT Jan 06 '20 at 17:49
  • @chiliNUT my post didn't state it **is** a boolean. I wrote this: *What Laravel does with deleted_at is check whether it is or isn't null. That means it has two possible values.*. Laravel uses a `WHERE deleted_at IS NULL` query. It checks for that. Just like I wrote in the original post. Therefore, it does not matter what column type it is because the database will perform a simple **boolean** check. – N.B. Jan 08 '20 at 00:02
  • I can confirm that adding an index to the deleted_at column does increase query speed: before, it was taking around ~25s; after adding the index, the query began taking around ~5-6s. The accepted answer is misleading; adding an index where it's appropriate is always better than not using it at all. – eldorjon Jul 18 '22 at 12:13
7

Short Answer: Maybe.

Long Answer:

If there are very few different values in deleted_at, MySQL will not use INDEX(deleted_at).

If there are a lot of different non-null dates in deleted_at, MySQL will use INDEX(deleted_at).

Most of the discussion (so far) has failed to bring the cardinality of this single-column index into consideration.

Note: This is not the same as a 2-value flag such as is_deleted. It is useless to have a single-column index on such.

More discussion (from MySQL point of view)

https://laravel.com/docs/5.2/eloquent#soft-deleting says

Now, when you call the delete method on the model, the deleted_at column will be set to the current date and time. And, when querying a model that uses soft deletes, the soft deleted models will automatically be excluded from all query results.

From that, I assume this is occurring in the table definition:

deleted_at  DATETIME  NULL  -- (or TIMESTAMP NULL)

And the value is initialized (explicitly or implicitly) to NULL.

Case 1: Lots of new rows, none yet 'deleted': All the deleted_at values are NULL. In this case, the Optimizer will shun INDEX(deleted_at) as not helping. In fact using the index would hurt because it would cost more to go through the entire index and the data. It would be cheaper to ignore the index and simply assume all rows are candidates for being SELECTed.

Case 2: A few rows (out of many) have been deleted: Now deleted_at has multiple values. Although Laravel only cares about IS NULL vs IS NOT NULL, MySQL sees it as a multi-valued column. But, since the test is for IS NULL and most rows are still NULL, the Optimizer's reaction is the same as for Case 1.

Case 3: A lot more rows are soft-deleted than still active: Now the index has suddenly become useful because only a small percentage of the table IS NULL.

There is no exact cutoff between Case 2 and Case 3. 20% is a handy Rule of Thumb.
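The cases above can be distinguished on a live table with EXPLAIN; the following is a sketch (`posts` is a placeholder table name, and the exact plan columns vary by MySQL version):

```sql
-- Refresh index statistics first if the table contents changed a lot:
ANALYZE TABLE posts;

-- Cases 1 and 2 (most rows still NULL): expect type = ALL, key = NULL
-- Case 3 (few rows NULL):               expect type = ref, key = the deleted_at index
EXPLAIN SELECT * FROM posts WHERE deleted_at IS NULL;
```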

Now, from the execution point of view.

INDEX(deleted_at) used for deleted_at IS NULL:

  1. Drill down the Index BTree for the first row with NULL.
  2. Scan until IS NULL fails.
  3. For each matching row, reach over into the data BTree to get the row.

INDEX(deleted_at) is not used:

  1. Scan the data BTree (or use some other index)
  2. For each data row, check that deleted_at IS NULL, else filter out that row.

Composite index:

It may be very beneficial to have a "composite" (multi-column) index starting with deleted_at. Example:

INDEX(deleted_at, foo)

WHERE deleted_at IS NULL
  AND foo BETWEEN 111 AND 222

This is very likely to use the index effectively regardless of what percentage of the table has deleted_at IS NULL.

  1. Drill down the Index BTree for the first row with NULL and foo >= 111.
  2. Scan until IS NULL or foo <= 222 fails.
  3. For each matching row, reach over into the data BTree to get the row.
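A concrete definition for such a composite index might look like this (a sketch; the table name `posts` and index name are illustrative, `foo` is the column from the example above):

```sql
ALTER TABLE posts
  ADD INDEX idx_deleted_foo (deleted_at, foo);

-- Both predicates can then be satisfied from the index itself:
EXPLAIN SELECT * FROM posts
WHERE deleted_at IS NULL
  AND foo BETWEEN 111 AND 222;
```

Putting deleted_at first works here because the query tests it with equality-like `IS NULL`, leaving the range condition on foo as the second, rightmost part of the index.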

Note that in an INDEX, NULL acts very much like any other single value. (And NULLs are stored before other values.)

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • 2
    @Mjh - The effect of a MySQL optimization depends on the cardinality that is apparent to MySQL, not the _intent_ that Laravel gives it. If `deleted_at` actually has either `NULL` or lots of different `TIMESTAMPs`, then MySQL fails to see it as a boolean. Instead, it (falsely) assumes the different values are evenly distributed. – Rick James Jul 13 '19 at 04:25