-2

This is a question from leetcode, using the second query I got the question wrong but could not identify why

SELECT
    user_id,
    max(time_stamp) as "last_stamp"
from
    logins
where
    year(time_stamp) = '2020'
group by
    user_id

and

select
    user_id,
    max(time_stamp) as "last_stamp"
from
    logins
where
    time_stamp between '2020-01-01' and '2020-12-31'
group by
    user_id
Dai
  • 141,631
  • 28
  • 261
  • 374
  • What is the **exact type** of the `time_stamp` column? – Dai Jul 11 '22 at 03:38
  • Tip: `WHERE YEAR(time_stamp) = '2020'` should be `WHERE YEAR(time_stamp) = 2020` because the `YEAR()` function returns an `int` value, not a string. The query works because `'2020'` can be implicitly converted to an `int`, but it's still suboptimal and a misbehaving query-planner might produce a non-sargable predicate (by converting `YEAR()`'s return-value to a string for every row instead of converting `'2020'` to an `int` only _once_. – Dai Jul 11 '22 at 03:40
  • Hi @Dai the time_stamp column is in datetime. – Marcus Chan Jul 11 '22 at 03:42
  • It's `datetime` and _not_ the `TIMESTAMP` type? Curious... What version of MySQL are you using? – Dai Jul 11 '22 at 03:43
  • Thank you for your reply! However, I'm still not sure why the second query does not work – Marcus Chan Jul 11 '22 at 03:43
  • I'm not sure as I'm doing this on leetcode – Marcus Chan Jul 11 '22 at 03:43
  • 1
    I think you need to define 'does not work'. Exactly what does leetcode say is wrong here? – Tangentially Perpendicular Jul 11 '22 at 03:44
  • 1
    The reason is because `2020-12-31` is a **date** literal, not a datetime literal, but its converted to datetime with a zero time-of-day component, and `BETWEEN` is inclusive - so your second query currently excludes all values _after_ midnight on `2020-12-31` and before `2021-01-01 00:00:00`. – Dai Jul 11 '22 at 03:45
  • To produce identical results, the 2nd should be `time_stamp between '2020-01-01 00:00:00' and '2020-12-31 23:59:59'` – kmoser Jul 11 '22 at 03:45
  • @kmoser That will still exclude events at `2020-12-31 23:59:59.001` - it's better and simpler to use `< '2021-01-01 00:00` instead. – Dai Jul 11 '22 at 03:46
  • @Dai I thought `DATETIME` columns only stored accuracy to one second by default. To store fractional second accuracy, that needs to be [specified in the schema](https://dev.mysql.com/doc/refman/5.7/en/date-and-time-type-syntax.html). So my method should work if it's a default `DATETIME`, otherwise you're right. In either case, your method is better because it works regardless of second accuracy :) – kmoser Jul 11 '22 at 03:50
  • 2
    @kmoser Yes, buuuut if the column type were to change to one with a higher-precision then the query would break, but by using `'2020-01-01' <= time_stamp AND time_stamp < '2021-01-01'` protects the query from schema/type-changes _and_ is automatically sargable too, so should have overall better performance. – Dai Jul 11 '22 at 03:53
  • One more difference - 2nd query may use index by `time_stamp` rather than 1st query. – Akina Jul 11 '22 at 04:42

1 Answers1

1

The first query uses a function on every row to extract the year (an integer) and compares that to a string. (It would be preferable to use an integer instead.) Whilst this may be sub-optimal, this query would accurately locate all rows that fall into the year 2020.

The second query could fail to locate all rows that fall into 2020. Here it is important to remember that days have a 24 hour duration, and that each day starts at midnight and concludes at midnight 24 hours later. That is; a day does have a start point (midnight) and an end-point (midnight+24 hours).

However a single date used in SQL code cannot be both the start-point and the end-point of the same day, so every date in SQL represents only the start-point. Also note here, that between does NOT magically change the second given date into "the end of that day" - it simply cannot (and does not) do that.

So, when you use time_stamp between '2020-01-01' and '2020-12-31' you need to think of it as meaning "from the start of 2020-01-01 up to and including the start of 2020-12-31". Hence, this excludes the 24 hours duration of 2020-12-31.

The safest way to deal with this is to NOT use between at all, instead write just a few characters more code which will be accurate regardless of the time precision used by any date/datetime/timestamp column:

where
    time_stamp >= '2020-01-01' and time_stamp <'2021-01-01' 

with the second date being "the start-point of the next day"

See answer to SQL "between" not inclusive

Paul Maxwell
  • 33,002
  • 3
  • 32
  • 51