1

I am working on an application that needs to run SQL queries to see whether certain locations fall into jurisdictions which are defined by polygons. These jurisdictions are updated quarterly, meaning I also need to check them against whatever date I'm checking it on. I'm using the mySQL function MBRContains to help narrow down the jurisdictions I'm looking at, making it easier for my application to then check if the given point lands in any of them. Right now, my queries look like this:

SELECT DISTINCT t0.id FROM jurisdiction t0
WHERE t0.beginDate <=  '2017-08-05' AND t0.endDate >= '2017-08-05'
AND MBRContains(t0.geometry,GeomFromText("POINT(48.0 -120.0)"));

I'm fairly sure that MBRContains is a more expensive operation than checking the date, but I haven't worked with SQL much and I'm unsure how to make it so the jurisdictions will be filtered by date first, then checked against MBRContains. How can I do this? Are there other optimizations I could do while achieving the same results with this query?

user3726962
  • 333
  • 1
  • 4
  • 17
  • Not sure if it is a typo, but those are bad dates. For one the date fields should be of the `DATE` data type; for another, even if you stick with string values, the strings need to be formatted in a way that those comparisons actually make sense (an end date of '06-08-2001' is after '05-08-2017', similarly a begin date of '04-08-2020' is before '05-08-2017') – Uueerdo Sep 08 '17 at 19:01
  • @Uueerdo The dates and the point are examples. I've checked that it's passing in the right dates and filtering the results correctly, it's just that the queries take too long. – user3726962 Sep 08 '17 at 19:04
  • If you use indexes on beginDate, endDate and geometry it will accelerate your access. – hackela Sep 08 '17 at 19:05
  • 1
    @hackela an index on geometry is unlikely to help since condition is actually on the result of a function call to it. – Uueerdo Sep 08 '17 at 19:06
  • Any SQL optimization question should include `SHOW CREATE TABLE` for your referenced table, and `EXPLAIN` for your query to optimize. – Bill Karwin Sep 08 '17 at 19:07
  • @user3726962 if the real date values used are not in "YYYY-MM-DD" format, the results will not be filtered correctly. – Uueerdo Sep 08 '17 at 19:07
  • @Uueerdo They are in this format in the table and in the queries, I just made a typo when choosing an example in my question. I'll update it to avoid confusion. – user3726962 Sep 08 '17 at 19:08
  • "filtered by date first, then checked against MBRContains" - order of conditions in your query is correct. If the query take too long, you can try indexes on beginDate and endDate like @hackela suggested. – cuong hoang Sep 08 '17 at 19:08
  • What is the query plan (estimated and/or actual) telling you? – pmbAustin Sep 08 '17 at 19:09
  • 1
    @cuonghoang - The optimizer does not look at the order of `ANDs`; it will reorder as it sees fit. – Rick James Sep 08 '17 at 20:21

3 Answers3

2

I would add an index on (beginDate, endDate). One index, on both fields, not separate indexes.

Additionally, if the date supplied is always a single date, changing the date part of the condition to '2017-08-05' BETWEEN t0.beginDate AND t0.endDate might help further (but probably not).

Also, with similar questions in the past I've suggested using additional conditions with simpler bounds checking that can also benefit from indexes; but those usually involved situations where a larger bounding box could be used. The pseudo condition would be like "WHERE in bounding box AND in bounding area"; "in bounding box" eliminating far out points using an index, reducing the number of points that the more complicated "in bounding area" is checked against.

Uueerdo
  • 15,723
  • 1
  • 16
  • 21
1

If MBRContains are GeomFromText are deterministic functions and not flagged as such, then doing so may help.

A deterministic function is one that always returns the same value with the same input. UPPER() is deterministic because it always gives you the same output for the same input. This means the optimizer can make shortcuts if it knows that an intermediary value will never change.

For more about deterministic functions:

(Now that I search around, I see that those functions are supplied as part of the database, not ones you wrote, so you won't have control over them. Still, I'm leaving the answer here as an example of how to potentially speed up SQL that calls functions.)

Andy Lester
  • 91,102
  • 13
  • 100
  • 152
  • No. `DATE(col)` is deterministic, but the Optimizer will _not_ use `INDEX(col)` for `WHERE DATE(col) = CURDATE()`. Nor any other shortcuts. It punts. (OK, it will evaluate `CURDATE` only once.) – Rick James Sep 08 '17 at 20:23
1

Checking a range like you are doing is something that does not Optimize well. The best you can get is to scan half the table. And that is problematic.

Effectively you are stuck with scan. Then the question of what part of the WHERE will evaluated first turns out to be minor. This is because fetching a row is a lot more costly than almost any function that might be in the WHERE clause.

Did you try a SPATIAL index on geometry? That may help significantly. If so, the rest of this Answer is moot.

As already mentioned, a "bounding box" is a good first attempt:

WHERE x BETWEEN ...
  AND y BETWEEN ...

INDEX(x),
INDEX(y)

(No, INDEX(x,y) does not work any better.)

If you are talking about a huge dataset, then you might need a more complex solution.

Rick James
  • 135,179
  • 13
  • 127
  • 222