
I have the following SQL. It's taking about 95 seconds to execute. There are approx 25 million records in the table.

SET @lat=(select latitude from skoovy_prd.pins where user_id=0 and board_id=0 limit 1);
SET @lng=(select longitude from skoovy_prd.pins where user_id=0 and board_id=0 limit 1);
SELECT category_id, MAX(pin_id), pin_id
FROM skoovy_prd.pins
WHERE ( 3959 * acos( cos( radians(@lat) ) * cos( radians( latitude ) ) 
* cos( radians( longitude ) - radians(@lng) ) + sin( radians(@lat) ) * sin(radians(latitude)) ) ) <=25
GROUP BY category_id DESC
LIMIT 12;

category_id, latitude, longitude, and pin_id each have a BTREE index.

Is there a more efficient way to write this so I can get records back much faster? The purpose is to get a record set where each record is a distinct category. I got the SQL here after posting this question: mysql selecting records but ensuring data in one column is distinct, which was marked as a duplicate of Retrieving the last record in each group.

There was a solution provided by newtlover in the list of answers which led me to the SQL I have written and posted here. (Even though I'm not really looking for the last record in each group, it at least gets me records where the category_id is distinct in the recordset.)

I'm hoping there's a way to improve performance on this query. And if anyone has any suggestions to get around the whole last record in each group, that'd also be appreciated. I am NOT a SQL person by any means, so I'm grasping at straws here.

kambythet
  • Are you sure this should be `GROUP BY category_id DESC`? – andy Sep 19 '14 at 22:22
  • With or without DESC the performance is the same, and it produces correct results – kambythet Sep 19 '14 at 22:50
  • You probably get faster results when using `SELECT DISTINCT category_id, pin_id` without grouping. I'm not sure whether that provides the result you expect, but the query can be finished after 12 hits, unlike the `GROUP BY` version, where all records with the found category_ids need to be considered for `MAX(pin_id)`. – andy Sep 19 '14 at 23:01
  • I tried that and MySQL's DISTINCT doesn't offer support to get the results as expected. I get records that contain duplicate category ids – kambythet Sep 19 '14 at 23:08

2 Answers


You can't expect a query to make use of an index if you reference the indexed columns deep inside expressions. That defeats the index, because the optimizer has no way of knowing whether the result of the expression has the same sort order as the index.

Distance formulas are especially difficult to optimize with B-trees, because the B-tree is sorted primarily along one axis.

The point being that your WHERE clause has to evaluate the expensive trig functions on all 25 million rows, instead of being able to reduce the result set by using the index.

One solution is to use a bounding box to reduce the scope of the search. That is, if you know @lat, you can add a range condition like WHERE latitude BETWEEN @lat-0.36 AND @lat+0.36 AND ...trig expression... (one degree of latitude is about 69 miles, so a 25-mile radius is roughly 25/69 ≈ 0.36 degrees). Because the range condition references the indexed latitude column directly, the optimizer can use the index to narrow the candidate rows before the expensive trig expression has to be evaluated against each one.
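As a sketch of the bounding-box idea, using the table and variables from the question (the 0.36-degree latitude pad is an assumption for a 25-mile radius; the longitude pad must be wider and actually varies with latitude, so treat both numbers as illustrative):

-- Sketch: pre-filter on the indexed latitude column, then apply the
-- exact great-circle test only to the surviving rows.
-- 25 miles / ~69 miles per degree of latitude ≈ 0.36 degrees.
SELECT category_id, MAX(pin_id)
FROM skoovy_prd.pins
WHERE latitude  BETWEEN @lat - 0.36 AND @lat + 0.36
  AND longitude BETWEEN @lng - 0.45 AND @lng + 0.45  -- pad grows with latitude
  AND ( 3959 * ACOS( COS(RADIANS(@lat)) * COS(RADIANS(latitude))
      * COS(RADIANS(longitude) - RADIANS(@lng))
      + SIN(RADIANS(@lat)) * SIN(RADIANS(latitude)) ) ) <= 25
GROUP BY category_id DESC
LIMIT 12;

Only the rows inside the rectangle ever reach the ACOS/COS/SIN math, so the trig runs on thousands of rows instead of 25 million.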

Unfortunately you can't use a single B-tree lookup to filter on both latitude and longitude simultaneously, even if you use a compound index. Think about this: I ask you to look up names in a phone book, for anyone whose last name begins with "S" and whose first name begins with "J". The phone book is like an index on lastname, firstname, but the firstnames are not sorted together. You end up having to search all the "S" lastnames, as if you had only that column indexed.

There are other technologies besides B-trees that make these kinds of multidimensional searches easier. One is Sphinx Search. See An introduction to distance-based searching in Sphinx.

Another is to use the built-in spatial features of MySQL 5.6, but a spatial index is available only if you store the data in MyISAM (which I usually recommend against using).
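A minimal sketch of the spatial approach, assuming MySQL 5.6 and a MyISAM copy of the table (here called pins_geo, a hypothetical name; the DDL may need adjusting for a table that already has rows, since a SPATIAL INDEX requires a NOT NULL POINT column):

-- Sketch (MySQL 5.6, MyISAM): index a POINT column and let MBRContains
-- do an index-assisted bounding-rectangle test before any exact math.
ALTER TABLE pins_geo
  ADD COLUMN coords POINT NOT NULL,
  ADD SPATIAL INDEX (coords);

UPDATE pins_geo SET coords = POINT(longitude, latitude);

SET @d = 25 / 69;  -- rough degree pad for a 25-mile radius
SELECT category_id, MAX(pin_id)
FROM pins_geo
WHERE MBRContains(
        Envelope(LineString(Point(@lng - @d, @lat - @d),
                            Point(@lng + @d, @lat + @d))),
        coords)
GROUP BY category_id
LIMIT 12;

MBRContains is a minimum-bounding-rectangle test, so for exact results you would still apply the great-circle distance formula to the rows it returns.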

See Alexander Rubin's excellent resources on geospatial searches in MySQL.

Bill Karwin
  • Thank you so much, Bill. I've seen your other posts and you are always helpful. We are using Sphinx, btw, but for what I need to do I can't utilize it due to max record limitations in Sphinx. But everything else, I'm going to take some time reading, and I might look at the spatial functions (I have a table already that has a POINT column with the point values for the corresponding lat/long coords). – kambythet Sep 19 '14 at 23:27
  • I would use MyISAM table only to store a *copy* of the data for indexing purposes, while the original data is safely in a transactional storage engine. – Bill Karwin Sep 20 '14 at 00:02
  • I will definitely do that. I know MyISAM is much faster on queries such as this, so I'm going to give that a shot as well. – kambythet Sep 20 '14 at 00:04
  • Spatial indexing will finally be supported in InnoDB in MySQL 5.7 according to some dev blog posts I have seen. – John Powell Sep 20 '14 at 06:39
  • @JohnBarça, thanks for the tip! Do you have a link to one of those blog posts? – Bill Karwin Sep 20 '14 at 19:31
  • @BillKarwin. http://mysqlserverteam.com/innodb-spatial-indexes-in-5-7-4-lab-release/ is the one I came across. I'm not sure what LAB release means, though I'm sure you will. This really will be a valuable enhancement that has been a long time in coming. – John Powell Sep 21 '14 at 07:18

The math is causing a full table scan each time. If you have the possibility to store its result, e.g. via a cron job, then you should do that. Another way would be to add some other indexed condition before the math, to reduce the number of examined rows.

matrixx