
The following shortened query selects all rows (entries) within a given distance, calculated either from the user's location (e.altloc = 0) or from an alternate location (e.altloc = 1).

I have indexes on e.uid, al.eid, e.country, and e.tmstmp, and the ids are primary keys.

The problem: according to EXPLAIN, all rows need to be processed for the query, instead of the 2 rows I would like via LIMIT 2.

I have read this question, but I'm not able to apply the LIMIT before the join, because I need to join the location tables before I can do the LIMIT 2, otherwise the result would be wrong. https://dba.stackexchange.com/questions/52079/does-using-limit-improve-the-performance-and-is-it-noticeable

The query:

SELECT 
        e.id, e.uid, e.title, e.description, l.place, l.placenonce, al.altplace, al.altplacenonce,
        IF(e.altloc=0,
            6371 * acos( cos( radians(:lat) ) * cos( radians( AES_DECRYPT(lat, UNHEX('###'), latnonce) ) ) * cos( radians( AES_DECRYPT(lng, UNHEX('###'), lngnonce) ) - radians(:lng) ) + sin( radians(:lat) ) * sin(radians(AES_DECRYPT(lat, UNHEX('###'), latnonce))) ) ,
            6371 * acos( cos( radians(:lat) ) * cos( radians( AES_DECRYPT(altlat, UNHEX('###'), altlatnonce) ) ) * cos( radians( AES_DECRYPT(altlng, UNHEX('###'), altlngnonce) ) - radians(:lng) ) + sin( radians(:lat) ) * sin(radians(AES_DECRYPT(altlat, UNHEX('###'), altlatnonce))) )
        ) AS distance
    FROM 
        entrys e 
    INNER JOIN 
        location l 
        ON l.id = e.uid 
    LEFT JOIN
        altlocation al
        ON al.eid = e.id
    WHERE 
        IF(:border = 0, e.country = :countryid, e.country != 0 )    
    HAVING 
        distance <= 50
    ORDER BY 
        e.tmstmp 
    DESC
    LIMIT 2

A second example with a fixed location:

SELECT 
    s.id, s.image, s.description, s.title,      
    ( 
        6371 * acos( cos( radians(:lat) ) * cos( radians( AES_DECRYPT(l.lat, :key, l.latnonce) ) ) * cos( radians( AES_DECRYPT(l.lng, :key, l.lngnonce) ) - radians(:lng) ) + sin( radians(:lat) ) * sin(radians(AES_DECRYPT(l.lat, :key, l.latnonce))) ) 
    ) AS distance
FROM 
    sponsors s 
INNER JOIN 
    location l 
    ON l.id = s.id 
WHERE 
    (s.comp = 1 OR s.comp = 3) AND s.active = 1
HAVING 
    distance <= 50
ORDER BY
    s.rotate
ASC
LIMIT 2

How can I improve these location-based queries if there are millions of rows in my database? I need to output only 2 rows for each query.

Create table for the first example:

CREATE TABLE `entrys` (
 `id` int(11) NOT NULL AUTO_INCREMENT,
 `uid` int(5) NOT NULL,
 `tmstmp` bigint(11) NOT NULL,
 `approx_lat` mediumint(9) NOT NULL,
 `approx_lng` mediumint(9) NOT NULL,
 `altloc` tinyint(4) NOT NULL,
 `title` varchar(70) COLLATE latin1_general_ci NOT NULL,
 `description` text COLLATE latin1_general_ci NOT NULL,
 `country` tinyint(4) NOT NULL,
 PRIMARY KEY (`id`),
 KEY `uid` (`uid`),
 KEY `tmstmp` (`tmstmp`),
 KEY `country` (`country`)
) ENGINE=MyISAM CHARSET=latin1 COLLATE=latin1_general_ci

CREATE TABLE `location` (
 `id` int(5) NOT NULL,
 `lat` varbinary(50) NOT NULL,
 `latnonce` varbinary(25) NOT NULL,
 `lng` varbinary(50) NOT NULL,
 `lngnonce` varbinary(25) NOT NULL,
 `place` tinyblob NOT NULL,
 `placenonce` tinyblob NOT NULL,
 UNIQUE KEY `id` (`id`),
 KEY `lat` (`lat`),
 KEY `lng` (`lng`)
) 

CREATE TABLE `altlocation` (
 `id` int(5) NOT NULL,
 `eid` int(5) NOT NULL,
 `altlat` varbinary(50) NOT NULL,
 `altlatnonce` varbinary(25) NOT NULL,
 `altlng` varbinary(50) NOT NULL,
 `altlngnonce` varbinary(25) NOT NULL,
 `altplace` tinyblob NOT NULL,
 `altplacenonce` tinyblob NOT NULL,
 UNIQUE KEY `eid` (`eid`),
 KEY `altlat` (`altlat`),
 KEY `altlng` (`altlng`)
)

Side note: the engine for entrys should probably be InnoDB, having ~70% reads. The location tables are both already running on InnoDB.

EDIT (question to Willem Renzema regarding his answer):

Would it be more efficient like this?

SELECT 
        e.id, e.uid, e.title, e.description, l.place, l.placenonce, al.altplace, al.altplacenonce,
        IF(e.altloc=0,
            6371 * acos( cos( radians(:lat) ) * cos( radians( AES_DECRYPT(lat, UNHEX('###'), latnonce) ) ) * cos( radians( AES_DECRYPT(lng, UNHEX('###'), lngnonce) ) - radians(:lng) ) + sin( radians(:lat) ) * sin(radians(AES_DECRYPT(lat, UNHEX('###'), latnonce))) ) ,
            6371 * acos( cos( radians(:lat) ) * cos( radians( AES_DECRYPT(altlat, UNHEX('###'), altlatnonce) ) ) * cos( radians( AES_DECRYPT(altlng, UNHEX('###'), altlngnonce) ) - radians(:lng) ) + sin( radians(:lat) ) * sin(radians(AES_DECRYPT(altlat, UNHEX('###'), altlatnonce))) )
        ) AS distance
    FROM 
        (
            SELECT id, uid, title, description
            FROM 
                entrys 
            WHERE 
                    approx_lat > :min_lat
                AND approx_lat < :max_lat
                AND approx_lng > :min_lng
                AND approx_lng < :max_lng
            ORDER BY 
                tmstmp 
            DESC
            LIMIT 2
            
        ) AS e
    INNER JOIN 
        location l 
    ON l.id = uid 
    LEFT JOIN
        altlocation al
    ON al.eid = e.id
    HAVING 
        distance <= 50

This assumes I add approx_lat and approx_lng to the entrys table. The idea would be that, with approx_lat and approx_lng moved to the entrys table, I could insert into altlocation OR location only, so I could get rid of the IF inside the query.

Is `HAVING distance <= 50` still necessary?

delato468
  • where is your LIMIT clause in the given query ? – Madhur Bhaiya Sep 11 '18 at 12:53
  • In this case you cannot use limit as an optimisation technique. It does happen. However, if you did explain your problem better with sample data and expected results, then we may be able to help. – Shadow Sep 11 '18 at 12:54
  • @MadhurBhaiya sorry, I forgot to add it in this example code. – delato468 Sep 11 '18 at 12:55
  • @Shadow which kind of sample data should I provide? What's inside the rows? As an example, an entry could be a message created by a user (with title, description, ...) with a fixed or unfixed location. Other users can look up 2 entries within the given distance from their fixed location. – delato468 Sep 11 '18 at 13:07
  • @Shadow a fixed location would always be the home location; an unfixed location would be a location on the way, given by the user. Entries are ordered by timestamp. – delato468 Sep 11 '18 at 13:14
  • You do not have to provide actual data, but you need to provide a sample that is representative of your problem. – Shadow Sep 11 '18 at 16:27
  • @Shadow I don't exactly understand what you mean. Everything important is in my question. Additionally, I could show the EXPLAIN plan with the LIMIT clause, where it shows all rows are selected to execute the query, but the EXPLAIN plan does not work well together with LIMIT. The query above is shortened and reworked to show the problem, and I provided enough text to describe the problem well. (2 location calculation joins where only one of them limits the data, a simple WHERE, and the LIMIT 2, which would not work efficiently. -> The question is how I can improve the query / use LIMIT efficiently.) – delato468 Sep 12 '18 at 06:06
  • That your question is perfect is supported by the large number of fantastic answers received, right? :) – Shadow Sep 12 '18 at 07:02
  • @Shadow only 17 visits, and 3 of them would be you and 5 me xD. If you tell me exactly what I should provide, I will do it ^^ I can bring a similar query example, but the main principle is always the same. – delato468 Sep 12 '18 at 07:22
  • @delato468 Please add the output of `SHOW CREATE TABLE tablename` for each of the tables in your queries. – Willem Renzema Sep 12 '18 at 12:48
  • @WillemRenzema of course, I will add it to the question as an edit. – delato468 Sep 12 '18 at 13:01
  • @delato468 Is having unencrypted values for the `lat`, `lng`, `altlat` and `altlng` an option? The use of `varbinary` columns that have to be decrypted, rather than having simple decimal values, is going to destroy your performance. – Willem Renzema Sep 12 '18 at 13:58
  • @WillemRenzema well, it's a yes/no case, because these values are truncated. The real location is encrypted with libsodium and used only if necessary. That way the lat and lng are inaccurate, and I probably could store them unencrypted, but I'm not sure if that's a good decision. Most likely I would also encrypt those, but if there is no other way oO. For example, instead of lat: 54,23453 I store 54,234 in there. – delato468 Sep 12 '18 at 14:11
  • That way the real location can differ within ~100m x 100m; if I truncated down to the second decimal place, ~600m x 600m. The problem is that in some cases there can be a single house in that 100x100m radius, and a bigger radius would end up in inaccurate searches. – delato468 Sep 12 '18 at 14:38
  • The `LIMIT 2` in the inner query will speed it up, but give you the "wrong" answer. – Rick James Sep 30 '18 at 02:01
  • @Rick James that's true, so in the meantime I did a LIMIT 20 in the inner query and a LIMIT 2 in the outer (the chance is small that all 20 entries are over 50 distance - 10%; I could improve that by limiting to 100, but at the moment I like to test with 20). I think that will most likely solve the problem. The speedup is enormous, and together with a time check (tmstmp < and >) it will be a must for a big database. – delato468 Oct 01 '18 at 06:15
  • Some day you will find that even 20 leads to zero results. This can happen when all the hits are hiding in the corners of the bounding box. Don't accept performance without correctness. – Rick James Oct 01 '18 at 16:34
  • @Rick James that's true, but the chance is very small. Also, I could increase the value to 100 or 500. At this point it's not possible to receive 0 values, because even though I'm selecting the whole circle, the area which gets excluded is only 20%. The chance of an error with a LIMIT of 100 or 500, when I only need 2 entries, is very small. For boosting server performance by 500% it's a good deal, in my opinion. A bigger problem is this: https://stackoverflow.com/questions/52309258/could-i-store-user-geolocation-data-lat-lng-unencrypted-but-cutted-on-my-databas - no one can give me a good answer there. – delato468 Oct 01 '18 at 20:09
  • I could also remove the HAVING distance, so the LIMIT 2 would work exactly. There is no other way to limit output and improve performance, or else please tell us. Even though I'm using the bounding box and a last-year clause, there can still be 1 million rows selected in big cities. As I said, some of these queries have to run on each site refresh. This would quickly end up in high load. With the 100 or even 500 limit, this query wouldn't hurt a bit. – delato468 Oct 01 '18 at 20:15
  • @delato468 - I thought Willem's answer _without_ the subquery and with the distance check in the HAVING was more efficient, etc. Did you try it? – Rick James Oct 01 '18 at 20:43
  • `approx_lat` is nowhere to be found, yet it is used in the subquery. Please fix this error. After that, I think I can propose some improvements in the use of the subquery. – Rick James Oct 01 '18 at 20:47
  • @RickJames how can it be more efficient? For sure I tried it. It selects all rows in the given area. The outer LIMIT 2 is worth nothing for performance: all rows which match the WHERE clause are selected. When there are 1 million rows in this area, all are needed to process the query, but with the subquery ONLY the rows which match the limit get selected. So 20 or 100 or 500, whatever I set there. If there are fewer rows, it's no problem, but if there are millions of rows the subquery limit increases the performance immensely. – delato468 Oct 02 '18 at 06:49
  • Have you added `INDEX(approx_lat), INDEX(approx_lng)`? – Rick James Oct 02 '18 at 16:09
  • @RickJames yes. You can try it yourself: if there are 1 million rows in an area of 250km², then all of them need to be processed, because all rows inside this area match the approx lat and lng clause. For future improvement, and that is what I'm trying to do, it would be stupid to ignore that fact. As I said, I need only 2 rows, so each additional row is wasted performance, but LIMIT 20 - 50 - 100 is still OK in comparison to 10k+. – delato468 Oct 02 '18 at 21:28
  • The problem with LIMIT 2 on the outer query is that after joining it does not work anymore. So I do all the important calculations inside the subquery and join the other tables later. I could try to run the HAVING and distance calculations inside the subquery; maybe that could work, but I need to test whether that would break the LIMIT 2 inside the subquery. Respectively, as Willem said, the IF will do that 100%. – delato468 Oct 02 '18 at 21:36

2 Answers


Use a bounding box in your query.

Example (only changes are in the WHERE clause):

SELECT 
    e.id, e.uid, e.title, e.description, l.place, l.placenonce, al.altplace, al.altplacenonce,
    IF(e.altloc=0,
        6371 * acos( cos( radians(:lat) ) * cos( radians( AES_DECRYPT(lat, UNHEX('###'), latnonce) ) ) * cos( radians( AES_DECRYPT(lng, UNHEX('###'), lngnonce) ) - radians(:lng) ) + sin( radians(:lat) ) * sin(radians(AES_DECRYPT(lat, UNHEX('###'), latnonce))) ) ,
        6371 * acos( cos( radians(:lat) ) * cos( radians( AES_DECRYPT(altlat, UNHEX('###'), altlatnonce) ) ) * cos( radians( AES_DECRYPT(altlng, UNHEX('###'), altlngnonce) ) - radians(:lng) ) + sin( radians(:lat) ) * sin(radians(AES_DECRYPT(altlat, UNHEX('###'), altlatnonce))) )
    ) AS distance
FROM 
    entrys e 
INNER JOIN 
    location l 
    ON l.id = e.uid 
LEFT JOIN
    altlocation al
    ON al.eid = e.id
WHERE 
    e.country = :countryid
    AND l.approx_lat > :min_lat
    AND l.approx_lat < :max_lat
    AND l.approx_lng > :min_lng
    AND l.approx_lng < :max_lng    
HAVING 
    distance <= 50
ORDER BY 
    e.tmstmp 
DESC
LIMIT 2

You would compute :min_lat, :max_lat, :min_lng, and :max_lng before you execute the query. Those values would be generated from the desired radius (in this case, 50 km) around your :lat and :lng values.

For how exactly to do that, I suggest reading one of the many other answers, such as this one, that are all over the internet. Just search for geolocation bounding box to get started.
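As an illustration, the four bounds can be computed on the application side before binding the parameters. A minimal Python sketch (the function name and the example coordinates are mine, and it assumes a spherical Earth of radius 6371 km, matching the constant already used in the distance formula):

```python
import math

def bounding_box(lat, lng, radius_km, earth_radius_km=6371.0):
    """Return (min_lat, max_lat, min_lng, max_lng) for a box that
    fully contains the circle of radius_km around (lat, lng).

    The latitude delta is radius/R converted to degrees; the
    longitude delta widens by 1/cos(lat) away from the equator.
    Hypothetical helper - not part of the original answer.
    """
    dlat = math.degrees(radius_km / earth_radius_km)
    dlng = math.degrees(radius_km / (earth_radius_km * math.cos(math.radians(lat))))
    return lat - dlat, lat + dlat, lng - dlng, lng + dlng

# Example: a 50 km box around 52.52 N, 13.405 E
min_lat, max_lat, min_lng, max_lng = bounding_box(52.52, 13.405, 50)
```

Note that this simple formula breaks down near the poles and across the antimeridian; for the latitudes in question it is accurate enough for a pre-filter, since the exact HAVING distance check still runs afterwards.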

Then, you can further improve performance by adding an index on the approx_lat and approx_lng columns. You can also try adding a couple composite indexes, of (approx_lat,approx_lng) and/or (approx_lng,approx_lat), as the optimizer MAY be able to make use of those. However, those are things I would strongly advise benchmarking to see if they provide any improvement or not. Additional columns to make these covering indexes may also help, but I'm focusing on the most basic issues at the moment.

Note that what you are trying to optimize is already a difficult optimization problem. The fact that you need to encrypt your data, makes it even harder. However, so long as you can store these approximate values, we can bypass most of that extra difficulty.

I would also strongly advise you to keep IF logic out of your WHERE clause. By including it, you force the optimizer to look up every single record to see whether it matches that condition or not.

In general, to get good performance, you need to limit the number of records that need to be checked. The IF statement cannot be optimized (it is not sargable). This is also why my answer requires you to store approximate values in order to be effective. If the data has to be decrypted first, that means that every single record has to be looked up and checked. That is what is going to kill your performance.
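For example, the non-sargable `IF(:border = 0, e.country = :countryid, e.country != 0)` from the question can be replaced by choosing a plain predicate in application code before preparing the statement. A sketch (the helper name is mine, not from the question):

```python
def entry_filter(border: int) -> str:
    """Return an indexable WHERE predicate for the entrys query.

    Hypothetical helper: picking the predicate here keeps IF() out
    of the WHERE clause, so MySQL can use the index on entrys.country
    when border == 0.
    """
    if border == 0:
        return "e.country = :countryid"
    return "e.country != 0"
```

The two resulting statements can then be prepared separately, each getting its own query plan.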

Also note that in my example query I ignored the altlocation table in the WHERE clause. Ideally, if location and altlocation are identical, you should have only a single table for the data, and then JOIN to that table from a record that marks a location id as being either the primary or an "alternate" location.

I hope this can at least point you in the right direction.

Willem Renzema
  • That's really a nice and detailed answer, thank you very much. I still have a question. As I understood it, I need to leave the data unencrypted and without decimal places? For altlocation vs. location it depends on whether entry.altloc is 1 or 0 -> 1: take the value from table altlocation, 0: take the value from table location (that's an important part of the query). And the IF inside the WHERE clause restricts the search to a country or not ("take rows only inside the country / or don't"). – delato468 Sep 13 '18 at 07:09
  • The decimal place problem I could bypass simply by leaving out the comma and making sure all values have the same length. Does the IF in the SELECT clause not harm performance? – delato468 Sep 13 '18 at 07:17
  • I also added a modified version of your code; would my example be more efficient? – delato468 Sep 13 '18 at 11:31
  • @delato468 I believe your edited version would be an improvement. It is important to always benchmark any changes you make. As for whether you still need `HAVING distance <= 50`: yes, you still need that part, as with the bounding box you will have many results outside that distance. The bounding box narrows down the number of potential results, which provides a great performance boost, but then the rest of your query (the SELECT and HAVING parts) narrows down the results the rest of the way, to get just those within the correct distance radius. – Willem Renzema Sep 13 '18 at 15:44
  • More than that, the bounding box lets `INDEX(lat)` _or_ `INDEX(lng)` be used. – Rick James Sep 30 '18 at 01:59
  • Even better would be `INDEX(country, lat), INDEX(country, lng)` – Rick James Sep 30 '18 at 02:02
  • INDEX(approx_lat) and INDEX(approx_lng) are fine; lat and lng aren't necessary performance-wise, but INDEX(country) is important too, yes. – delato468 Oct 02 '18 at 06:56

(A partial Answer.)

A useful tip for subqueries (sometimes).

  • Note that several bulky columns (uid, title, description) are hauled around in the subquery.
  • There is an ORDER BY and LIMIT, so hauling them around takes some effort.

So,

  • Use the minimum number of columns in the subquery, being sure to include the row id.
  • After the subquery, add a JOIN (via the id) to get those extra columns.
  • Furthermore have a "covering" index that contains all the columns that remain in the subquery: INDEX(approx_lat, approx_lng, tmstmp, id)
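Applied to the schema in the question, the pattern might look roughly like this (a sketch of my reading of the tips above, not a tested query; the bounding-box parameters come from Willem's answer):

```sql
-- Suggested covering index: every column the subquery touches.
ALTER TABLE entrys ADD INDEX bbox (approx_lat, approx_lng, tmstmp, id);

SELECT e.id, e.uid, e.title, e.description
FROM (
    -- Only the narrow indexed columns pass through ORDER BY/LIMIT here.
    SELECT id, tmstmp
    FROM entrys
    WHERE approx_lat BETWEEN :min_lat AND :max_lat
      AND approx_lng BETWEEN :min_lng AND :max_lng
    ORDER BY tmstmp DESC
    LIMIT 2
) AS best
JOIN entrys e ON e.id = best.id   -- fetch the bulky columns afterwards
ORDER BY best.tmstmp DESC;
```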
Rick James