Mysql where between query optimization

Question

Below is the format of the database of Autonomous System Numbers ( download and parsed from this site! ).

range_start  range_end  number  cc  provider
-----------  ---------  ------  --  -------------------------------------
   16778240   16778495   56203  AU  AS56203 - BIGRED-NET-AU Big Red Group
   16793600   16809983   18144      AS18144

745465 total rows

A Normal query looks like this:

select * from table where 3232235520 BETWEEN range_start AND range_end

Works properly but I query a huge number of IPs to check for their AS information which ends up taking too many calls and time.

Profiler Snapshot:

Blackfire profiler snapshot

I've two indexes:

id column
a combine index on the range_start and range_end column as both the make unique row.

Questions:

Is there a way to query a huge number of IPs in a single query?
- multiple where (IP between range_start and range_end) OR where (IP between range_start and range_end) OR ... works but I can't get the IP -> row mapping or which rows are retrieved for which IP.
Any suggestions to change the database structure to optimize the query speed and decrease the time?

Any help will be appreciated! Thanks!

`select * from table where 3232235520 BETWEEN range_start AND range_end` I don't understand how this is a normal query. Where does 3232235520 come from? That said, if (range_start,range_end) is indexed then these queries should be blisteringly quick. — Strawberry, Feb 28 '17 at 17:00
@Strawberry: `3232235520` is a decimal representation of the IP address that Amar wants to search for. IPv4 addresses are 32-bit integers. `0xC0A80000` is `192.168.0.0`. OP issue is that even though the execution of a single query is blazingly fast, OP is running a bazillion of those individual queries. One IP address at at time. And we know that processing RBAR (row by agonizing row) is slow. Running a bazillion blazingly fast queries is very expensive. And still very slow. — spencer7593, Feb 28 '17 at 17:08
@Strawberry I do the IP conversion to decimal form `3232235520` before querying. — Amar Myana, Feb 28 '17 at 17:23
If your table is huge, see [_my solution_](https://mariadb.com/kb/en/ip-range-table-performance/) — Rick James, Mar 01 '17 at 23:20

spencer7593 · Accepted Answer · 2017-03-02T01:02:59.283

It is possible to query more than one IP address. Several approaches we could take. Assuming range_start and range_end are defined as integer types.

For a reasonable number of ip addresses, we could use an inline view:

 SELECT i.ip, a.*
   FROM (           SELECT 3232235520 AS ip
          UNION ALL SELECT 3232235521
          UNION ALL SELECT 3232235522
          UNION ALL SELECT 3232235523
          UNION ALL SELECT 3232235524
          UNION ALL SELECT 3232235525
        ) i
   LEFT 
   JOIN ip_to_asn a
     ON a.range_start <= i.ip
    AND a.range_end   >= i.ip
  ORDER BY i.ip

This approach will work for a reasonable number of IP addresses. The inline view could be extended with more UNION ALL SELECT to add additional IP addresses. But that's not necessarily going to work for a "huge" number.

When we get "huge", we're going to run into limitations in MySQL... maximum size of a SQL statement limited by max_allowed_packet, there may be a limit on the number of SELECT that can appear.

The inline view could be replaced with a temporary table, built first.

 DROP TEMPORARY TABLE IF EXISTS _ip_list_;
 CREATE TEMPORARY TABLE _ip_list_ (ip BIGINT NOT NULL PRIMARY KEY) ENGINE=InnoDB;
 INSERT INTO _ip_list_ (ip) VALUES (3232235520),(3232235521),(3232235522),...;
 ...
 INSERT INTO _ip_list_ (ip) VALUES (3232237989),(3232237990);

Then reference the temporary table in place of the inline view:

 SELECT i.ip, a.*
   FROM _ip_list_ i
   LEFT
   JOIN ip_to_asn a
     ON a.range_start <= i.ip
    AND a.range_end   >= i.ip
  ORDER BY i.ip ;

And then drop the temporary table:

 DROP TEMPORARY TABLE IF EXISTS _ip_list_ ;

Some other notes:

Churning database connections is going to degrade performance. There's a significant amount overhead in establishing and tearing down a connection. That overhead get noticeable if the application is repeatedly connecting and disconnecting, if its doing that for every SQL statement being issued.

And running an individual SQL statement also has overhead... the statement has to be sent to the server, the statement parsed for syntax, evaluated from semantics, choose an execution plan, execute the plan, prepare a resultset, return the resultset to the client. And this is why it's more efficient to process set wise rather than row wise. Processing RBAR (row by agonizing row) can be very slow, compared to sending a statement to the database and letting it process a set in one fell swoop.

But there's a tradeoff there. With ginormous sets, things can start to get slow again.

Even if you can process two IP addresses in each statement, that halves the number of statements that need to be executed. If you do 20 IP addresses in each statement, that cuts down the number of statements to 5% of the number that would be required a row at a time.

And the composite index already defined on (range_start,range_end) is appropriate for this query.

FOLLOWUP

As Rick James points out in a comment, the index I earlier said was "appropriate" is less than ideal.

We could write the query a little differently, that might make more effective use of that index.

If (range_start,range_end) is UNIQUE (or PRIMARY) KEY, then this will return one row per IP address, even when there are "overlapping" ranges. (The previous query would return all of the rows that had a range_start and range_end that overlapped with the IP address.)

 SELECT t.ip, a.*
   FROM ( SELECT s.ip
               , s.range_start
               , MIN(e.range_end) AS range_end
            FROM ( SELECT i.ip
                        , MAX(r.range_start) AS range_start
                     FROM _ip_list_ i
                     LEFT
                     JOIN ip_to_asn r
                       ON r.range_start <= i.ip
                    GROUP BY i.ip
                 ) s
            LEFT
            JOIN ip_to_asn e
              ON e.range_start = s.range_start
             AND e.range_end  >= s.ip
           GROUP BY s.ip, s.range_start
        ) t
   LEFT
   JOIN ip_to_asn a
     ON a.range_start = t.range_start
    AND a.range_end   = t.range_end
  ORDER BY t.ip ;

With this query, for the innermost inline view query s, the optimizer might be able to make effective use of an index with a leading column of range_start, to quickly identify the "highest" value of range_start (that is less than or equal to the IP address). But with that outer join, and with the GROUP BY on i.ip, I'd really need to look at the EXPLAIN output; it's only conjecture what the optimizer might do; what is important is what the optimizer actually does.)

Then, for inline view query e, MySQL might be able to make more effective use of the composite index on (range_start,range_end), because of the equality predicate on the first column, and the inequality condition on MIN aggregate on the second column.

For the outermost query, MySQL will surely be able to make effective use of the composite index, due to the equality predicates on both columns.

A query of this form might show improved performance, or performance might go to hell in a handbasket. The output of EXPLAIN should give a good indication of what's going on. We'd like to see "Using index for group-by" in the Extra column, and we only want to see a "Using filesort" for the ORDER BY on the outermost query. (If we remove the ORDER BY clause, we want to not see "Using filesort" in the Extra column.)

Another approach is to make use of correlated subqueries in the SELECT list. The execution of correlated subqueries can get expensive when the resultset contains a large number of rows. But this approach can give satisfactory performance for some use cases.

This query depends on no overlapping ranges in the ip_to_asn table, and this query will not produce the expected results when overlapping ranges exist.

 SELECT t.ip, a.*
   FROM ( SELECT i.ip
               , ( SELECT MAX(s.range_start)
                     FROM ip_to_asn s
                    WHERE s.range_start <= i.ip
                 ) AS range_start
               , ( SELECT MIN(e.range_end)
                     FROM ip_to_asn e
                    WHERE e.range_end >= i.ip
                 ) AS range_end
            FROM _ip_list_ i
        ) r
   LEFT 
   JOIN ip_to_asn a
     ON a.range_start = r.range_start
    AND a.range_end   = r.range_end

As a demonstration of why overlapping ranges will be a problem for this query, given a totally goofy, made up example

range_start  range_end 
-----------  ---------
       .101       .160
       .128       .244

Given an IP address of .140, the MAX(range_start) subquery will find .128, the MIN(range_end) subquery will find .160, and then the outer query will attempt to find a matching row range_start=.128 AND range_end=.160. And that row just doesn't exist.

Yes, range_start and range_end are integer types. I've tested both the specified methods and they work correctly. I'm implementing the temporary table creation method as its taking lesser time than the inline view for more number of IPs. For more than 100 IPs: Inline view took **53.4641 seconds** Temporary table took **16 seconds** — Amar Myana, Feb 28 '17 at 17:32
@AmarMyana: The results you report look reasonable. For the temporary table approach, I'd be sure make use of the multi-row insert. And be sure to include the cost (time) for the extra statements to create the temporary table and populate it, in comparing performance. (The inline view is performing a similar operation in materializing a derived table.) — spencer7593, Feb 28 '17 at 18:01
Such a composite index is not as useful as you would like. The optimizer does not know whether there could be overlapping ranges. — Rick James, Mar 01 '17 at 23:18
@RickJames: MySQL can do a range scan with the leading column `range_start`. It's not ideal, given an IP that falls in the midpoint of the index, that's only eliminating half of the rows. All of the rows that satisfy the condition on `range_start` will need to be checked. I'd hope that MySQL could use the second column in the index to evaluate the condition on `range_end`, even if it is a full scan through the index entries that weren't eliminated. That would avoid lookups to pages in the underlying table. If this is InnoDB table, if the cluster key has `range_start` as the leading column, ... — spencer7593, Mar 01 '17 at 23:44
(cont.) the secondary index doesn't buy us much. Otherwise, the best we can hope for (given the current table definition) would be a covering index. Some queries for really "low" range_start should perform better than queries for "high" range_start. With an even distribution, on average, a lookup of an IP will require inspection of half of the entries. Not ideal. But that's better than a full scan of *every* row. We can suggest some changes to optimize, but my answer was really mostly about avoiding processing RBAR, one excruciating IP at a time. — spencer7593, Mar 01 '17 at 23:48
Answer updated with **FOLLOWUP** section to provide some alternative query patterns, which may have different execution plans, and therefore exhibit different performance profiles. All of these queries use the existing table definition and performance will depend on suitable B-tree indexes being available. All of the followup queries are assuming an `_ip_list_` table which is pre-populated with the IP addresses we're searching for. — spencer7593, Mar 02 '17 at 01:07

score 0 · Answer 2 · edited May 23 '17 at 12:10

0

You can compare IP ranges using MySQL. This question might contain an answer you're looking for: MySQL check if an IP-address is in range?

SELECT * FROM TABLE_NAME WHERE (INET_ATON("193.235.19.255") BETWEEN INET_ATON(ipStart) AND INET_ATON(ipEnd));

You will likely want to index your database. This optimizes the time it takes to search your database, similar to the index you will find in the back of a textbook, but for databases:
```
ALTER TABLE `table` ADD INDEX `name` (`column_id`)
```

EDIT: Apparently INET_ATON cannot be used on indexed databases, so you would have to pick one of these!

edited May 23 '17 at 12:10

Community

1
1

answered Feb 28 '17 at 16:31

robere2

1,689
2
16
26

1

INET_ATON(column) cannot use an index – Strawberry Feb 28 '17 at 17:00
Dang you're right! I edited my post to include that. – robere2 Feb 28 '17 at 17:24
It's in good company. In MySQL, functions cannot use indexes. – Strawberry Feb 28 '17 at 17:35
2

Solution: Store the the output of `INET_ATON()` in ipStart and ipEnd, not the string. – Rick James Mar 01 '17 at 23:17

score 0 · Answer 3 · edited May 23 '17 at 10:29

This is a duplicate of the question here however I'm not voting to close it, as the accepted answer in that question is not very helpful; the answer by Quassnoi is much better (but it only links to the solution).

A linear index is not going to help resolve a database of ranges. The solution is to use geospatial indexing (available in MySQL and other DBMS). An added complication is that MySQL geospatial indexing only works in 2 dimensions (while you have a 1-D dataset) so you need to map this to 2-dimensions.

Hence:

CREATE TABLE IF NOT EXISTS `inetnum` (
  `from_ip` int(11) unsigned NOT NULL,
  `to_ip` int(11) unsigned NOT NULL,
  `netname` varchar(40) default NULL,
  `ip_txt` varchar(60) default NULL,
  `descr` varchar(60) default NULL,
  `country` varchar(2) default NULL,
  `rir` enum('APNIC','AFRINIC','ARIN','RIPE','LACNIC') NOT NULL default 'RIPE',
  `netrange` linestring NOT NULL,
  PRIMARY KEY  (`from_ip`,`to_ip`),
  SPATIAL KEY `rangelookup` (`netrange`)
) ENGINE=MyISAM DEFAULT CHARSET=ascii;

Which might be populated with....

INSERT INTO inetnum
(from_ip, to_ip
 , netname, ip_txt, descr, country
 , netrange)
VALUES
(INET_ATON('127.0.0.0'), INET_ATON('127.0.0.2') 
 , 'localhost','127.0.0.0-127.0.0.2', 'Local Machine', '.',
GEOMFROMWKB(POLYGON(LINESTRING(
   POINT(INET_ATON('127.0.0.0'), -1), 
   POINT(INET_ATON('127.0.0.2'), -1),
   POINT(INET_ATON('127.0.0.2'), 1), 
   POINT(INET_ATON('127.0.0.0'), 1),
   POINT(INET_ATON('127.0.0.0'), -1))))
);

Then you might want to create a function to wrap the rather verbose SQL....

DROP FUNCTION `netname2`//
CREATE DEFINER=`root`@`localhost` FUNCTION `netname2`(p_ip VARCHAR(20) CHARACTER SET ascii) RETURNS varchar(80) CHARSET ascii
   READS SQL DATA
   DETERMINISTIC
BEGIN
  DECLARE l_netname varchar(80);

  SELECT CONCAT(country, '/',netname)
    INTO l_netname
  FROM inetnum
  WHERE MBRCONTAINS(netrange, GEOMFROMTEXT(CONCAT('POINT(', INET_ATON(p_ip), ' 0)')))
  ORDER BY (to_ip-from_ip)
  LIMIT 0,1;

  RETURN l_netname;
END

And therefore:

SELECT netname2('127.0.0.1');

./localhost

Which uses the index:

id  select_type     table   type    possible_keys   key     key_len     ref     rows    Extra
1   SIMPLE  inetnum     range   rangelookup     rangelookup     34  NULL    1   Using where; Using filesort

(and takes around 10msec to find a record from the combined APNIC,AFRINIC,ARIN,RIPE and LACNIC datasets on the very low spec VM I'm using here)

Mysql where between query optimization

3 Answers3