It is possible to query more than one IP address; there are several approaches we could take, assuming range_start and range_end are defined as integer types.
For a reasonable number of IP addresses, we could use an inline view:
SELECT i.ip, a.*
FROM ( SELECT 3232235520 AS ip
UNION ALL SELECT 3232235521
UNION ALL SELECT 3232235522
UNION ALL SELECT 3232235523
UNION ALL SELECT 3232235524
UNION ALL SELECT 3232235525
) i
LEFT
JOIN ip_to_asn a
ON a.range_start <= i.ip
AND a.range_end >= i.ip
ORDER BY i.ip
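The integer values in that inline view are the usual 32-bit encodings of dotted-quad addresses; 3232235520 is 192.168.0.0, the same value MySQL's INET_ATON() returns. A quick sketch of the conversion (Python here purely for illustration):

```python
import ipaddress

# 192.168.0.0 as a 32-bit integer, the value INET_ATON('192.168.0.0')
# returns in MySQL: 192*2^24 + 168*2^16 + 0*2^8 + 0
ip_int = int(ipaddress.IPv4Address("192.168.0.0"))
print(ip_int)  # 3232235520

# And back again, like INET_NTOA(3232235520)
print(ipaddress.IPv4Address(3232235520))  # 192.168.0.0
```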
This approach will work for a reasonable number of IP addresses; the inline view can be extended with additional UNION ALL SELECT lines to add more of them. But that's not necessarily going to work for a "huge" number. When we get "huge", we run into limitations in MySQL: the maximum size of a SQL statement is limited by max_allowed_packet, and there may be a limit on the number of SELECT statements that can appear in a single query.
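To see how much headroom a session actually has for statement size, we can check the server variable (it's configurable, so the value varies by installation):

```sql
SHOW VARIABLES LIKE 'max_allowed_packet';
```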
The inline view could be replaced with a temporary table, built first.
DROP TEMPORARY TABLE IF EXISTS _ip_list_;
CREATE TEMPORARY TABLE _ip_list_ (ip BIGINT NOT NULL PRIMARY KEY) ENGINE=InnoDB;
INSERT INTO _ip_list_ (ip) VALUES (3232235520),(3232235521),(3232235522),...;
...
INSERT INTO _ip_list_ (ip) VALUES (3232237989),(3232237990);
Then reference the temporary table in place of the inline view:
SELECT i.ip, a.*
FROM _ip_list_ i
LEFT
JOIN ip_to_asn a
ON a.range_start <= i.ip
AND a.range_end >= i.ip
ORDER BY i.ip ;
And then drop the temporary table:
DROP TEMPORARY TABLE IF EXISTS _ip_list_ ;
Some other notes:
Churning database connections is going to degrade performance. There's a significant amount of overhead in establishing and tearing down a connection, and that overhead gets noticeable when the application is repeatedly connecting and disconnecting, especially if it's doing that for every SQL statement it issues.
And running an individual SQL statement also has overhead: the statement has to be sent to the server, parsed for syntax, evaluated for semantics, an execution plan chosen, the plan executed, a resultset prepared, and the resultset returned to the client. This is why it's more efficient to process set-wise rather than row-wise. Processing RBAR (row by agonizing row) can be very slow compared to sending one statement to the database and letting it process a set in one fell swoop.
But there's a tradeoff there. With ginormous sets, things can start to get slow again.
Even if you can process two IP addresses in each statement, that halves the number of statements that need to be executed. If you do 20 IP addresses in each statement, that cuts down the number of statements to 5% of the number that would be required a row at a time.
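The batching idea can be sketched like this (the function name is made up, and batch_size is an arbitrary knob to keep each statement comfortably under max_allowed_packet):

```python
def chunked_insert_statements(ips, batch_size=1000):
    """Yield one multi-row INSERT per batch of integer IP addresses.

    Table and column names (_ip_list_, ip) match the temporary table
    above; batch_size is tuned so each generated statement stays well
    under max_allowed_packet.
    """
    for i in range(0, len(ips), batch_size):
        batch = ips[i:i + batch_size]
        values = ",".join("({0})".format(ip) for ip in batch)
        yield "INSERT INTO _ip_list_ (ip) VALUES {0};".format(values)

ips = list(range(3232235520, 3232235526))
stmts = list(chunked_insert_statements(ips, batch_size=4))
print(len(stmts))  # 2 statements instead of 6 single-row inserts
```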
And the composite index already defined on (range_start,range_end) is appropriate for this query.
FOLLOWUP
As Rick James points out in a comment, the index I earlier said was "appropriate" is less than ideal.
We could write the query a little differently, in a way that might make more effective use of that index.
If (range_start,range_end) is a UNIQUE (or PRIMARY) KEY, then this will return one row per IP address, even when there are "overlapping" ranges. (The previous query would return all of the rows whose range_start and range_end overlapped with the IP address.)
SELECT t.ip, a.*
FROM ( SELECT s.ip
, s.range_start
, MIN(e.range_end) AS range_end
FROM ( SELECT i.ip
, MAX(r.range_start) AS range_start
FROM _ip_list_ i
LEFT
JOIN ip_to_asn r
ON r.range_start <= i.ip
GROUP BY i.ip
) s
LEFT
JOIN ip_to_asn e
ON e.range_start = s.range_start
AND e.range_end >= s.ip
GROUP BY s.ip, s.range_start
) t
LEFT
JOIN ip_to_asn a
ON a.range_start = t.range_start
AND a.range_end = t.range_end
ORDER BY t.ip ;
With this query, for the innermost inline view s, the optimizer might be able to make effective use of an index with a leading column of range_start to quickly identify the "highest" value of range_start that is less than or equal to the IP address. But with that outer join, and with the GROUP BY on i.ip, I'd really need to look at the EXPLAIN output; it's only conjecture what the optimizer might do, and what matters is what the optimizer actually does.
Then, for inline view e, MySQL might be able to make more effective use of the composite index on (range_start,range_end), because of the equality predicate on the first column and the MIN aggregate with the inequality condition on the second column.
For the outermost query, MySQL will surely be able to make effective use of the composite index, due to the equality predicates on both columns.
A query of this form might show improved performance, or performance might go to hell in a handbasket. The output of EXPLAIN should give a good indication of what's going on. We'd like to see "Using index for group-by" in the Extra column, and we only want to see a "Using filesort" for the ORDER BY on the outermost query. (If we remove the ORDER BY clause, we want to not see "Using filesort" in the Extra column.)
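As a sanity check of the one-row-per-IP claim (not a performance test), here is the same query shape run against SQLite as a convenient stand-in for MySQL, with two deliberately overlapping ranges and small made-up integers:

```python
import sqlite3

# Two overlapping ranges; the query should still return exactly one
# row for IP 140: the range with the largest qualifying range_start.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE ip_to_asn (range_start INT, range_end INT, asn TEXT);
    INSERT INTO ip_to_asn VALUES (101, 160, 'AS-A'), (128, 244, 'AS-B');
    CREATE TABLE _ip_list_ (ip INT PRIMARY KEY);
    INSERT INTO _ip_list_ VALUES (140);
""")
rows = con.execute("""
    SELECT t.ip, a.asn
    FROM ( SELECT s.ip, s.range_start, MIN(e.range_end) AS range_end
           FROM ( SELECT i.ip, MAX(r.range_start) AS range_start
                  FROM _ip_list_ i
                  LEFT JOIN ip_to_asn r ON r.range_start <= i.ip
                  GROUP BY i.ip ) s
           LEFT JOIN ip_to_asn e
             ON e.range_start = s.range_start AND e.range_end >= s.ip
           GROUP BY s.ip, s.range_start ) t
    LEFT JOIN ip_to_asn a
      ON a.range_start = t.range_start AND a.range_end = t.range_end
    ORDER BY t.ip
""").fetchall()
print(rows)  # [(140, 'AS-B')] -- one row, despite the overlap
```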
Another approach is to make use of correlated subqueries in the SELECT list. The execution of correlated subqueries can get expensive when the resultset contains a large number of rows. But this approach can give satisfactory performance for some use cases.
This query depends on there being no overlapping ranges in the ip_to_asn table; it will not produce the expected results when overlapping ranges exist.
SELECT t.ip, a.*
FROM ( SELECT i.ip
, ( SELECT MAX(s.range_start)
FROM ip_to_asn s
WHERE s.range_start <= i.ip
) AS range_start
, ( SELECT MIN(e.range_end)
FROM ip_to_asn e
WHERE e.range_end >= i.ip
) AS range_end
FROM _ip_list_ i
) r
LEFT
JOIN ip_to_asn a
ON a.range_start = r.range_start
AND a.range_end = r.range_end
As a demonstration of why overlapping ranges will be a problem for this query, consider a totally goofy, made-up example:
range_start range_end
----------- ---------
.101 .160
.128 .244
Given an IP address of .140, the MAX(range_start) subquery will find .128, the MIN(range_end) subquery will find .160, and then the outer query will attempt to find a matching row with range_start=.128 AND range_end=.160. And that row just doesn't exist.
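That failure is easy to reproduce; here's the correlated-subquery shape against SQLite as a stand-in for MySQL, with bare integers (101, 160, ...) standing in for the .101, .160 octets above:

```python
import sqlite3

# Overlapping ranges break the correlated-subquery approach: the two
# aggregates derive a (range_start, range_end) pair that no row has.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE ip_to_asn (range_start INT, range_end INT);
    INSERT INTO ip_to_asn VALUES (101, 160), (128, 244);
""")
row = con.execute("""
    SELECT r.range_start, r.range_end, a.range_start IS NOT NULL AS matched
    FROM ( SELECT ( SELECT MAX(s.range_start) FROM ip_to_asn s
                    WHERE s.range_start <= 140 ) AS range_start
              , ( SELECT MIN(e.range_end) FROM ip_to_asn e
                    WHERE e.range_end >= 140 ) AS range_end
         ) r
    LEFT JOIN ip_to_asn a
      ON a.range_start = r.range_start
     AND a.range_end = r.range_end
""").fetchone()
# The subqueries derive (128, 160), a row that doesn't exist: no match
print(row)  # (128, 160, 0)
```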