I am looking for help with optimizing my BigQuery query.
I have 2 tables:
1) ips
with one column: ip
which is IP-like string, e.g. 192.168.4.6
. It has m rows
----------------
| ip |
----------------
| 73.14.170.37 |
----------------
| 5.14.121.34 |
----------------
| 61.22.122.67 |
---------------
2) ranges
with one column: range
which is CIDR-like string, e.g. 192.168.128/28
with n rows. Ranges never overlap.
-------------------
| range |
-------------------
| 42.126.124.0/24 |
-------------------
| 2.36.127.0/24 |
------------------
| 12.59.78.0/23 |
-------------------
m = around 100M
n = around 65K
So both are large and m >> n
My goal is to find all IPs from ips
table that belong to any range in range
table.
My approach:
I created 2 intermediary table:
1) ip_num
with distinct numeric (INT64) ip
column computed from the ips table, ordered by ip
.
------------
| ip_num |
-----------
| 16753467 |
------------
| 16753469 |
------------
| 16753474 |
------------
2) ranges_num
with 2 columns: start_ip
and end_ip
(both INT64) which were calculated based on CIDR ranges. This column is ordered by start_ip
.
-----------------------
| start_ip | end_ip |
-----------------------
| 16753312 | 16753316 |
-----------------------
| 16753569 | 16753678 |
-----------------------
| 16763674 | 16763688 |
-----------------------
Both tables use numeric format because I hoped that comparing numbers has better performance. Conversion was done with NET.IPV4_TO_INT64 and NET.IP_FROM_STRING.
Producing these 2 tables is reasonably fast.
Now my last step is join on these 2 tables:
select ip from ip_num JOIN ranges_num ON ip BETWEEN start_ip AND end_ip;
This last query takes a lot of time (around 30 minutes) and then produces no results but I get Query exceeded resource limits
error, probably because it takes too long.
So my questions here are:
- Can I make it faster?
- Is my intuition correct that the last query generates n * m joins and query optimizer fails to leverage ordering of ranges_num, effectively generating O(n*m) complexity? If I had these 2 structures in memory of a program rather than in relational DB, it is relatively simple to write O(m+n) algorithm with 2 iterators, one on each table. But I don't know how to express it in standard SQL (if possible) and maybe built-in query optimizer should be able to do derive this algorithm automatically.
- Are there any tools for BigQuery (or any other) that would help me understand query optimizer?
I am not an SQL expert and I am new to BigQuery, so I'll be grateful for any tips.