
I am looking for help with optimizing my BigQuery query.

I have 2 tables:

1) ips with one column, ip, which is an IP-like string, e.g. 192.168.4.6. It has m rows:

----------------
|   ip         |
----------------
| 73.14.170.37 |
----------------
| 5.14.121.34  |
----------------
| 61.22.122.67 |
----------------

2) ranges with one column, range, which is a CIDR-like string, e.g. 192.168.128.0/28. It has n rows. Ranges never overlap.

-------------------
|      range      |
-------------------
| 42.126.124.0/24 |
-------------------
| 2.36.127.0/24   |
-------------------
| 12.59.78.0/23   |
-------------------

m = around 100M

n = around 65K

So both are large, and m >> n.

My goal is to find all IPs from ips table that belong to any range in range table.

My approach:

I created 2 intermediary tables:

1) ip_num with a single distinct numeric (INT64) ip column computed from the ips table, ordered by ip.

------------
|  ip_num  |
-----------
| 16753467 |
------------
| 16753469 |
------------
| 16753474 |
------------

2) ranges_num with 2 columns: start_ip and end_ip (both INT64), calculated from the CIDR ranges. This table is ordered by start_ip.

-----------------------
| start_ip |  end_ip  |
-----------------------
| 16753312 | 16753316 |
-----------------------
| 16753569 | 16753678 | 
-----------------------
| 16763674 | 16763688 |
-----------------------

Both tables use a numeric format because I hoped that comparing numbers would perform better than comparing strings. The conversion was done with NET.IPV4_TO_INT64 and NET.IP_FROM_STRING.
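For completeness, I produced the two intermediate tables roughly like this (a simplified sketch; `my_dataset` is a placeholder, and it assumes every range string carries a full dotted-quad network base plus an explicit prefix length):

```sql
-- ip_num: distinct IPs as INT64; NET.SAFE_IP_FROM_STRING returns NULL
-- (instead of erroring) on malformed strings, which are filtered out.
CREATE TABLE my_dataset.ip_num AS
SELECT DISTINCT NET.IPV4_TO_INT64(NET.SAFE_IP_FROM_STRING(ip)) AS ip
FROM my_dataset.ips
WHERE NET.SAFE_IP_FROM_STRING(ip) IS NOT NULL;

-- ranges_num: each CIDR string becomes an inclusive [start_ip, end_ip]
-- pair; a /p prefix covers 2^(32-p) addresses.
CREATE TABLE my_dataset.ranges_num AS
SELECT
  NET.IPV4_TO_INT64(NET.IP_FROM_STRING(SPLIT(range, '/')[OFFSET(0)])) AS start_ip,
  NET.IPV4_TO_INT64(NET.IP_FROM_STRING(SPLIT(range, '/')[OFFSET(0)]))
    + (1 << (32 - CAST(SPLIT(range, '/')[OFFSET(1)] AS INT64))) - 1 AS end_ip
FROM my_dataset.ranges;
```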

Producing these 2 tables is reasonably fast.

Now my last step is join on these 2 tables:

SELECT ip FROM ip_num JOIN ranges_num ON ip BETWEEN start_ip AND end_ip;

This last query runs for a long time (around 30 minutes) and then produces no results; instead I get a "Query exceeded resource limits" error, probably because it runs too long.

So my questions here are:

  • Can I make it faster?
  • Is my intuition correct that the last query generates m * n comparisons, and that the query optimizer fails to leverage the ordering of ranges_num, resulting in O(m*n) complexity? If I had these 2 structures in the memory of a program rather than in a relational DB, it would be relatively simple to write an O(m+n) algorithm with 2 iterators, one on each table. But I don't know how to express that in standard SQL (if it's possible at all), and maybe the built-in query optimizer should be able to derive this algorithm automatically.
  • Are there any tools for BigQuery (or any other) that would help me understand query optimizer?
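To make the in-memory idea concrete: because the ranges never overlap, each IP has at most one candidate range, namely the one with the largest start_ip <= ip. I imagine this could be expressed in BigQuery with RANGE_BUCKET, which binary-searches a sorted array, though I have not verified it (an untested sketch; `my_dataset` and the table shapes are as described above):

```sql
-- RANGE_BUCKET(ip, starts) returns how many start_ips are <= ip, i.e. the
-- 1-based index of the only range that could contain ip (0 = before all).
WITH starts AS (
  SELECT ARRAY_AGG(start_ip ORDER BY start_ip) AS arr
  FROM my_dataset.ranges_num
),
candidates AS (
  SELECT i.ip, RANGE_BUCKET(i.ip, s.arr) AS idx
  FROM my_dataset.ip_num AS i
  CROSS JOIN starts AS s
),
numbered AS (
  SELECT start_ip, end_ip,
         ROW_NUMBER() OVER (ORDER BY start_ip) AS idx
  FROM my_dataset.ranges_num
)
SELECT c.ip
FROM candidates AS c
JOIN numbered AS r USING (idx)   -- idx = 0 rows drop out here
WHERE c.ip <= r.end_ip;          -- final containment check
```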

I am not an SQL expert and I am new to BigQuery, so I'll be grateful for any tips.

Danio
  • Can you provide more info on your source tables? There are a few items in your description that are unclear. Perhaps create some dummy data in a select statement or as a formatted table.
    – rtenha Jul 23 '19 at 19:02
  • @rtenha I added columns with sample data. I hope it helps. – Danio Jul 23 '19 at 21:51
  • Seems your question has been asked before... https://stackoverflow.com/questions/43992006/efficiently-joining-ip-ranges-in-bigquery – rtenha Jul 23 '19 at 22:26

2 Answers


Thanks to @rtenha, I followed the link to almost the same problem, with a solution: Efficiently joining IP ranges in BigQuery. I am attaching a more complete query, which I derived from that one; it ran for me in under 10 seconds:

(SELECT ip FROM `ips` i
JOIN `ranges` a
ON NET.IP_TRUNC(a.start_ip, 16) = NET.IP_TRUNC(NET.SAFE_IP_FROM_STRING(i.ip), 16)
WHERE NET.SAFE_IP_FROM_STRING(i.ip) BETWEEN a.start_ip AND a.end_ip 
AND mask >= 16)
UNION ALL
(
SELECT ip FROM `ips` i
JOIN `ranges`  a
ON NET.IP_TRUNC(a.start_ip, 8) = NET.IP_TRUNC(NET.SAFE_IP_FROM_STRING(i.ip), 8)
WHERE NET.SAFE_IP_FROM_STRING(i.ip) BETWEEN a.start_ip AND a.end_ip
AND mask BETWEEN 8 AND 15)
UNION ALL
(
SELECT ip FROM `ips` i
JOIN `ranges`  a
ON NET.SAFE_IP_FROM_STRING(i.ip) BETWEEN a.start_ip AND a.end_ip
AND mask < 8)

Here I split the ranges into 3 sections: those with network prefixes of 16 bits or longer, those between 8 and 15 bits, and those below 8 bits. For each section I then applied an extra prefix-equality comparison, which performs much better and effectively filters the data, so that the second comparison (BETWEEN) executes on much smaller sets and is quick. The last section has no prefix matching because it targets the shortest network prefixes. Finally, I combine all the sections with UNION ALL.
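For reference, the `ranges` table used above holds the endpoints as BYTES (which is what NET.IP_TRUNC expects) plus the prefix length as `mask`; it can be derived from the CIDR strings roughly like this (a sketch; `my_dataset` and `ranges_raw`, the original one-column table, are placeholder names):

```sql
-- Build BYTES endpoints and the numeric mask from the CIDR strings.
-- NET.IPV4_FROM_INT64 converts the computed last address back to BYTES.
CREATE TABLE my_dataset.ranges AS
SELECT
  NET.IP_FROM_STRING(SPLIT(range, '/')[OFFSET(0)]) AS start_ip,
  NET.IPV4_FROM_INT64(
    NET.IPV4_TO_INT64(NET.IP_FROM_STRING(SPLIT(range, '/')[OFFSET(0)]))
    + (1 << (32 - CAST(SPLIT(range, '/')[OFFSET(1)] AS INT64))) - 1
  ) AS end_ip,
  CAST(SPLIT(range, '/')[OFFSET(1)] AS INT64) AS mask
FROM my_dataset.ranges_raw;
```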

One of the reasons this works is that 99% of my network prefixes were 16 bits or longer. The shorter network prefixes each took longer to process, but that was compensated by the fact that there were very few of them, and further mitigated by breaking the short-network-prefix section (shorter than 16 bits) into two smaller sections. It might be possible to optimize further by breaking the data into even more fine-grained sections (e.g. 32 sections, one for each possible mask length), but I was satisfied with my results anyway.

I did not analyze whether INT64 or BYTES is the more efficient data type for this processing, but using the intermediate tables did not bring a noticeable performance improvement.

Danio

According to this article, you have ordered the tables correctly.

That said, I have sometimes experienced that BigQuery underestimates the needed computation capacity (slots), which makes the query slow or throws a resource-limit error, as in your case.

For me it helped to switch the order of the tables and put the smaller one on the left side of the join.

E.g.:

SELECT ip FROM ranges_num JOIN ip_num ON ip BETWEEN start_ip AND end_ip;
jszule