2

I have a table of IP addresses and a table of IP address ranges (start ip, end ip) that I'd like to join together. I've been able to make this work with the following query:

SELECT * FROM `ips` i
JOIN `ranges`  a
ON NET.SAFE_IP_FROM_STRING(i.ip) 
BETWEEN NET.SAFE_IP_FROM_STRING(a.start_ip) 
AND NET.SAFE_IP_FROM_STRING(a.end_ip)

The problem I'm having is that it scales really badly. To do it for 10 IPs takes around 8 seconds, 100 takes 30 seconds, and 1000 takes a few minutes. I'd like to be able to do this for tens of millions of rows. (I have tried writing the output of NET.SAFE_IP_FROM_STRING to the ranges table, but it only speeds things up by around 10%, and doesn't help with scaling).

The ranges don't overlap, so for every row in the input table I expect 0 or 1 rows in the output table. A LATERAL JOIN would let me do that and almost certainly speed things up, but I don't think BigQuery supports them. Is there any other way to make this query faster and scalable?

Ben Dowling
  • 17,187
  • 8
  • 87
  • 103

2 Answers2

7

After reviewing the article at https://cloudplatform.googleblog.com/2014/03/geoip-geolocation-with-google-bigquery.html that was linked to in Felipe's answer I was able to put something together that is incredibly fast and scales really well. As Felipe alluded to, the trick is to do a direct join on a prefix (I went with /16), and then filter with a between. I'm pre-processing the ranges to split anything larger than a /16 into multiple blocks. I then overwrite the table with this query, which adds some additional fields:

SELECT *, 
NET.SAFE_IP_FROM_STRING(start_ip) AS start_b, 
NET.SAFE_IP_FROM_STRING(end_ip) AS end_b, 
NET.IP_TRUNC(NET.SAFE_IP_FROM_STRING(start_ip), 16) as prefix

The join query then looks something like this:

SELECT * FROM `ips` i
JOIN `ranges`  a
ON a.prefix = NET.IP_TRUNC(NET.SAFE_IP_FROM_STRING(i.ip), 16)
WHERE NET.SAFE_IP_FROM_STRING(i.ip) BETWEEN a.start_b AND a.end_b

Joining 10 million IPs to 1 million ranges now takes less than 30 seconds at billing tier 1!

Ben Dowling
  • 17,187
  • 8
  • 87
  • 103
1

I did something like this on https://stackoverflow.com/a/20156581

I'll need to update my queries for #standardSQL, but the basic secret is generating a smaller JOIN area.

If you can share a sample dataset, I'll be happy to provide a new query.

Community
  • 1
  • 1
Felipe Hoffa
  • 54,922
  • 16
  • 151
  • 325