0

I'm trying to match users to their geo locations using their IPs.

IPs is a mix of ipv4, ipv6 and some rows with invalid entries.

I'm using GeoLite2-City-Blocks-IPv4 (know I will have to use a different file for ipv6, if anybody knows which one is the correct one, much appreciated) and GeoLite2-City-Locations-en. So basically, one file has the IP blocks and their location codes and the other one have the actual locations for those codes.

I used instructions from this article by Felipe Hoffa https://towardsdatascience.com/geolocation-with-bigquery-de-identify-76-million-ip-addresses-in-20-seconds-e9e652480bd2 to match my IPs with the IPs in the blocks file.

The problem is I get an error when trying to use the NET.SAFE_IP_FROM_STRING(ip_address) & NET.IP_NET_MASK(4,mask) functions. The error is:

Bitwise binary operator for BYTES requires equal length of the inputs. Got 16 bytes on the left hand side and 4 bytes on the right hand side.

So I tried to ensure IPs are valid before passing them to the function using regex. This works for ipv4. I've now figured out regex to match ipv6 as well, and from my checks, it seems accurate. However I still get the error. I can't figure out why and how to fix my query to get the correct result.

See whole query below:

(
    SELECT
      *,
      NET.SAFE_IP_FROM_STRING(ip_address) & NET.IP_NET_MASK(4,
        mask) network_bin
    FROM (
      SELECT
        * EXCEPT (is_valid)
      FROM (
        SELECT
          *,
          CASE
            WHEN (REGEXP_CONTAINS(ip_address, r'\A[a-f0-9]{1,4}(:[a-f0-9]{1,4}){7}\z') OR (NOT REGEXP_CONTAINS(ip_address, r'\A(.*?[a-f0-9](:|\z)){8}') AND REGEXP_CONTAINS(ip_address, r'\A([a-f0-9]{1,4}(:[a-f0-9]{1,4}){0,6})?::([a-f0-9]{1,4}(:[a-f0-9]{1,4}){0,6})?\z')) OR REGEXP_CONTAINS(ip_address, r'\A[a-f0-9]{1,4}(:[a-f0-9]{1,4}){5}:(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])(\.(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])){3}\z') OR (NOT REGEXP_CONTAINS(ip_address, r'\A(.*?[a-f0-9]:){6}') AND REGEXP_CONTAINS(ip_address, r'\A([a-f0-9]{1,4}(:[a-f0-9]{1,4}){0,4})?::([a-f0-9]{1,4}:){0,5}(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])(\.(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])){3}\z')) ) THEN TRUE
            WHEN REGEXP_CONTAINS(ip_address, r"^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$") THEN TRUE
          ELSE
          FALSE
        END
          AS is_valid
        FROM (
          SELECT
            user,
            ip_address,
            date
          FROM 
            `project.dataset.table`)
      WHERE
        is_valid IS TRUE),
      UNNEST(GENERATE_ARRAY(9,32)) mask)
evam
  • 45
  • 1
  • 6

2 Answers2

0

Question was answered here: https://stackoverflow.com/a/65403033/12675934

In summary: The issue is that NET.SAFE_IP_FROM_STRING returns 4 bytes for IPv4 and 16 bytes for IPv6. So & NET.IP_NET_MASK(4, mask) is ok for IPv4, but for IPv6 you need to use & NET.IP_NET_MASK(16, mask).

evam
  • 45
  • 1
  • 6
0

There is same answer as for this question How to improve performance of GeoIP query in BigQuery?

WITH test_data AS (
    SELECT '2a02:2f0c:570c:fe00:1db7:21c4:21fa:f89' AS ip UNION ALL 
    SELECT '79.114.150.111' AS ip
)
-- replace the input_data with your data
, ipv4 AS (
    SELECT DISTINCT ip, NET.SAFE_IP_FROM_STRING(ip) AS ip_bytes
    FROM test_data 
    WHERE BYTE_LENGTH(NET.SAFE_IP_FROM_STRING(ip)) = 4
), ipv4d AS (
    SELECT ip, city_name, country_name, latitude, longitude
    FROM (
        SELECT ip, ip_bytes & NET.IP_NET_MASK(4, mask) network_bin, mask
        FROM ipv4, UNNEST(GENERATE_ARRAY(8,32)) mask
    )
    JOIN `demo_bq_dataset.geoip_city_v4`
    USING (network_bin, mask)
), ipv6 AS (
    SELECT DISTINCT ip, NET.SAFE_IP_FROM_STRING(ip) AS ip_bytes
    FROM test_data 
    WHERE BYTE_LENGTH(NET.SAFE_IP_FROM_STRING(ip)) = 16
), ipv6d AS (
    SELECT ip, city_name, country_name, latitude, longitude
    FROM (
        SELECT  ip, ip_bytes & NET.IP_NET_MASK(16, mask) network_bin, mask
        FROM ipv6, UNNEST(GENERATE_ARRAY(19,64)) mask
    )
    JOIN `demo_bq_dataset.geoip_city_v6`  
    USING (network_bin, mask)
)
SELECT * FROM ipv4d
UNION ALL 
SELECT * FROM ipv6d

In order to get the geoip_city_v4 and geoip_city_v6 you need to download the geoip database from https://maxmind.com/

You can follow this tutorial in to update and prepare you dataset https://hodo.dev/posts/post37-gcp-bigquery-geoip/.