0

I'm trying to match users to their geo locations using their IPs.

IPs is a mix of ipv4, ipv6 and some rows with invalid entries.

I used instructions from this article by Felipe Hoffa https://towardsdatascience.com/geolocation-with-bigquery-de-identify-76-million-ip-addresses-in-20-seconds-e9e652480bd2 to match my IPs with the IPs in the blocks file.

The problem is I get an error when trying to use the NET.SAFE_IP_FROM_STRING(ip_address) & NET.IP_NET_MASK(4,mask) functions. The error is:

Bitwise binary operator for BYTES requires equal length of the inputs. Got 16 bytes on the left hand side and 4 bytes on the right hand side.

So I tried to ensure IPs are valid before passing them to the function using regex. This works for ipv4. I've now figured out regex to match ipv6 as well, and from my checks, it seems accurate. However I still get the error. I can't figure out why and how to fix my query to get the correct result.

See whole query below:

(
    SELECT
      *,
      NET.SAFE_IP_FROM_STRING(ip_address) & NET.IP_NET_MASK(4,
        mask) network_bin
    FROM (
      SELECT
        * EXCEPT (is_valid)
      FROM (
        SELECT
          *,
          CASE
            WHEN (REGEXP_CONTAINS(ip_address, r'\A[a-f0-9]{1,4}(:[a-f0-9]{1,4}){7}\z') OR (NOT REGEXP_CONTAINS(ip_address, r'\A(.*?[a-f0-9](:|\z)){8}') AND REGEXP_CONTAINS(ip_address, r'\A([a-f0-9]{1,4}(:[a-f0-9]{1,4}){0,6})?::([a-f0-9]{1,4}(:[a-f0-9]{1,4}){0,6})?\z')) OR REGEXP_CONTAINS(ip_address, r'\A[a-f0-9]{1,4}(:[a-f0-9]{1,4}){5}:(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])(\.(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])){3}\z') OR (NOT REGEXP_CONTAINS(ip_address, r'\A(.*?[a-f0-9]:){6}') AND REGEXP_CONTAINS(ip_address, r'\A([a-f0-9]{1,4}(:[a-f0-9]{1,4}){0,4})?::([a-f0-9]{1,4}:){0,5}(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])(\.(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])){3}\z')) ) THEN TRUE
            WHEN REGEXP_CONTAINS(ip_address, r"^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$") THEN TRUE
          ELSE
          FALSE
        END
          AS is_valid
        FROM (
          SELECT
            user,
            ip_address,
            date
          FROM 
            `project.dataset.table`)
      WHERE
        is_valid IS TRUE),
      UNNEST(GENERATE_ARRAY(9,32)) mask)
evam
  • 45
  • 1
  • 6

1 Answers1

1

probably the issue is that NET.SAFE_IP_FROM_STRING returns 4 bytes for IPv4 and 16 bytes for IPv6. So & NET.IP_NET_MASK(4, mask) is ok for IPv4, but for IPv6 you need to use & NET.IP_NET_MASK(16, mask). One of the options would be to introduce something like is_ipv4 and use it like:

IF(
  is_ipv4,
  NET.SAFE_IP_FROM_STRING(ip_address) & NET.IP_NET_MASK(4, mask),
  NET.SAFE_IP_FROM_STRING(ip_address) & NET.IP_NET_MASK(16, mask)
) network_bin
Sergey Geron
  • 9,098
  • 2
  • 22
  • 29
  • Based on this doc https://cloud.google.com/bigquery/docs/reference/standard-sql/net_functions#netip_from_string I replace is_ipv4 with `ip_address LIKE '%.%'`. Thank you for the answer, works for me. – khusnanadia Sep 08 '21 at 05:01