I'm trying to match users to their geo locations using their IPs.
IPs is a mix of ipv4, ipv6 and some rows with invalid entries.
I'm using GeoLite2-City-Blocks-IPv4 (know I will have to use a different file for ipv6, if anybody knows which one is the correct one, much appreciated) and GeoLite2-City-Locations-en. So basically, one file has the IP blocks and their location codes and the other one have the actual locations for those codes.
I used instructions from this article by Felipe Hoffa https://towardsdatascience.com/geolocation-with-bigquery-de-identify-76-million-ip-addresses-in-20-seconds-e9e652480bd2 to match my IPs with the IPs in the blocks file.
The problem is I get an error when trying to use the NET.SAFE_IP_FROM_STRING(ip_address) & NET.IP_NET_MASK(4,mask) functions. The error is:
Bitwise binary operator for BYTES requires equal length of the inputs. Got 16 bytes on the left hand side and 4 bytes on the right hand side.
So I tried to ensure IPs are valid before passing them to the function using regex. This works for ipv4. I've now figured out regex to match ipv6 as well, and from my checks, it seems accurate. However I still get the error. I can't figure out why and how to fix my query to get the correct result.
See whole query below:
(
SELECT
*,
NET.SAFE_IP_FROM_STRING(ip_address) & NET.IP_NET_MASK(4,
mask) network_bin
FROM (
SELECT
* EXCEPT (is_valid)
FROM (
SELECT
*,
CASE
WHEN (REGEXP_CONTAINS(ip_address, r'\A[a-f0-9]{1,4}(:[a-f0-9]{1,4}){7}\z') OR (NOT REGEXP_CONTAINS(ip_address, r'\A(.*?[a-f0-9](:|\z)){8}') AND REGEXP_CONTAINS(ip_address, r'\A([a-f0-9]{1,4}(:[a-f0-9]{1,4}){0,6})?::([a-f0-9]{1,4}(:[a-f0-9]{1,4}){0,6})?\z')) OR REGEXP_CONTAINS(ip_address, r'\A[a-f0-9]{1,4}(:[a-f0-9]{1,4}){5}:(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])(\.(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])){3}\z') OR (NOT REGEXP_CONTAINS(ip_address, r'\A(.*?[a-f0-9]:){6}') AND REGEXP_CONTAINS(ip_address, r'\A([a-f0-9]{1,4}(:[a-f0-9]{1,4}){0,4})?::([a-f0-9]{1,4}:){0,5}(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])(\.(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])){3}\z')) ) THEN TRUE
WHEN REGEXP_CONTAINS(ip_address, r"^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$") THEN TRUE
ELSE
FALSE
END
AS is_valid
FROM (
SELECT
user,
ip_address,
date
FROM
`project.dataset.table`)
WHERE
is_valid IS TRUE),
UNNEST(GENERATE_ARRAY(9,32)) mask)