3

So I have read https://cloudplatform.googleblog.com/2014/03/geoip-geolocation-with-google-bigquery.html

But I was wondering if there was a #standardSQL way of doing it. So far, I have a lot of challenge converting PARSE_IP and NTH() since the suggested changes in the migration docs have limitations.

Going from PARSE_IP(contributor_ip) to NET.IPV4_TO_INT64(NET.SAFE_IP_FROM_STRING(contributor_ip)) does not work for IPv6 IP addresses.

Going from NTH(1, latitude) lat to latitude[SAFE_ORDINAL(1)] does not work since latitude is considered a string.

And there might be more migration problems that I have yet to encounter. Does anyone know how to transform IP addresses into geolocation in BigQuery standard SQL?

P.S. How would I go from geolocation to determining timezone?

edit: So what is the difference between this

#legacySQL
SELECT
  COUNT(*) c,
  city,
  countryLabel,
  NTH(1, latitude) lat,
  NTH(1, longitude) lng
FROM (
  SELECT
    INTEGER(PARSE_IP(contributor_ip)) AS clientIpNum,
    INTEGER(PARSE_IP(contributor_ip)/(256*256)) AS classB
  FROM
    [publicdata:samples.wikipedia]
  WHERE
    contributor_ip IS NOT NULL ) AS a
JOIN EACH
  [fh-bigquery:geocode.geolite_city_bq_b2b] AS b
ON
  a.classB = b.classB
WHERE
  a.clientIpNum BETWEEN b.startIpNum
  AND b.endIpNum
  AND city != ''
GROUP BY
  city,
  countryLabel
ORDER BY
  1 DESC

and

SELECT
  COUNT(*) c,
  city,
  countryLabel,
  ANY_VALUE(latitude) lat,
  ANY_VALUE(longitude) lng
FROM (
  SELECT
    CASE
      WHEN BYTE_LENGTH(contributor_ip) < 16 THEN SAFE_CAST(NET.IPV4_TO_INT64(NET.SAFE_IP_FROM_STRING(contributor_ip)) AS INT64)
      ELSE NULL
    END AS clientIpNum,
    CASE
      WHEN BYTE_LENGTH(contributor_ip) < 16 THEN SAFE_CAST(NET.IPV4_TO_INT64(NET.SAFE_IP_FROM_STRING(contributor_ip)) / (256*256) AS INT64) 
      ELSE NULL
    END AS classB
  FROM
    `publicdata.samples.wikipedia`
  WHERE
    contributor_ip IS NOT NULL ) AS a
JOIN
  `fh-bigquery.geocode.geolite_city_bq_b2b` AS b
ON
  a.classB = b.classB
WHERE
  a.clientIpNum BETWEEN b.startIpNum
  AND b.endIpNum
  AND city != ''
GROUP BY
  city,
  countryLabel
ORDER BY
  1 DESC

edit2: Seems like I manage to figure out the problem via not casting a float correctly. Right now, the standard SQL returns 41815 rows instead the 56347 rows from the legacy SQL which may be due to the lack of conversion from IPv6 to int for standard SQL, but it might be due to something else. Also the legacy SQL query performs much better, running at about 10 seconds instead of the full minute from the standard SQL.

Flair
  • 2,609
  • 1
  • 29
  • 41
  • `NTH` is the analogue of `arr[OFFSET(0)]`. If you don't have an array, then you don't need to use `ORDINAL` or `OFFSET`. You could filter IPs using `LENGTH(...)` if you only want IPv4. – Elliott Brossard Sep 05 '17 at 21:38
  • `Error: Element access using [] is not supported on values of type STRING` – Flair Sep 05 '17 at 21:50
  • That's what I'm saying--you have a string and not an array, so you don't need to use that. – Elliott Brossard Sep 05 '17 at 21:53
  • Then what would I do? – Flair Sep 05 '17 at 22:02
  • @ElliottBrossard If I simply use latitude, then I get Error: SELECT list expression references column latitude which is neither grouped nor aggregated – Flair Sep 05 '17 at 22:16
  • 1
    You can add it to the group by list or use `ANY_VALUE`. – Elliott Brossard Sep 05 '17 at 23:08
  • from what I understand the role of NTH in legacy version was to provide sort of aggregation function, similar to MIN or MAX - which I believe you can try instead of NTH and you will get exactly same output! So you can use MAX, MIN or ANY_VALUE in your standard SQL version – Mikhail Berlyant Sep 05 '17 at 23:13
  • 1
    For the time zone question, see https://stackoverflow.com/questions/16086962/how-to-get-a-time-zone-from-a-location-using-latitude-and-longitude-coordinates – Matt Johnson-Pint Sep 06 '17 at 01:09
  • @Flair for completeness it is recommended that you post the solution that you used as the answer to this question to better help the community. – Jordan Sep 20 '17 at 17:06
  • FYI, I have an upcoming improved solution to this problem. Please ping me if I don't update this answer soon :) – Felipe Hoffa Jan 24 '19 at 21:59

2 Answers2

2

According to https://gist.github.com/matsukaz/a145c2553a0faa59e32ad7c25e6a92f7

#standardSQL
SELECT
  id,
  IFNULL(city, 'Other') AS city,
  IFNULL(countryLabel, 'Other') AS countryLabel,
  latitude,
  longitude
FROM (
  SELECT
    id,
    NET.IPV4_TO_INT64(NET.IP_FROM_STRING(ip)) AS clientIpNum,
    TRUNC(NET.IPV4_TO_INT64(NET.IP_FROM_STRING(ip))/(256*256)) AS classB
  FROM
    `<project>.<dataset>.log` ) AS a
LEFT OUTER JOIN
  `fh-bigquery.geocode.geolite_city_bq_b2b` AS b
ON
  a.classB = b.classB
  AND a.clientIpNum BETWEEN b.startIpNum AND b.endIpNum
ORDER BY
  id ASC
Ramtin M. Seraj
  • 686
  • 7
  • 17
0

The answer to this question is not valid for ipv6 addresses.

Following the approach described here https://medium.com/@hoffa/geolocation-with-bigquery-de-identify-76-million-ip-addresses-in-20-seconds-e9e652480bd2 I came up with this solution:

WITH test_data AS (
    SELECT '2a02:2f0c:570c:fe00:1db7:21c4:21fa:f89' AS ip UNION ALL 
    SELECT '79.114.150.111' AS ip
)
-- replace the input_data with your data
, ipv4 AS (
    SELECT DISTINCT ip, NET.SAFE_IP_FROM_STRING(ip) AS ip_bytes
    FROM test_data 
    WHERE BYTE_LENGTH(NET.SAFE_IP_FROM_STRING(ip)) = 4
), ipv4d AS (
    SELECT ip, city_name, country_name, latitude, longitude
    FROM (
        SELECT ip, ip_bytes & NET.IP_NET_MASK(4, mask) network_bin, mask
        FROM ipv4, UNNEST(GENERATE_ARRAY(8,32)) mask
    )
    JOIN `demo_bq_dataset.geoip_city_v4`
    USING (network_bin, mask)
), ipv6 AS (
    SELECT DISTINCT ip, NET.SAFE_IP_FROM_STRING(ip) AS ip_bytes
    FROM test_data 
    WHERE BYTE_LENGTH(NET.SAFE_IP_FROM_STRING(ip)) = 16
), ipv6d AS (
    SELECT ip, city_name, country_name, latitude, longitude
    FROM (
        SELECT  ip, ip_bytes & NET.IP_NET_MASK(16, mask) network_bin, mask
        FROM ipv6, UNNEST(GENERATE_ARRAY(19,64)) mask
    )
    JOIN `demo_bq_dataset.geoip_city_v6`  
    USING (network_bin, mask)
)
SELECT * FROM ipv4d
UNION ALL 
SELECT * FROM ipv6d

In order to get the geoip_city_v4 and geoip_city_v6 you need to download the geoip database from https://maxmind.com/

You can follow this tutorial in to update and prepare you dataset hodo.dev/posts/post-37-gcp-bigquery-geoip.