2

I have a big .csv dataset containing 10e7 points with coordinates (latitude, longitude) representing locations of visitors. I have another dataset containing 10e3 points with coordinates representing locations of stores.

I want to associate to each visitor the closest store, using some kind of geodesic formula.

I want something really fast and efficient, which I can run on python (pandas for instance) or Google BigQuery.

Can someone give me a clue?

sweeeeeet
  • 1,769
  • 4
  • 26
  • 50

2 Answers2

4

To add to Felipe answer:

You can use SQL UDF vs JS UDF
JS UDF have some Limits that SQL UDF do not

So equivalent SQL UDF you can use with the rest of Felipe's code is

CREATE TEMPORARY FUNCTION distance(lat1 FLOAT64, lon1 FLOAT64, lat2 FLOAT64, lon2 FLOAT64)
RETURNS FLOAT64 AS ((
WITH constants AS (
  SELECT 0.017453292519943295 AS p
) 
SELECT 12742 * ASIN(SQRT(
  0.5 - COS((lat2 - lat1) * p)/2 + 
  COS(lat1 * p) * COS(lat2 * p) * 
  (1 - COS((lon2 - lon1) * p))/2))
FROM constants
));

I tried to preserve layout of respective JS UDF as much as possible so you can see how it is created

Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230
3

This is a quick solution that finds the closest NOAA weather station for 21,221 cities in DBpedia (v2014).

#standardSQL

CREATE TEMPORARY FUNCTION distance(lat1 FLOAT64, lon1 FLOAT64, lat2 FLOAT64, lon2 FLOAT64)
RETURNS FLOAT64
LANGUAGE js AS """

  var p = 0.017453292519943295;    // Math.PI / 180
  var c = Math.cos;
  var a = 0.5 - c((lat2 - lat1) * p)/2 + 
          c(lat1 * p) * c(lat2 * p) * 
          (1 - c((lon2 - lon1) * p))/2;

  return 12742 * Math.asin(Math.sqrt(a)); // 2 * R; R = 6371 km

""";

SELECT *
FROM (
  SELECT city, country_label, distance, name weather_station, country, 
    RANK() OVER(PARTITION BY city ORDER BY distance DESC) rank
  FROM (
    SELECT city, a.country_label, distance(a.lat,a.lon,b.lat,b.lon) distance, b.name, b.country
    FROM (
      SELECT rdf_schema_label city, country_label, country,
        CAST(REGEXP_EXTRACT(point, r'(-?\d*\.\d*)') as FLOAT64) lat, 
        CAST(REGEXP_EXTRACT(point, r' (-?\d*\.\d*)') as FLOAT64) lon 
      FROM `fh-bigquery.dbpedia2014temp.City`
      WHERE point!='NULL'
    ) a
    JOIN (
      SELECT name, country, usaf, wban, lat, lon
      FROM `bigquery-public-data.noaa_gsod.stations`
      WHERE lat != 0.0 AND lon !=0.0
    ) b
    ON CAST(a.lat as INT64)=CAST(b.lat as INT64)
    AND CAST(a.lon as INT64)=CAST(b.lon as INT64)
  )
)
WHERE rank=1

Caveats:

  • It uses the distance formula from https://stackoverflow.com/a/22476600/132438
  • Optimizes by restricting JOINs by only searching for stations in the same INT(lat),INT(lon) than the city. It's possible to improve this, but I'll leave that for another question.

enter image description here

Community
  • 1
  • 1
Felipe Hoffa
  • 54,922
  • 16
  • 151
  • 325