Google Big Query: How to select subset of smaller table using nested fake join?

Question

I would like to solve the problem related to How can I join two tables using intervals in Google Big Query? by selecting subset of of smaller table. I wanted to use solution by @FelipeHoffa using row_number function Row number in BigQuery? I have created nested query as follows:

SELECT a.DevID DeviceId,
       a.device_make OS
FROM
  (SELECT device_id DevID, device_make, A, lat, long, is_gps
   FROM [Data.PlacesMaster] WHERE not device_id is null and is_gps is true) a JOIN (select ROW_NUMBER() OVER() row_number,top_left_lat, top_left_long, bottom_right_lat, bottom_right_long, A, count from (SELECT top_left_lat, top_left_long, bottom_right_lat,bottom_right_long, A, COUNT(*) count from  [Karol.fast_food_box]
   GROUP BY (....?)
   ORDER BY COUNT DESC,
   WHERE row_number BETWEEN 1000 AND 2000)) b ON a.A=b.A
WHERE (a.lat BETWEEN b.bottom_right_lat AND b.top_left_lat)
  AND (a.long BETWEEN b.top_left_long AND b.bottom_right_long)
GROUP EACH BY DeviceId,
              OS

Could you help in finalising it please? I cannot break the smaller table by "group by", i need to have consistency between two tables and select only items with lat,long from MASTER.table that fit into the given bounding box of a smaller table. I need to match lat,long into box really, my solution form How can I join two tables using intervals in Google Big Query? works only for small tables (approx 1000 to 2000 rows), hence this issue. Thank you in advance.

score 1 · Answer 1 · answered Apr 03 '15 at 17:36

It looks like you're applying two approaches at once: 1) split a table into chunks of rows, and run on each, and 2) include a field, "A", tagging your boxes and your points into 'regions', that you can equi-join on. Approach (1) just does the same total work in more pieces (also, it's adding complication), so I would suggest focusing on approach (2), which cuts the work down to be ~quadratic in each 'region' rather than quadratic in the size of the whole world.

So the key thing is what values your A takes on, and how many points and boxes carry each A value. For example, if A is a country code, that has the right logical structure, but it probably doesn't help enough once you get a lot of data in any one country. If it goes to the state or province code, that gets you one step farther. Quantized lat/long grid cells generalize better. Sooner or later you do have to deal with falling across a region edge, which can be somewhat tricky. I would use a lat/long grid myself.

What A values are you using? In your data, what is the A value with the maximum (number of points * number of boxes)?

Master.Table is for one country only, 47 billion rows and 6 cols, it contains signal records, with geo-point. Now, smaller table is only a geo-box. Cross join generates too many intermediate values (6 billion x 1k = 6 trillion), hence, if smaller table is more than 1k rows, neither can I use fake join on String A as "a" on both tables (no results, query keeps running for several hours), nor cross join (blocked by Google). Working Solution is 1) break manually small table (geobox) into smaller, 2) join both on fake to overcome banned cross join. — elcomendante, Apr 04 '15 at 19:07
so I i want to find record that are related to area A, i use 1 x row (bounding-box BB) or built in BB function, but if I have 30k areas, it is problematic. Solution 2) is a must, sollution 1) is to produce less intermediate values, — elcomendante, Apr 04 '15 at 19:32

Google Big Query: How to select subset of smaller table using nested fake join?

1 Answers1