1

You have identified a solution of finding an area within a bounding box /circle using cross join as below:

SELECT A.ID, C.Car 
FROM Cars C 
CROSS JOIN Areas A
WHERE C.Latitude BETWEEN A.LatitudeMin AND A.LatitudeMax AND
  C.Longitude BETWEEN A.LongitudeMin AND A.LongitudeMax

at: How to cross join in Big Query using intervals?

however, using cross join for large data sets is blocked by GBQ ops team due to constrains on the infrastructure.
Hence, my question: how could I find set of lat,longs within large data table (table A) that are within another set of bounding boxes , small(table B) ?

My query as below has been blocked:

select a.a1, a.a2 , a.mdl, b.name, count(1) count 
from TableMaster a 
CROSS JOIN places_locations b 
where (a.lat 
    BETWEEN  b.bottom_right_lat AND b.top_left_lat) 
AND (a.long 
    BETWEEN b.top_left_long AND b.bottom_right_long) 
group by ....

TableMaster is 538 GB with 6,658,716,712 rows (cleaned/absolute minimum) places_locations varies per query around 5 to 100kb.

I have tried to adapt fake join based on a template: How to improve performance of GeoIP query in BigQuery?

however, query takes an hour and does not produce any results nor any errors are displayed.

Could you identify a possible path to solve this puzzle at all?

Community
  • 1
  • 1
elcomendante
  • 1,113
  • 1
  • 11
  • 28
  • Do you have a sample of the data? I'm specially interested places_locations areas, to understand the topology of the problem. – Felipe Hoffa Mar 13 '15 at 11:27
  • sorry to trouble Jordi, it works now, last week it did not, perhaps it was due to the partial block release, thank you anyway – elcomendante Mar 13 '15 at 12:26

2 Answers2

2

The problem you're seeing is that the cross join generates too many intermediate values (6 billion x 1k = 6 trillion).

The way to work around this is to generate fewer outputs. If you have additional filters you can apply, you should try applying them before you do the join. If you could do the group by (or part of it) before the join, that would also help.

Moreover, for doing the lookup, you could do a more coarse-grained lookup first. That is, if you could do an initial cross join with a smaller table that has course grained regions, then you could join against the larger table on region id rather than doing a cross join.

Jordan Tigani
  • 26,089
  • 4
  • 60
  • 63
  • Hi Jordan, thank you for the answer, I am still having issues to execute this query over GBQ, – elcomendante Apr 01 '15 at 20:25
  • The topology of the problem is that I need to extract from large table device ids and model on the condition that their latitude and longitude fit into bounding box of a smaller table. So, I was able to execute it using cross , then it was blocked by google due to resources over-usage, then I went around the problem using fake join, now the issue is that when smaller table is > 1000 (approx) , then query runs for few days and does not produce results. – elcomendante Apr 01 '15 at 20:47
  • The solution I am working now is to break smaller table into smaller tables with 1000 rows each, and here is another problem. I am trying to implement @FelipeHofa solution [http://stackoverflow.com/questions/11057219/row-number-in-bigquery?lq=1] but again no results, query keeps running for several hours – elcomendante Apr 01 '15 at 20:48
  • here is an example code: 'select a.DevID DeviceId, a.device_mk OS from (SELECT device_id DevID, device_mk, A, lat, long, is_gps from [GB_Data.PlacesMasterAMarch] .... – elcomendante Apr 01 '15 at 20:49
  • .... WHERE not device_id is null and is_gps is true) a JOIN ( select top_left_lat, top_left_long, bottom_right_lat, bottom_right_long, A, ROW_NUMBER() OVER(ORDER BY A) from ( SELECT top_left_lat, top_left_long, bottom_right_lat, bottom_right_long, A from [Karol.fast_food_box] where rn < 10 order by A )) b on a.A=b.A where (a.lat BETWEEN b.bottom_right_lat AND b.top_left_lat) AND (a.long BETWEEN b.top_left_long AND b.bottom_right_long) group each by DeviceId, OS;' – elcomendante Apr 01 '15 at 20:50
  • Could you help to overcome this issue please? I find the only solution to select chunks of rows form smaller table, but cannot execute it at all. Thank you in advance. – elcomendante Apr 01 '15 at 20:50
  • 'select a.DevID DeviceId, a.device_mk OS from (SELECT device_id DevID, device_mk, A, lat, long, is_gps from [GB_Data.PlacesMasterAMarch] WHERE not device_id is null and is_gps is true) a JOIN ( select top_left_lat, top_left_long, bottom_right_lat, bottom_right_long, A, ROW_NUMBER() OVER(ORDER BY A) from ( SELECT top_left_lat, top_left_long, bottom_right_lat, bottom_right_long, A from [Karol.fast_food_box] where rn < 10 order by A )) b on a.A=b.A where (a.lat BETWEEN b.bottom_right_lat AND b.top_left_lat) AND (a.long BETWEEN b.top_left_long AND b.bottom_right_long) group e by DeviceId, OS' – elcomendante Apr 01 '15 at 20:56
1

okey so fake join does work at the end, solution:

` select a.B, a.C , count(1) count from ( SELECT B,  C, A, lat, long from [GB_Data.PlacesMasterA] WHERE not B

is null) a JOIN (SELECT top_left_lat, top_left_long, bottom_right_lat, bottom_right_long, A from [Places.placeABOXA] ) b on a.A=b.A where (a.lat BETWEEN b.bottom_right_lat AND b.top_left_lat) AND (a.long BETWEEN b.top_left_long AND b.bottom_right_long) group each by B, C `

elcomendante
  • 1,113
  • 1
  • 11
  • 28