14

I'm trying to run a query joining two large sets of data, and I'm hitting the "resources exceeded during query execution" error. I've read that there are workarounds when using JOIN EACH and GROUP EACH, but not what those workarounds would be.

SELECT
  YEAR(users.firstseen) AS first_year,
  MONTH(users.firstseen) AS first_month,
  DATEDIFF(orders.timestamp, users.firstseen) AS days_elapsed,
  COUNT(orders.user_key) AS count_orders
FROM
  [project.orders] AS orders
JOIN EACH
  [project.users] AS users
ON
  orders.user_key = users.user_key
WHERE orders.store = 'ios'
GROUP EACH BY 1, 2, 3

Edit: the following worked:

SELECT
  YEAR(users.firstseen) AS firstyear,
  MONTH(users.firstseen) AS firstmonth,
  DATEDIFF(orders.timestamp, users.firstseen) AS days_elapsed,
  COUNT(users.firstseen) AS count_orders
FROM [project.orders] AS orders
JOIN EACH (
  SELECT user_key, firstseen
  FROM [project.users]
  WHERE store_key = 'ios'
) AS users
ON orders.user_key = users.user_key
GROUP BY firstyear, firstmonth, days_elapsed
ORDER BY firstyear, firstmonth, days_elapsed
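
The fix appears to work because the subquery filters [project.users] before the join, so each JOIN EACH worker sees a smaller right-hand input. A minimal sketch of the same pattern with both inputs pre-filtered (assuming orders.store exists as in the original query; adjust column names to the real schema):

-- Sketch only: pre-filter both join inputs in subqueries so less data
-- reaches each JOIN EACH worker. Column names follow the queries above.
SELECT
  YEAR(users.firstseen) AS firstyear,
  MONTH(users.firstseen) AS firstmonth,
  DATEDIFF(orders.timestamp, users.firstseen) AS days_elapsed,
  COUNT(users.firstseen) AS count_orders
FROM (
  SELECT user_key, timestamp
  FROM [project.orders]
  WHERE store = 'ios'
) AS orders
JOIN EACH (
  SELECT user_key, firstseen
  FROM [project.users]
  WHERE store_key = 'ios'
) AS users
ON orders.user_key = users.user_key
GROUP BY firstyear, firstmonth, days_elapsed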
user2388120

1 Answer

13

JOIN EACH can fail if your join keys (in this case, user_key) are unevenly distributed. For example, if one user_key appears abnormally often, the node that handles that key will hit a "resources exceeded" error. Alternatively, you could try running the query over a smaller set of user keys by filtering some of them out before the join.
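
One quick way to check for skew (a sketch in legacy BigQuery SQL, not part of the original answer) is to list the most frequent join keys with TOP, which is shorthand for GROUP BY plus ORDER BY COUNT(*) DESC:

-- Sketch: surface the hottest join keys on the larger table.
-- If one user_key dominates, exclude it or handle it separately.
SELECT TOP(user_key, 10) AS user_key, COUNT(*) AS n
FROM [project.orders]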

GROUP EACH BY can fail if you have too many distinct group keys. You could try whittling down the join output by adding a few more WHERE clauses in order to see if this is the case.
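
For example (a sketch, assuming firstseen is a TIMESTAMP column; the date literals are illustrative), restricting the cohort to a single first-seen month bounds the number of distinct (firstyear, firstmonth, days_elapsed) groups:

-- Sketch: whittle down the group-key cardinality with an extra filter.
SELECT
  YEAR(users.firstseen) AS firstyear,
  MONTH(users.firstseen) AS firstmonth,
  DATEDIFF(orders.timestamp, users.firstseen) AS days_elapsed,
  COUNT(users.firstseen) AS count_orders
FROM [project.orders] AS orders
JOIN EACH (
  SELECT user_key, firstseen
  FROM [project.users]
  WHERE store_key = 'ios'
    AND firstseen >= TIMESTAMP('2013-01-01')
    AND firstseen < TIMESTAMP('2013-02-01')
) AS users
ON orders.user_key = users.user_key
GROUP EACH BY firstyear, firstmonth, days_elapsed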

Basically, I'd recommend whittling down the inputs to either the JOIN EACH or the GROUP EACH BY until you get the query to work, and then you'll have a better sense for the limits you're running up against. Once you know that, you can (hopefully) structure your queries to get the most out of the available resources.
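
One systematic way to do that whittling (again a sketch, not from the original answer): run the join over a deterministic slice of the key space using HASH, then widen the slice until the error reappears. Because HASH is deterministic, both sides keep the same keys, so the join output is complete for the sampled slice:

-- Sketch: test the join on roughly 10% of user keys; change % 10 to
-- % 4, % 2, etc. to widen the slice until you find the limit.
SELECT COUNT(*) AS joined_rows
FROM (
  SELECT user_key, timestamp
  FROM [project.orders]
  WHERE ABS(HASH(user_key)) % 10 = 0
) AS orders
JOIN EACH (
  SELECT user_key, firstseen
  FROM [project.users]
  WHERE ABS(HASH(user_key)) % 10 = 0
) AS users
ON orders.user_key = users.user_key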

(BTW, we expect to tune these operations in the near future to remove some of the limits you may be hitting!)

Jeremy Condit
  • It looks like having a JOIN EACH that yields 0 matches also triggers this error (tested on a join 2 million left, 180 million right). – Mo'in Creemers Mar 07 '14 at 09:28
  • A JOIN EACH with zero matches is unlikely to be the culprit on its own. More likely, there is some other clause in your query either before or after the JOIN EACH that is exceeding the resources available for a given worker node. If you send a job ID to jcondit@google.com, I can take a closer look. – Jeremy Condit Mar 07 '14 at 17:02
  • @JeremyCondit Do you have an update on the constraints? – Nils Ziehn Sep 09 '15 at 15:56
  • 2
    In progress! GROUP BY is already updated, and will likely do a better job of dynamically partitioning your data than GROUP EACH BY. We're still working out a few kinks with JOIN, but it'll be updated soon as well. – Jeremy Condit Sep 09 '15 at 16:28
  • 1
    Hi, my order by query also failed `SELECT * FROM [kaggle_bank_raw.cleaned_train] order by ncodpers` could you take a look? Thanks! – Charles Chow Nov 02 '16 at 06:33