1

I keeping getting resources exceeded for the following query. I've tried running in batch mode and from the command line and nothing seems to be working. Anyone have ideas?

SELECT num, extid, amount, note, balance FROM ( SELECT row_number() over(partition by extid order by stamp) as num , extid, stamp, ds, amount, note, balance FROM monte.ledger2_trailing_21d WHERE ds >= '2015-02-09' ORDER BY extid, stamp ) WHERE num <= 10 limit 300

PData
  • 83
  • 5

1 Answers1

4

This is a deceptively expensive query; timeseries analysis is always hard in SQL-like environments. The PARTITION BY clause you have written requires all of the data for a single extid to be present in memory on a single machine, which is overloading it and causing your resources exceeded error.

You can mitigate this RAM requirement by having a ROWS clause to limit the scope of your partition. Here is an example:

SELECT extid, stamp, ds, amount, note, balance
FROM (
  SELECT
    extid, stamp, ds, amount, note, balance,
    MAX(tenth_stamp) OVER(PARTITION BY extid) AS target_stamp
  FROM (
    SELECT extid, stamp, ds, amount, note, balance,
      MIN(stamp) OVER (
        PARTITION BY extid
        ORDER BY stamp DESC
        ROWS BETWEEN CURRENT ROW AND 9 FOLLOWING
      ) AS tenth_stamp
    FROM
      [monte.ledger2_trailing_21d]
    WHERE ds >= '2015-02-09'
  )
)
WHERE stamp >= target_stamp
ORDER BY extid, stamp
LIMIT 300

The inner-most sub-select extracts your data and a field tenth_stamp that holds the least stamp of the 10 rows examined. Using MIN() makes this work even when there are fewer than 10 rows for any given extid.

The middle sub-select finds the largest tenth_stamp for each extid. This is the tenth total stamp for that extid. The outer SELECT can then restrict the result to only rows with a stamp within the ten most recent stamp for their respective extid, giving you the desired result.

When executed, this takes a total of 4 stages. It will not run fast, but never requires large amounts of data in a single location. Hope that helps!

Community
  • 1
  • 1
Matthew Wesley
  • 616
  • 3
  • 5
  • Thanks for pointing out the Rows clause. I still needed to do row_number in an outer query in order to analyze the order in which users interact with events, but it runs without a problem now. In addition, could you please edit out my account info from your answer. I'd prefer to not have the full file path listed in a public forum. – PData Feb 14 '15 at 00:48
  • FYI Matthew is a BigQuery engineer with access to the query logs, in case you were wondering how he found your project ID (and inadvertently added it to the response). – Jordan Tigani Feb 18 '15 at 16:14