0

I'm running a fairly complex set of nested aggregations in BQ and have run into the following issue:

If I ask for ROW_NUMBER() at the highest level of the query, BQ seems not to recognize my column names. If I ask for ROW_NUMBER() at the innermost level of the query, I magically get a row number.

Here's a (simplified) version of my query:

SELECT
  ROW_NUMBER() OVER() row_id, // query fails with ROW_NUMBER() here
  price,
  store,
  week,
  f.product product,
  COUNT(DISTINCT f.product) OVER (PARTITION BY store, week) product_count,
  price_max-price_min weekly_price_range,
FROM (
  SELECT
    MIN(price) OVER (PARTITION BY store, week) price_min,
    MAX(price) OVER (PARTITION BY store, week) price_max,
    *
  FROM (
    SELECT
      dollars / units price,
      *
    FROM store_sales_facts f
    JOIN product_facts p
    ON p.product = f.product 
  )
)

If I remove the request for ROW_NUMBER(), the query succeeds. If I include the ROW_NUMBER() function, BQ fails and says "Error: field 'price' not found"

I'm using the following as a workaround, but I'm curious if anyone can explain why this works while the above doesn't?

SELECT
  row_id,
  price,
  store,
  week,
  f.product product,
  COUNT(DISTINCT f.product) OVER (PARTITION BY store, week) product_count,
  price_max-price_min weekly_price_range,
FROM (
  SELECT
    MIN(price) OVER (PARTITION BY store, week) price_min,
    MAX(price) OVER (PARTITION BY store, week) price_max,
    *
  FROM (
    SELECT
    ROW_NUMBER() OVER() row_id, // query succeeds with ROW_NUMBER() here
      dollars / units price,
      *
    FROM store_sales_facts f
    JOIN product_facts p
    ON p.product = f.product 
  )
)
ORDER BY row_id

I understand the ROW_NUMBER() window function was added relatively recently. Is this a quirk of the function itself? Any help would be appreciated.

Nate Vaughan
  • 3,471
  • 4
  • 29
  • 47
  • 1
    both cases for row_number positioning (top select or most inner) works perfectly for me. i think while simplifying your example you lost something that might cause your issue – Mikhail Berlyant Mar 13 '16 at 06:56
  • 1
    Can you give more details on how the query is "broken"? What error do you get? Can you provide a project_id:job_id of an example failure? Thank you. – Michael Sheldon Mar 14 '16 at 19:11
  • Hi Michael, sure: shining-landing-763:bquijob_3135639c_1536dc17dee And here's one that succeeded with the change: shining-landing-763:bquijob_723bc19d_1536dd31a2b – Nate Vaughan Mar 14 '16 at 20:17

1 Answers1

1

On the assumption that "what we see is what you are executing", then I suggest the core problem may be the absence of table references against the columns. Both the fact and dimension tables share at least one column name (product) so you may have ambiguity due to that. Hence the second thing I would suggest is to alias the tables and use aliases and EVERY column reference, but that occurs of course after the first recommendation which is to replace select * with explicit column names (with the relevant table alias).

SELECT
      ROW_NUMBER() OVER () row_id
    , d.price
    , d.store
    , d.WEEK
    , d.product
    , COUNT(DISTINCT d.product) OVER (PARTITION BY d.store, d.WEEK) product_count
    , d.price_max - d.price_min weekly_price_range
FROM (
      SELECT
            MIN(p.price) OVER (PARTITION BY store, WEEK) price_min
          , MAX(p.price) OVER (PARTITION BY store, WEEK) price_max
          , p.product, p.store, p.week
      FROM (
            SELECT
                  f.dollars / f.units price
                , f.product
                , s.store
                , f.WEEK
            FROM store_sales_facts f
            JOIN product_facts p ON p.product = f.product
            ) p
      ) d

The specific issue of row_number() I can't replicate or even test, but at this previous question RANK or ROW_NUMBER in BigQuery over a large dataset there appears to be some question as to the scalability of that function. I would suggest perhaps you do need to consider some partitioning, or avoiding it altogether as i don't know what value it will give you. If you are after some unique reference for each row it would be more efficient to use the already unique row reference form the fact table (f.id I presume).

Another long-shot possibility with row_number() over() is that the over clause may require an order by argument even if you don't really care about the order of the resultant row numbering. This problem exists in some other dbs and is resolved by ordering on a constant such as "select null" or "select 1" e.g.

row_number() over(order by (select 1))

However I don't have evidence this problem does exist in BigQuery or is resolved by that workaround.

Paul Maxwell
  • 33,002
  • 3
  • 32
  • 51