3

...Spent several hours trying what not and researching this forum. Quite pessimistic at this point about the usefulness of Google Big Query (GBQ) for anything more than trivial queries, but here is one last desperate try, maybe someone has better ideas:

Let's say we have a COUNTRY table with average population weight(in kilograms) and height (in meters) per country as follows:

  country   |  continent   | weight | height |
  ============================================
  US        | America      | 200    | 2.00   |
  Canada    | America      | 170    | 1.90   |
  France    | Europe       | 160    | 1.78   |
  Germany   | Europe       | 110    | 2.00   |

Let's say you want to pick out and live in the European country with "smallest" people, where you define the measure "smallness" as the weighted sum of body weight and height with some constant weights, such as 0.6 for body weight and 0.4 for body height.

In Oracle or MS SQL server this can be done elegantly and compactly by using analytic window functions such as rank() and row_number(), for example:

select country, combined_score
from (select 
        country
       ,( 0.6*rank(weight) over()  + 0.4*rank(height) over() ) combined_score
from country
where continent = 'Europe')
order by combined_score

Note that the ranking is done after the filtering for continent. The continent filter is dynamic (say input from a web form), so the ranking can not be pre-calculated and stored in the table in advance!

In GBQ there are no rank() , row_number() or over(). Even if you try some "poor man" hacks it is still not going to work because GBQ does not support correlated queries. Here are similar attempts by other people with some pretty unsatisfactory and inefficient results:

BigQuery SQL running totals

Row number in BigQuery?

Any ideas how this can be done? I can even restructure the data to use nested records, if it helps. Thank you in advance!

Community
  • 1
  • 1
  • There are many statements in this question that I'd like to go over before getting to the answer you are looking for. But first, a question: Why 0.6*rank(weight) + 0.4*rank(height) and not 0.6*weight + 0.4*height? Normalizing? The problem of this formula is that if a country is the smallest in height, it will get the same number, regardless if the height is 100cm or 50cm. In the described scenario, this should matter. – Felipe Hoffa Apr 15 '13 at 03:42
  • 1
    Fh,Yes normalizing is the reason.If you have N countries, then the ranking for each of the measures would be 1..10, so 0.6*rank(weight) + 0.4*rank(height) will disregard the unit of measure and weight the pure rankings. In contrast, 0.6*weight + 0.4*height will take into effect the magnitude of the unit of measures, so for example, if you were measuring height in millimeters (2m = 2000mm), the height will dwarf the weight. – user2264178 Apr 16 '13 at 03:11
  • Fh,Imagine a country is the smallest in height but the largest in weight and you take the two measures with equal weight (.5). Then this would put that country somewhere in the middle in the tally of combined scores, perhaps not the best pick. the example above is trivial, but you can imagine a table with 10 measures and 1000s of records. My point is that this is classic analytical procesing that GBQ fails to provide in any way (unless you can point out how this can be done and prove me wrong) – user2264178 Apr 16 '13 at 03:19

2 Answers2

1

In your specific example, I think you can compute the result without using RANK and OVER at all:

SELECT country, score
FROM (SELECT country, 0.6 * weight + 0.4 * height AS score
      FROM t WHERE continent = 'Europe')
ORDER BY score;

However, I'm assuming that this is a toy example and that your real problem involves use of RANK more in line with your example query. In that case, BigQuery does not yet support analytic functions directly, but we'll consider this to be a feature request. :-)

Jeremy Condit
  • 6,766
  • 1
  • 28
  • 30
1

An equivalent for RANK in BigQuery is row_number().

For example, the top 5 contributors to Wikipedia, with row_number giving their place:

SELECT
  ROW_NUMBER() OVER() row_number,
  contributor_username,
  count,
FROM (
  SELECT contributor_username, COUNT(*) count,
  FROM [publicdata:samples.wikipedia]
  GROUP BY contributor_username
  ORDER BY COUNT DESC
  LIMIT 5)
Felipe Hoffa
  • 54,922
  • 16
  • 151
  • 325