Any ideas how to make this query return results on Google BigQuery? I'm getting a "resources exceeded" error. There are about 2 billion rows in the dataset, and I'm trying to get the artist ID that appears the most for each user_id.

SELECT user_id, artist, COUNT(*) AS count
FROM [legacy20130831.merged_data] AS d
GROUP EACH BY user_id, artist
ORDER BY user_id ASC, count DESC
Jacob Gillespie

1 Answer


An equivalent query on public data that throws the same error:

SELECT actor, repository_name, COUNT(*) AS count
FROM [githubarchive:github.timeline] AS d
GROUP EACH BY actor, repository_name
ORDER BY actor, count DESC

Compare it with the same query plus a LIMIT on the number of results returned. This one works (14 seconds for me):

SELECT actor, repository_name, COUNT(*) AS count
FROM [githubarchive:github.timeline] AS d
GROUP EACH BY actor, repository_name
ORDER BY actor, count DESC
LIMIT 100

Instead of using a LIMIT, you could go through a fraction of the user_ids at a time. In my case, a third works:

SELECT actor, repository_name, COUNT(*) AS count
FROM [githubarchive:github.timeline] AS d
WHERE ABS(HASH(actor) % 3) = 0
GROUP EACH BY actor, repository_name

But what you really want is "to get the artist ID that appears the most for each user_id". Let's go further and get that:

SELECT actor, repository_name, count
FROM (
  SELECT actor, repository_name, count,
         ROW_NUMBER() OVER (PARTITION BY actor ORDER BY count DESC) AS rank
  FROM (
    SELECT actor, repository_name, COUNT(*) AS count
    FROM [githubarchive:github.timeline] AS d
    WHERE ABS(HASH(actor) % 10) = 0
    GROUP EACH BY actor, repository_name
  )
)
WHERE rank = 1

Note that this time I used % 10, as it returns results faster. But you might be wondering: "I want to get my results with one query, not 10."

There are two things you can do for that:

  • Union the partitions (in BigQuery, a comma in the FROM clause performs a union, not a join); see the sketch after this list.
  • If you are still exceeding resources, you might need to materialize the table: run the original query, save the result to a new table, and run the RANK() step over that table instead of over an in-memory GROUP.
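
Here's a minimal sketch of both ideas combined, assuming each hash partition's counts were first materialized into their own tables (the names [mydataset.counts_shard0] through [mydataset.counts_shard2] are hypothetical). First, run the counting stage once per shard, with N replaced by 0, 1, and 2 in turn, saving each result as a table:

SELECT actor, repository_name, COUNT(*) AS count
FROM [githubarchive:github.timeline]
WHERE ABS(HASH(actor) % 3) = N
GROUP EACH BY actor, repository_name

Then rank over the union of the saved tables (the comma in the FROM unions them), so the ranking runs over stored rows instead of an in-memory GROUP:

SELECT actor, repository_name, count
FROM (
  SELECT actor, repository_name, count,
         ROW_NUMBER() OVER (PARTITION BY actor ORDER BY count DESC) AS rank
  FROM [mydataset.counts_shard0], [mydataset.counts_shard1], [mydataset.counts_shard2]
)
WHERE rank = 1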

If you are willing to share your dataset with me, I could provide dataset-specific advice (a lot depends on cardinality).

Felipe Hoffa
  • So, for the last query, that gets 1/10 of the results, and then we'd combine it with other queries to get the other 9/10? Or does that get them all? – Jacob Gillespie Sep 03 '13 at 21:28
  • Yes, the last query here only gets 1/10th of the result. Or a 1/3 if you replace 10 with 3. The fine tuning depends a lot on the cardinality of your data. As a guess (without being able to see the data), the most cost effective way would be to output the SELECT/FROM/WHERE/GROUP to a temporary table, and run the RANK() over it. – Felipe Hoffa Sep 03 '13 at 22:08
  • If you want, I'd be happy to share the dataset - modding by 20 instead of 10 still isn't small enough, apparently. I think I can share by email address - mine is jacobwgillespie@gmail.com, if you'd like to get in touch. – Jacob Gillespie Sep 04 '13 at 00:24
  • Yeah, modding by 50 isn't working either... There's 2.1 billion rows in there... This is the query I was using: SELECT user_id, artist, count FROM ( SELECT user_id, artist, count, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY count DESC) rank FROM ( SELECT user_id, artist, count(*) as count FROM [legacy20130831.merged_data] as d WHERE user_id % 50 = 0 GROUP EACH BY user_id, artist )) WHERE rank=1 – Jacob Gillespie Sep 04 '13 at 01:25
  • 2
    I don't mean to thread-hijack, but the issue is the ORDER BY has too many rows -- it is (currently) a serial operation, run only on a single machine per query. It sounds like you don't actually care about the order, you just want to get the per-user rows together. One way of doing this is via the NEST operation. I'll follow up with Fh with a suggestion. – Jordan Tigani Sep 04 '13 at 02:13
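
For reference, a rough sketch of what that NEST idea could look like in legacy SQL. NEST collects each group's values into a repeated field, which avoids the global ORDER BY; this shows only the grouping step against the asker's table, not the full per-user top-artist computation:

SELECT user_id, NEST(artist) AS artists
FROM [legacy20130831.merged_data]
GROUP EACH BY user_id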