31

Is there any way to get the row number for each record in BigQuery? (From the specs, I haven't seen anything about it.) There is an NTH() function, but that applies to repeated fields.

There are some scenarios where a row number is not necessary in BigQuery, such as when using the TOP() or LIMIT functions. However, I need it to simulate some analytic functions, such as a cumulative SUM(). For that purpose I need to identify each record with a sequential number. Is there any workaround for this?

Thanks in advance for your help!

Leo

mdahlman
  • 9,204
  • 4
  • 44
  • 72
Leo Stefa
  • 421
  • 1
  • 4
  • 5

7 Answers

59

2018 update: If all you want is a unique id for each row

#standardSQL
SELECT GENERATE_UUID() uuid, *
FROM table

2018 #standardSQL solution:

SELECT
  ROW_NUMBER() OVER() row_number, contributor_username,
  count
FROM (
  SELECT contributor_username, COUNT(*) count
  FROM `publicdata.samples.wikipedia`
  GROUP BY contributor_username
  ORDER BY COUNT DESC
  LIMIT 5)

But what about "Resources exceeded during query execution: The query could not be executed in the allotted memory. OVER() operator used too much memory."?

Ok, let's reproduce that error:

SELECT *, ROW_NUMBER() OVER() 
FROM `publicdata.samples.natality` 

Yes, that happens because OVER() needs to fit all the data into one VM, which you can solve with PARTITION:

SELECT *, ROW_NUMBER() OVER(PARTITION BY year, month) rn 
FROM `publicdata.samples.natality` 

"But now many rows have the same row number and all I wanted was a different id for each row"

Ok, ok. Let's use partitions to give a row number to each row, and let's combine that row number with the partition fields to get a unique id per row:

SELECT *
  , FORMAT('%i-%i-%i', year, month, ROW_NUMBER() OVER(PARTITION BY year, month)) id
FROM `publicdata.samples.natality` 
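Since the original question was about simulating a cumulative sum, note that a window SUM() can now do that directly in standard SQL, with no row number needed. A sketch against the same natality sample:

#standardSQL
SELECT
  year,
  COUNT(*) births,
  SUM(COUNT(*)) OVER(ORDER BY year) cumulative_births
FROM `publicdata.samples.natality`
GROUP BY year
ORDER BY year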



The original 2013 solution:

Good news: BigQuery now has a row_number function.

Simple example:

SELECT [field], ROW_NUMBER() OVER()
FROM [table]
GROUP BY [field]

More complex, working example:

SELECT
  ROW_NUMBER() OVER() row_number,
  contributor_username,
  count,
FROM (
  SELECT contributor_username, COUNT(*) count,
  FROM [publicdata:samples.wikipedia]
  GROUP BY contributor_username
  ORDER BY COUNT DESC
  LIMIT 5)
Felipe Hoffa
  • 54,922
  • 16
  • 151
  • 325
3

Another hack would be to go along the lines of:

SELECT *
FROM UNNEST(ARRAY(
    SELECT myColumn FROM myTable
)) AS myValue WITH OFFSET off

This gives you a result set with two columns: myValue and off.

A benefit of this is that you can also use off in a WHERE clause to create a non-deterministic LIMIT, e.g. WHERE off < (SELECT SUM(amount) FROM mySecondTable)

Note that I do not consider this a viable alternative for large amounts of data. But it might suit your use case.
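The same trick can also emulate ROW_NUMBER() over an ordered set, since ARRAY(SELECT ... ORDER BY ...) preserves the order of the subquery. A sketch with the same hypothetical myTable:

SELECT myValue, off + 1 AS row_number
FROM UNNEST(ARRAY(
    SELECT myColumn FROM myTable ORDER BY myColumn
)) AS myValue WITH OFFSET off
ORDER BY off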

Melle
  • 7,639
  • 1
  • 30
  • 31
0

We don't expose a row identifier. Can you simply add one to your data when you import it?

Ryan Boyd
  • 2,978
  • 1
  • 21
  • 19
  • Thanks for your answer Ryan. Even if we could add a row identifier to our imports, it wouldn't be useful, since we need the row number after applying a group function over the original data. – Leo Stefa Jun 18 '12 at 13:32
  • So you're looking for a result row #, not a row # that represents each row of the underlying data? – Ryan Boyd Jun 18 '12 at 23:59
0

I thought maybe I could get around the lack of a ROW_NUMBER() function by joining a table to itself on a <= and then doing a count(*) on the results (which is how you do it sometimes in MySQL). Turns out, BigQuery only supports joins on straight-up "=".

Foiled again. I think this is impossible in BQ.
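For what it's worth, BigQuery's newer standard SQL dialect does accept inequality join predicates, so the self-join approach described above works there. A sketch, assuming a hypothetical table myTable with a unique key id:

SELECT a.id, COUNT(*) AS row_number
FROM myTable a
JOIN myTable b
  ON b.id <= a.id
GROUP BY a.id

Note that this self-join grows quadratically with the table, so it is only practical for small inputs.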

John
  • 321
  • 1
  • 3
  • 9
0

I recently came upon this problem, but my use case needed a continuous row number from start to end. Probably not ideal, but I'm leaving it here in case it helps someone.

I use a guide table with an offset for each partition, to be added to all of its rows. The offset is the total count of rows in all of its preceding partitions.

select offset+ROW_NUMBER() OVER(PARTITION BY partitionDate) rowId
from `sample.example` input
left join
      (select partitions.partitionDate, partitions.count, SUM(duplicate.count)-partitions.count as offset
       from (
           select date(_PARTITIONTIME) partitionDate,COUNT(1) count 
           FROM `sample.example` 
           where date(_PARTITIONTIME) >= "2020-01-01" 
           group by _PARTITIONTIME) partitions
      inner join (
           select date(_PARTITIONTIME) partitionDate,COUNT(1) count 
           FROM `sample.example`
           where date(_PARTITIONTIME) >= "2020-01-01" 
           group by _PARTITIONTIME) duplicate 
      on partitions.partitionDate >= duplicate.partitionDate
      group by partitions.partitionDate, partitions.count
      order by partitions.partitionDate) guide
on date(_PARTITIONTIME) = guide.partitionDate
where date(_PARTITIONTIME) >= "2020-01-01" 
order by partitionDate
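For comparison, the same per-partition offset can be computed with a window SUM() over the grouped partition counts, avoiding the self-join (a sketch, assuming the same sample.example table):

select partitionDate, SUM(cnt) OVER(ORDER BY partitionDate) - cnt as offset
from (
    select date(_PARTITIONTIME) partitionDate, COUNT(1) cnt
    FROM `sample.example`
    where date(_PARTITIONTIME) >= "2020-01-01"
    group by partitionDate) counts
order by partitionDate

This yields one offset row per partition date, which can then be joined back to the input rows as in the query above.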
Prince
  • 1
  • 3
0

I think you can avoid "Resources exceeded during query execution" while using OVER() with ORDER BY or PARTITION BY by naming the window in a WINDOW clause:

SELECT *, ROW_NUMBER() OVER(row_number_partition) rn 
FROM `publicdata.samples.natality` 
WINDOW row_number_partition AS (PARTITION BY year, month)
Karthi V
  • 79
  • 1
  • 3
0

A simple query to add an increasing number to all your rows :)


SELECT ROW_NUMBER() OVER (PARTITION BY 'hola') as row_number, * 
FROM <table>

Of course, this is a hack: the constant puts every row into a single partition, so it behaves just like a plain OVER() and can hit the same memory limit on large tables.

caravana_942
  • 632
  • 1
  • 8
  • 26