0

I have a dataset of around 50k records. The important fields are "date", "name" and "project ID". My challenge is I need to pull the first 5 projects for each person. There are around 500 people in the dataset. I know how to get the earilest projects, or the earliest projects for a specific individual, but I don't know how to get all the data for the earliest projects for all individuals. Any suggestions?

My desired output would be something like this:

NAME, DATE, PROJECT_ID
sally, 12/2/2020, 12432
sally, 12/6/2020, 13433
sally, 12/16/2020, 14453
sally, 1/8/2021, 14468
sally, 2/6/2021, 15496
jimmy, 11/15/2020, 24531
jimmy, 3/21/2021, 42322
jimmy, 5/13/2021, 44332

...

I'm using SQL in Google Big Query. I haven't gotten far because the logic is baffling me. And as you may be able to tell, my sql skills are elementary at best - but I'm trying! :)

Ryan Wilson
  • 10,223
  • 2
  • 21
  • 40
BennieB
  • 11
  • 3
  • Does your table have a user id? Partitioning in the answer i provided by name is open to bad data if more than one user has the same name. – Ryan Wilson Dec 01 '22 at 19:15
  • Yes, I do have a user ID. I plugged this in and it works perfectly. And using the User ID does make 100% sense. Thank you so much! I have much to learn, but you really helped move this along! Thank you! – BennieB Dec 01 '22 at 19:42
  • Does this answer your question? [Get top n records for each group of grouped results with Bigquery (standard SQL)](https://stackoverflow.com/questions/44680464/get-top-n-records-for-each-group-of-grouped-results-with-bigquery-standard-sql) – nbk Dec 01 '22 at 19:48
  • @RyanWilson Apparently I'm too new to upvote. Sorry! – BennieB Dec 01 '22 at 21:32

1 Answers1

0

You can use the ROW_NUMBER function with the PARTITION BY clause as a subquery, like so, and get only records with a row number <= 5 as it's ordering by date, this will get the first 5 records for each user, if you needed the 5 most recent records for each user, you would do ORDER BY DATE DESC, yourtable would be the name of your actual database table:

--Partitioning by Name is not consistent if more than one user
--shares the same name, if you have a user id which you can use to
--partition it would be better, but the following will get you moving
--in the right direction.
SELECT NAME, DATE, PROJECT_ID
FROM
(
    SELECT NAME, DATE, PROJECT_ID, ROW_NUMBER ()
    OVER (PARTITION BY Name ORDER BY DATE) AS rownum
    FROM yourtable
) 
where rownum <= 5 

More information on the ROW_NUMBER function: bigquery-row-number-function

Ryan Wilson
  • 10,223
  • 2
  • 21
  • 40