0

I am hoping I can solve this in T-SQL if appropriate. Any help is appreciated, I have seen other similar questions, but the fact I have multiple entries for the same person make it difficult.

I have a large dataset with IDs (unique for each person) and dates from 2015-2020. This is prescription data for individuals (ID) and their fill dates, so there are typically multiple rows for each ID - both within a year and across multiple years.

I want to randomly pick one date per ID/person that follows the following proportion/probability: 5%-2015, 10%-2016, 10%-2017, 15%-2018, 20%-2019, and 40% 2020. There are 1.2 million unique IDs/people, and about 300,000 people have a fill in 2020 which seems like a limiting factor.

  • 2
    What if a person has no rows in the table in some year[s]? – Serg Sep 03 '21 at 19:21
  • Are you wanting to track discrete persons longitudinally and therefore needing 5 years of data for each person? Or are you wanting to randomly sample 5% of all 2015 customers, 10% of all 2016, etc. without concern for whether any given individual appears in across all sets? – paneerakbari Sep 03 '21 at 19:26
  • Does this answer your question? [SQL - 5% random sample by group](https://stackoverflow.com/questions/20056376/sql-5-random-sample-by-group) – paneerakbari Sep 03 '21 at 20:02
  • @paneerakbari The latter, and I mostly just want to optimize the sample size. The part making things difficult is people having multiple dates across years and it's not uniform, it makes it difficult to optimize one year (2020) and then have a large sample size for the remaining years. Thank you for the link. – Michelle Newell Sep 03 '21 at 20:40
  • I entered an answer that may help with the "random date" portion. On reading your response to my question, I suspect it may be the case that your query could `INNER JOIN` to a subquery of the top n% of random dates for each given year. If that does work for you, I can edit my answer to make a more useful record for others – paneerakbari Sep 03 '21 at 20:44
  • 1
    Are the number of distinct people per year roughly even, or do you need to handle other possible limits? That is to say, suppose there are 300,000 distinct people in 2020, and so you want 40% of that, ie, 120,000 people. That would imply that you want ~60,000 people from 2019. What if there are only 30,000 people in 2019? Would you reduce the number of people from 2020 in order to maintain the proportions, or would you just take as many as you could from 2019? – allmhuran Sep 03 '21 at 21:50
  • 1
    I suppose another question is: do you want to maximize the sample size in total even if it means some of the same people need to be picked across different years? For example, suppose Bob is a member of the 2015 sample population. If we are able to pick Bob again for the 2016 population, then we can pick additional people from the 2017+ populations and maintain our proportions. If we don't pick Bob again in 2016 then we have to reduce our total sample size across all years in order to maintain our proportions. Do we pick Bob for both 2015 and 2016? – allmhuran Sep 03 '21 at 22:31
  • @allmhuran Yes, there are additional limits,, especially due to users with multiple years of use and I could run out of sample. I do want to just maintain the proportions, so I would reduce the sample size. And I am hoping to select just one date per person, so Bob should only be selected once. – Michelle Newell Sep 04 '21 at 20:52
  • Heh, well well well, finding the optimum solution here - ie, truly maximizing the total sample size across all years without selecting any person more than once and maintaining the specified proportions - is indeed a *very* tricky problem to solve, especially via SQL! I promise I will think about it. I don't promise I can figure it out :D – allmhuran Sep 04 '21 at 21:19

2 Answers2

1

This is a very tricky problem. Basically, the issue is choosing only one id per person. For instance, to get the distribution by year that you want, you could use:

select t.*
from (select t.*, rand(checksum(newid())) as rnd
      from t
     ) t cross join
     (values (2015, 0, 0.05), (2016, 0.05, 0.15), (2017, 0.15, 0.25), 
             (2018, 0.25, 0.40), (2019, 0.40, 0.60), (2020, 0.60, 1)
     ) v(lo, hi)
where rnd >= lo and rnd < hi;

However, that does not guarantee one row per person. It does (to the extent possible) guarantee the distribution you want.

The simplest method is to randomly choose one year per person and then do the stratified sampling:

select t.*
from (select t.*, rand(checksum(newid())) as rnd,
             row_number() over (partition by person order by newid()) as seqnum
      from t
     ) t cross join
     (values (2015, 0, 0.05), (2016, 0.05, 0.15), (2017, 0.15, 0.25), 
             (2018, 0.25, 0.40), (2019, 0.40, 0.60), (2020, 0.60, 1)
     ) v(lo, hi)
where seqnum = 1 and rnd >= lo and rnd < hi;

This chooses one random year per person and if there are an equal number of rows per year, then it should match the distribution you want.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
0

This doesn't address the PersonID part of the question, but it does get to the point of generating random dates that fall within a given year:

First, I wanted to generate a sizable sample dataset of IDs and intervals. All of that work is done in the subquery inside the FROM clause. This fills a temp table with 30k fake IDs randomly generated numbers. The modulus of 366 caps the random interval.

DROP TABLE IF EXISTS #Table1;

SELECT 1001999 + n AS ID,
       DATEADD(DAY, RandDate, '2015-01-01') AS RandomDate
  INTO #Table1
  FROM (SELECT TOP (30000)
               ROW_NUMBER() OVER (ORDER BY s1.[object_id]) AS n,
               ABS(CHECKSUM(NEWID()) % 366) AS RandDate
          FROM sys.all_objects      AS s1
         CROSS JOIN sys.all_objects AS s2
         ORDER BY s1.[object_id]) AS x;

SELECT TOP 5 PERCENT *
  FROM #Table1 AS t;

In the outer query, the RandDate interval value is used in a DATEADD function to get a randomly-selected date within the given year. I'm using 2015 for this example, but this can be repeated for any year.

Finally, since you want a sample size of 5% of all the 2015 records, I select just the TOP 5 PERCENT from the sample data, getting a result set like this:

ID RandomDate
1014920 2015-01-09 00:00:00.000
1014921 2015-06-22 00:00:00.000
1014922 2015-10-20 00:00:00.000
1014923 2015-01-29 00:00:00.000
1014924 2015-01-03 00:00:00.000
1014925 2015-05-05 00:00:00.000
1014926 2015-10-09 00:00:00.000
1014927 2015-08-29 00:00:00.000
1014928 2015-03-20 00:00:00.000

If you need to do any amount of validation that these dates exist as record dates in your prescription fill records, you could INNER JOIN this set of random dates to that table, getting you only active dates.

paneerakbari
  • 680
  • 1
  • 4
  • 17