0

Not the best title... but I'm trying to track tasks outstanding using weekly data exports. I want to use the total task count as of the minimum date (e.g. Jan 1st) of each year as a variable to return percent remaining (proportion) each week throughout that year. See example:

dt <- data.table(id.date=as.Date(c(rep('2019-01-01',80),rep('2019-01-08',60),rep('2019-01-15',40),
                         rep('2019-01-22',40),rep('2020-01-01',70),rep('2020-01-08',50),
                         rep('2020-01-15',40),rep('2020-01-29',20))),
           task.type=rep('taskA'))[order(id.date)][,year:=year(id.date)]

# 2 sample records per weekly report
dt[,.SD[sample(.N,min(2,.N))],by=id.date]

 id.date  task.type year
  <date>    <chr>  <int>
2019-01-01  taskA   2019        
2019-01-01  taskA   2019        
2019-01-08  taskA   2019        
2019-01-08  taskA   2019        
2019-01-15  taskA   2019        
2019-01-15  taskA   2019        
2019-01-22  taskA   2019        
2019-01-22  taskA   2019        
2020-01-01  taskA   2020        
2020-01-01  taskA   2020
2020-01-08  taskA   2020        
2020-01-08  taskA   2020    
2020-01-15  taskA   2020
2020-01-15  taskA   2020    
2020-01-29  taskA   2020    
2020-01-29  taskA   2020

And the desired output:

# for weekly reports in 2019, 80 tasks would be used for all reports in 2019
# and for 2020, 70 would be used.
   id.date   N    pct.rem
  <date>   <int>    <dbl>
2019-01-01  80  1.0000000       
2019-01-08  60  0.7500000       
2019-01-15  40  0.5000000       
2019-01-22  40  0.5000000       
2020-01-01  70  1.0000000       
2020-01-08  50  0.7142857       
2020-01-15  40  0.5714286       
2020-01-29  20  0.2857143
    

I've made attempts reworking some of the answers from this SO discussion here. But no success. My guess is that I need to utilize .SD or .EACHI is some manner but I'm just starting out with data.table.

Any guidance here is much appreciated. Let me know if more clarity would be helpful. Thanks.

2 Answers2

1

You can also do something like this

out <- dt[, .(.N), by = id.date][, pct.rem := N / N[[1L]], by = year(id.date)]

Output

> out[]
      id.date  N   pct.rem
1: 2019-01-01 80 1.0000000
2: 2019-01-08 60 0.7500000
3: 2019-01-15 40 0.5000000
4: 2019-01-22 40 0.5000000
5: 2020-01-01 70 1.0000000
6: 2020-01-08 50 0.7142857
7: 2020-01-15 40 0.5714286
8: 2020-01-29 20 0.2857143

If you have multiple types of tasks per id.date, then try this

out <- dt[, .(.N), by = .(id.date, task.type)][, pct.rem := N / N[[1L]], by = .(year(id.date), task.type)]

Output

> out[]
      id.date task.type  N   pct.rem
1: 2019-01-01     taskA 80 1.0000000
2: 2019-01-08     taskA 60 0.7500000
3: 2019-01-15     taskA 40 0.5000000
4: 2019-01-22     taskA 40 0.5000000
5: 2020-01-01     taskA 70 1.0000000
6: 2020-01-08     taskA 50 0.7142857
7: 2020-01-15     taskA 40 0.5714286
8: 2020-01-29     taskA 20 0.2857143
ekoam
  • 8,744
  • 1
  • 9
  • 22
  • many thanks. For clarity could you confirm the usage of N[[1L]]; is this to subset the parent dt's 1st record of each group column "N"? – larry_opoly Jan 07 '22 at 20:01
  • 1
    @larry_opoly `dt[, .(.N), by = id.date]` returns a data.table with a column `N` containing the count of tasks by each `id.date`. Hence, `N[[1L]]` is the first count of tasks for each year and also the total number of tasks for that year. For example, for year 2019, `N[[1L]]` is 80. – ekoam Jan 07 '22 at 21:52
0

It seems like

dt[, N := .N, by = id.date]
dt[, pct.rem := N/first(N), by = year]
# dt[, pct.rem := N/.SD[1,N], by = year] 
# .SD[1,N] by year gives the total task's number of each year

dt[,.SD[1,.(N,pct.rem)], by = id.date]

If you don't want to change dt, you can also use copy(dt) instead.

output:

      id.date     N   pct.rem
       <Date> <int>     <num>
1: 2019-01-01    80 1.0000000
2: 2019-01-08    60 0.7500000
3: 2019-01-15    40 0.5000000
4: 2019-01-22    40 0.5000000
5: 2020-01-01    70 1.0000000
6: 2020-01-08    50 0.7142857
7: 2020-01-15    40 0.5714286
8: 2020-01-29    20 0.2857143
Peace Wang
  • 2,399
  • 1
  • 8
  • 15
  • I get a coercion error after running the second dt: Error in as.vector(x, "list") : cannot coerce type 'closure' to vector of type 'list'. is there something different between our environments? I'm running data.table 1.14.2 – larry_opoly Jan 07 '22 at 20:16
  • I am using the latest `data.table 1.14.3 IN DEVELOPMENT`. Can you test the updated answer? – Peace Wang Jan 08 '22 at 00:29