
I have a data.table df with the columns ID, DATE and STOCK.

In this table, the same ID has multiple values with their date and stock:

ID        DATE        STOCK
a1     2017-05-04       1
a1     2017-06-04       4
a1     2017-06-05       1
a1     2018-05-04       1
a1     2018-06-04       3
a1     2018-06-05       1
a2     2016-11-26       2
a2      ...             ..

Using lubridate, I can get the week each date falls in as follows:

dfWeeks <- df[, WEEK := floor_date(DATE, "week")]

ID        DATE        STOCK        WEEK
a1     2017-05-04       1       2017-04-30
a1     2017-06-04       4       2017-06-04
a1     2017-06-05       1       2017-06-04
a1     2018-05-04       1       2018-04-29
a1     2018-06-04       3       2018-06-03
a1     2018-06-05       1       2018-06-03
a2     2016-11-26       2       2016-11-20
a2      ...             ..
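
For anyone who wants to reproduce the table above, here is a minimal sketch. It assumes DATE is stored as a Date and uses only the rows that are fully shown in the question (the elided a2 rows are left out); the STOCK type is a guess.

```r
library(data.table)
library(lubridate)

# Sample rows copied from the table above (only the fully shown rows).
df <- data.table(
  ID    = c("a1", "a1", "a1", "a1", "a1", "a1", "a2"),
  DATE  = as.Date(c("2017-05-04", "2017-06-04", "2017-06-05",
                    "2018-05-04", "2018-06-04", "2018-06-05",
                    "2016-11-26")),
  STOCK = c(1, 4, 1, 1, 3, 1, 2)
)

# floor_date(..., "week") rounds each date down to the start of its week
# (Sunday with lubridate's default settings).
df[, WEEK := floor_date(DATE, "week")]
print(df)
```

Note that := adds the column to df by reference, so assigning the result to a second name such as dfWeeks does not create an independent copy.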

From column DATE I know that my oldest date is 2017-05-04 and my newest date is 2018-06-05, an interval of about 56.71429 weeks:

dates <- c( "2017-05-04","2018-06-05")
dif <- diff(as.numeric(strptime(dates, format = "%Y-%m-%d")))/(60 * 60 * 24 * 7) 
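
The same number can also be obtained with difftime() from base R, which avoids the manual seconds-to-weeks division. This is only a sketch of an equivalent calculation, assuming the two dates are converted to Date first; weeks_between is a name made up for the example.

```r
dates <- as.Date(c("2017-05-04", "2018-06-05"))

# Difference between the two dates expressed directly in weeks.
weeks_between <- as.numeric(difftime(dates[2], dates[1], units = "weeks"))
print(weeks_between)
```

The two dates are 397 days apart, so the result is 397 / 7, i.e. roughly 56.71 weeks.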

My table has only 4 unique weeks for this ID, so the idea is to sum the stock for each week and insert the missing weeks (57 - 4 = 53) with a stock value of 0.

Then I can do the mean of all the weeks like

meanStock <- dfWeeks[, .(mean = sum(STOCK, na.rm = TRUE) / as.numeric(difftime(max(DATE), min(DATE), units = "weeks"))), by = .(ID)]

But I don't know if it will work. I hope I have made the problem clear; any advice or alternative approach is welcome.

UPDATE:

This is how I get the max and min date for each ID:

maxDate <- aggregate(df$DATE, by = list(df$ID), FUN = max)
colnames(maxDate) <- c("ID", "MAX")
minDate <- aggregate(df$DATE, by = list(df$ID), FUN = min)
colnames(minDate) <- c("ID", "MIN")
test <- merge(maxDate, minDate, by = "ID", all = TRUE)
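
Since df is already a data.table, the per-ID minimum and maximum can probably be computed in one grouped call instead of two aggregate() calls and a merge. A minimal sketch, assuming DATE is a Date column and using a few rows from the question as sample data:

```r
library(data.table)

df <- data.table(
  ID   = c("a1", "a1", "a2"),
  DATE = as.Date(c("2017-05-04", "2018-06-05", "2016-11-26"))
)

# One row per ID with its earliest and latest date.
test <- df[, .(MIN = min(DATE), MAX = max(DATE)), by = ID]
print(test)
```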
Programmer Man
  • Can you provide a subset of data to work with? – m0nhawk Nov 20 '18 at 16:46
  • It's confidential but basically follows same logic as shown in the question for df – Programmer Man Nov 20 '18 at 16:58
  • Can you provide at least an example of dataset in usable format (so people won't try to parse the data)? – m0nhawk Nov 20 '18 at 16:59
  • Somewhat related: [Insert rows for missing dates/times](https://stackoverflow.com/questions/16787038/insert-rows-for-missing-dates-times); [Fastest way to add rows for missing values in a data.frame?](https://stackoverflow.com/questions/10438969/fastest-way-to-add-rows-for-missing-values-in-a-data-frame?noredirect=1&lq=1) and heaps of alternatives in Linked therein. – Henrik Nov 20 '18 at 17:47

1 Answer


Something like:

library(data.table)
library(lubridate)  # provides floor_date()

setDT(df)[, DATE := as.Date(DATE)][, `:=` (st = min(DATE), end = max(DATE) + 7), by = ID][
  , .(ID = ID, DATE = DATE, STOCK = STOCK, Expanded = seq(st, end, by = "week")), by = 1:nrow(df)][
    , `:=` (WEEK = floor_date(Expanded, "week"), WEEK2 = floor_date(DATE, "week"))][
      WEEK != WEEK2, STOCK := 0][
        , .(SUM_STOCK = sum(STOCK)), by = .(WEEK, ID)]

Output (rows for the weeks of 2017-04-02 until 2017-06-11 and ID a1):

          WEEK ID SUM_STOCK
 1: 2017-04-02 a1         0
 2: 2017-04-09 a1         0
 3: 2017-04-16 a1         0
 4: 2017-04-23 a1         0
 5: 2017-04-30 a1         1
 6: 2017-05-07 a1         0
 7: 2017-05-14 a1         0
 8: 2017-05-21 a1         0
 9: 2017-05-28 a1         0
10: 2017-06-04 a1         5
11: 2017-06-11 a1         0
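
An alternative way to get the same kind of zero-filled table is to sum per week first and then join the sums onto a complete grid of weeks per ID. This is a sketch, not the code used above; weekly, grid, filled and MEAN_STOCK are names made up for the example, and the sample data is the first three a1 rows from the question.

```r
library(data.table)
library(lubridate)

df <- data.table(
  ID    = c("a1", "a1", "a1"),
  DATE  = as.Date(c("2017-05-04", "2017-06-04", "2017-06-05")),
  STOCK = c(1, 4, 1)
)

# 1. Sum the stock per ID and week.
weekly <- df[, .(SUM_STOCK = sum(STOCK)),
             by = .(ID, WEEK = floor_date(DATE, "week"))]

# 2. Build every week between each ID's first and last observed week.
grid <- weekly[, .(WEEK = seq(min(WEEK), max(WEEK), by = "week")), by = ID]

# 3. Join the sums onto the grid; weeks without data come back as NA, set them to 0.
filled <- weekly[grid, on = .(ID, WEEK)]
filled[is.na(SUM_STOCK), SUM_STOCK := 0]

# 4. Mean weekly stock per ID, counting the empty weeks as 0.
meanStock <- filled[, .(MEAN_STOCK = mean(SUM_STOCK)), by = ID]
print(filled)
print(meanStock)
```

For these three rows the grid runs from the week of 2017-04-30 to the week of 2017-06-04 (six weeks), so the total stock of 6 gives a weekly mean of 1.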
arg0naut91
  • This looks very good, but I need to keep the ID as well. Also, st = as.Date("2017-05-04"), end = as.Date("2018-06-15") is not the same for each ID. I have managed to get the max and min date; I will update the question. – Programmer Man Nov 20 '18 at 17:05
  • Check the update; that's how I get the max and min date for each ID. – Programmer Man Nov 20 '18 at 17:08
  • Have a look, I've updated the code. Now it runs for each ID, and expands the data by each min + max (I add 7 days to maximum to be sure all weeks are captured, you can filter that out later). – arg0naut91 Nov 20 '18 at 17:12
  • That's good, but why do I get more weeks for ID a1? I know there are 57 weeks between the oldest and newest date, but in df I got more entries. – Programmer Man Nov 21 '18 at 09:21
  • It depends on what you take as the maximum date for ID a1. If you take "2018-06-05", then R will calculate it to the previous week (e.g. if you do a sequence, the last week you get is in 2018-05, i.e. May and not June). If you want to include the first week of June, you need to add 7 days to "2018-06-05"; the difference in weeks then also becomes 57.7 and not 56.7. – arg0naut91 Nov 21 '18 at 10:19
  • I don't understand. I know the number of weeks is calculated from the newest and oldest date; what I don't understand is how there can be more weeks than that in the interval. – Programmer Man Nov 21 '18 at 10:34
  • You can put end = max(as.Date(DATE))) in the first line instead of end = max(as.Date(DATE)) + 7), and see if you get the result you want. I think in that case you will end up with your desired number of weeks, but you will lose observations effectively. – arg0naut91 Nov 21 '18 at 11:49
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/184031/discussion-between-programmer-man-and-arg0naut). – Programmer Man Nov 21 '18 at 11:56
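
The week-count question raised in the comments can be illustrated with a small sketch: the elapsed time between two dates and the number of calendar weeks that interval touches are different quantities. The variable names are made up for the example, and weeks are assumed to start on Sunday (lubridate's default).

```r
library(lubridate)

first <- as.Date("2017-05-04")
last  <- as.Date("2018-06-05")

# Elapsed time between the two dates, in weeks.
elapsed_weeks <- as.numeric(difftime(last, first, units = "weeks"))

# Start of every calendar week that the interval touches.
week_starts <- seq(floor_date(first, "week"), floor_date(last, "week"), by = "week")

print(elapsed_weeks)
print(length(week_starts))
```

Here the elapsed time is about 56.71 weeks, while the interval touches 58 calendar weeks (2017-04-30 through 2018-06-03), because both the first and the last partial week count as a bin of their own.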