
I'd like to calculate a column based on logic between tables. Let me explain: given a data.frame with this structure:

Transaction - Start - End - Quantity - Area
(Number) - (Date) - (Date) - (Number) - (Number)

Start and End determine how long a Transaction has been in an Area. I want to calculate the stock in each Area for each day.

What is stock? The sum of Quantity over all transactions that satisfy this logic:

Start <= day AND End >= day
OR
Start <= day AND End is NA (the transaction is still open)

What is day? Each day of the calendar (hence the logic between tables).
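
In R terms (just a rough sketch, assuming df is the data frame described above and day is a single Date), the condition for one day would look roughly like:

# transactions in stock on a given day; an NA End means the transaction is still open
in_stock <- df$Start <= day & (is.na(df$End) | df$End >= day)
sum(df$Quantity[in_stock])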

How can the stock per day per area be calculated in R? Example data:

calendar <- as.data.frame(seq.Date(as.Date("2019-01-01"), as.Date("2019-01-10"), 1))
colnames(calendar) <- c("Date")

> head(calendar)
        Date
1 2019-01-01
2 2019-01-02
3 2019-01-03
4 2019-01-04
5 2019-01-05
6 2019-01-06

Transaction <- c(299784, 299785, 301913, 302840, 305722, 285874, 285875, 312587, 326842, 328521)
Start <- as.Date(c("2019-01-01", "2019-01-01", "2019-01-02", "2019-01-02", "2019-01-03", "2019-01-01", "2019-01-01", "2019-01-02", "2019-01-02", "2019-01-03"))
End <- as.Date(c("2019-01-05", "2019-01-04", "2019-01-06", "2019-01-03", NA, "2019-01-05", "2019-01-04", "2019-01-06", "2019-01-03", NA))
Quantity <- c(1,1,1,1,1,1,1,1,1,1)
Area <- c(7065, 7065, 7065, 7065, 7065, 6098, 6098, 6098, 6098, 6098)
df <- data.frame(Transaction, Start, End, Quantity, Area)

> df
   Transaction      Start        End Quantity Area
1       299784 2019-01-01 2019-01-05        1 7065
2       299785 2019-01-01 2019-01-04        1 7065
3       301913 2019-01-02 2019-01-06        1 7065
4       302840 2019-01-02 2019-01-03        1 7065
5       305722 2019-01-03       <NA>        1 7065
6       285874 2019-01-01 2019-01-05        1 6098
7       285875 2019-01-01 2019-01-04        1 6098
8       312587 2019-01-02 2019-01-06        1 6098
9       326842 2019-01-02 2019-01-03        1 6098
10      328521 2019-01-03       <NA>        1 6098

The stock each day would be:

         Date  Area Stock
1  2019-01-01  7065     2
2  2019-01-02  7065     4
3  2019-01-03  7065     5
4  2019-01-04  7065     4
5  2019-01-05  7065     3
6  2019-01-06  7065     2
7  2019-01-07  7065     1
8  2019-01-08  7065     1
9  2019-01-09  7065     1
10 2019-01-10  7065     1
11 2019-01-01  6098     2
12 2019-01-02  6098     4
13 2019-01-03  6098     5
14 2019-01-04  6098     4
15 2019-01-05  6098     3
16 2019-01-06  6098     2
17 2019-01-07  6098     1
18 2019-01-08  6098     1
19 2019-01-09  6098     1
20 2019-01-10  6098     1

or:

         Date  7065  6098
1  2019-01-01     2     2
2  2019-01-02     4     4
3  2019-01-03     5     5
4  2019-01-04     4     4
5  2019-01-05     3     3
6  2019-01-06     2     2
7  2019-01-07     1     1
8  2019-01-08     1     1
9  2019-01-09     1     1
10 2019-01-10     1     1
AleG

1 Answer


Edit3:

This approach is based on the following principle:
You need to calculate your stock per day, but your data gives you ranges of days. We therefore have to convert each date range into single days, while keeping the rest of the data, and then group and count as shown below.
However, you have the 'annoying' NAs in there, so we first have to get rid of those. Since a transaction with an NA End date should still be considered ongoing, I'd first convert the NAs to the maximum date of your calendar data frame, so that they are later counted as 1 for each day up to that maximum:

# replace NA End dates with the last calendar date, so open transactions count until then
df$End <- as.Date(ifelse(is.na(df$End), max(calendar$Date), df$End), origin = "1970-01-01")
> df
   Transaction      Start        End Quantity Area
1       299784 2019-01-01 2019-01-05        1 7065
2       299785 2019-01-01 2019-01-04        1 7065
3       301913 2019-01-02 2019-01-06        1 7065
4       302840 2019-01-02 2019-01-03        1 7065
5       305722 2019-01-03 2019-01-10        1 7065
6       285874 2019-01-01 2019-01-05        1 6098
7       285875 2019-01-01 2019-01-04        1 6098
8       312587 2019-01-02 2019-01-06        1 6098
9       326842 2019-01-02 2019-01-03        1 6098
10      328521 2019-01-03 2019-01-10        1 6098
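
(As a side note, a small sketch assuming dplyr is available: the same replacement can be written more directly with coalesce, which keeps the Date class and avoids the ifelse/origin round-trip.)

df$End <- dplyr::coalesce(df$End, max(calendar$Date))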

After this we need to generate the missing dates between the Start and End dates. To do so we can use complete from tidyr, as in MKR's example, in the following way:

library(dplyr)
library(tidyr)
nf <- df %>%
  group_by(row_number()) %>%
  complete(Start = seq.Date(max(Start), max(End), by = 'day')) %>%
  fill(Transaction, End, Quantity, Area)

Our new data frame nf now has one Start row per day for each Transaction/Quantity/Area combination, covering that transaction's full date range.

> nf
# A tibble: 48 x 6
# Groups:   row_number() [10]
   `row_number()` Start      Transaction End        Quantity  Area
            <int> <date>           <dbl> <date>        <dbl> <dbl>
 1              1 2019-01-01      299784 2019-01-05        1  7065
 2              1 2019-01-02      299784 2019-01-05        1  7065
 3              1 2019-01-03      299784 2019-01-05        1  7065
 4              1 2019-01-04      299784 2019-01-05        1  7065
 5              1 2019-01-05      299784 2019-01-05        1  7065
 6              2 2019-01-01      299785 2019-01-04        1  7065
 7              2 2019-01-02      299785 2019-01-04        1  7065
 8              2 2019-01-03      299785 2019-01-04        1  7065
 9              2 2019-01-04      299785 2019-01-04        1  7065
10              3 2019-01-02      301913 2019-01-06        1  7065
# … with 38 more rows
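
As a quick sanity check (just a sketch), nf should contain one row per transaction-day; with the NA End dates already replaced above, that is 48 rows for this data:

# each transaction contributes (End - Start + 1) days
> sum(as.numeric(df$End - df$Start) + 1)
[1] 48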

We can then proceed with what I already suggested earlier:

zf <- nf %>% group_by(Start, Area) %>% tally(Quantity)

> zf
# A tibble: 20 x 3
# Groups:   Start [10]
   Start       Area     n
   <date>     <dbl> <dbl>
 1 2019-01-01  6098     2
 2 2019-01-01  7065     2
 3 2019-01-02  6098     4
 4 2019-01-02  7065     4
 5 2019-01-03  6098     5
 6 2019-01-03  7065     5
 7 2019-01-04  6098     4
 8 2019-01-04  7065     4
 9 2019-01-05  6098     3
10 2019-01-05  7065     3
11 2019-01-06  6098     2
12 2019-01-06  7065     2
13 2019-01-07  6098     1
14 2019-01-07  7065     1
15 2019-01-08  6098     1
16 2019-01-08  7065     1
17 2019-01-09  6098     1
18 2019-01-09  7065     1
19 2019-01-10  6098     1
20 2019-01-10  7065     1

What we are doing here is asking dplyr to group by Start (which, after the expansion, identifies each transaction-day) and Area, and to give you the total Quantity per group. You can then store this in a new table, in this example zf.

Edit1: To end up with your requested table format, you can then run spread from the tidyr package:

zf <- zf %>% spread(Area, n)
> zf
# A tibble: 10 x 3
# Groups:   Start [10]
   Start      `6098` `7065`
   <date>      <dbl>  <dbl>
 1 2019-01-01      2      2
 2 2019-01-02      4      4
 3 2019-01-03      5      5
 4 2019-01-04      4      4
 5 2019-01-05      3      3
 6 2019-01-06      2      2
 7 2019-01-07      1      1
 8 2019-01-08      1      1
 9 2019-01-09      1      1
10 2019-01-10      1      1

This spreads the Area column into new columns filled with the counts (n). Finally, all you have to do is keep the dates that exist in your calendar data frame, for example as sketched below.
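
One possible way to do that last step (a sketch, assuming the date column is still called Start and calendar is the data frame from the question):

# keep only the days present in the calendar and rename to match the requested output
zf <- zf %>% filter(Start %in% calendar$Date) %>% rename(Date = Start)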

Hope this helps!

Steve
  • Hi, thanks for the answer. The End date is relevant. For example, on the 8th of January there is a stock of 1 for Area 7065 because there is a transaction open (5th row of 'df'), meaning 2018-01-03 <= 2018-01-08 and End == null. For the 2nd of January there is a stock of 4 for Area 7065 (accounting for the 2nd, 3rd, 4th and 5th rows of 'df'). – AleG Jul 12 '19 at 08:39
  • I see your point now - i misunderstood the complexity then. I will get back to you. – Steve Jul 12 '19 at 08:45
  • Question: On your calendar dataframe you have 2019 and on your df you have 2018. Is this a typo? Otherwise all the days will be greater than your df$End. I am editing my response currently to reflect your comments. – Steve Jul 12 '19 at 10:13
  • Check my edited response, let me know if this is what you are looking for. – Steve Jul 12 '19 at 10:25
  • Thanks a lot for the help Steve, that definitely works. The only caveat is that populating the missing dates makes the table much bigger (imagine a production table like this where a transaction has an average lead time of 60 days, the table would increase its size by 60!) However I think I can manage that with careful filtering of the data. Thanks! – AleG Jul 12 '19 at 10:40
  • Glad it helps! I made a mistake in the grouping, by the way: you need to use `tally(Quantity)` instead of `count()`, since we want the sum of Quantity and not just the number of dates appearing. I made a final edit. Good luck and keep up the good work! – Steve Jul 12 '19 at 10:42