0

Have a look at the simplified table below. I want for each product a vector containing the quantities sold within each delivery time. A delivery time is defined as 4 days. So if we look at product A, we see that it starts at 03/12/15 and within the first delivery term (until 07/12/15) it has sold a quantity of 4. The second delivery term starts at 08/12/15 and ends at 12/12/15. So for this period there is 1 quantity sold. The following delivery term starts at 13/12/15 and ends at 17/12/15. During these period there are no quantities sold and thus for this period the vector must have a value of 0. In the last period, finally, 2 products are sold. So basically the problem here is that information regarding the periods were no products are sold is missing.

Any ideas on how the vector I want can be created using R? I've been thinking of for or while loops, but these do not seem to give the requested results. Note that the code must be applicable on a real dataset containing over 1000 product categories, so it has to be 'automatized' in one way.

I would be very gratefull if somebody could point me in the right direction.

Product    Quantity     Date      

A          1            03/12/15  
A          2            04/12/15  
A          1            05/12/15  
A          1            08/12/15  
A          1            17/12/16  
A          1            18/12/16  
B          1            19/12/15  
B          2            10/05/15  
B          2            11/05/15  
C          1            01/06/15  
C          1            02/06/15  
C          1            12/06/15  
user3387899
  • 601
  • 5
  • 18
  • 3
    I have a feeling that this kind of question has already been answered... – maj Aug 26 '15 at 11:45
  • Search for `data aggregation [r]` – ExperimenteR Aug 26 '15 at 11:50
  • Please provide a reproducible example with what exactly the output would look like. Have a look [here](http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array) if you need any tips on how to do that. – Roman Luštrik Aug 26 '15 at 11:59

3 Answers3

1

Assume that dt is the dataset you provided. You'll get a better understanding of the process if you run it step by step (and maybe with an even simpler dataset).

library(lubridate)
library(dplyr)


# create date time columns
dt$Date = dmy(dt$Date)


dt %>% 
  group_by(Product) %>% 
  do(data.frame(days = seq(min(.$Date), max(.$Date), by="1 day"))) %>%     # create all combinations between product and days
  mutate(dist = as.numeric(difftime(days,min(days), units="days"))) %>%    # create distance of each day with min date
  ungroup() %>%
  left_join(dt, by=c("Product"="Product","days"="Date")) %>%              # join info to get quantities for each day
  mutate(Quantity = ifelse(is.na(Quantity), 0, Quantity),                # replace NAs with 0s
         id = floor(dist/5 + 1)) %>%                                    # create the 4 period id
  group_by(Product, id) %>%
  summarise(Sum = sum(Quantity),
            min_date = min(days),
            max_date = max(days)) %>%
  ungroup


#    Product id Sum   min_date   max_date
# 1        A  1   4 2015-12-03 2015-12-07
# 2        A  2   1 2015-12-08 2015-12-12
# 3        A  3   0 2015-12-13 2015-12-17
# 4        A  4   0 2015-12-18 2015-12-22
# 5        A  5   0 2015-12-23 2015-12-27
# 6        A  6   0 2015-12-28 2016-01-01
# 7        A  7   0 2016-01-02 2016-01-06
# 8        A  8   0 2016-01-07 2016-01-11
# 9        A  9   0 2016-01-12 2016-01-16
# 10       A 10   0 2016-01-17 2016-01-21
# ..     ... .. ...        ...        ...

First row of the output tells you that for product A in the first 4 days period (id = 1) you had 4 quantities in total and the period is from 3/12 to 7/12.

AntoniosK
  • 15,991
  • 2
  • 19
  • 32
1

I would suggest {dplyr}'s summarise(),mutate() and group_by() functions. group_by() groups your data by desired variables (in your case - product and delivery term),mutate() allows operations on grouped columns, and summarise() applies a summarising function over these groups (in your case sum(Quantity)).

So this is how it will look:

convert date into proper format:

library(dplyr) df=tbl_df(df) df$Date=as.Date(df$Date,format="%d/%m/%y")

calculating delivery terms

df=group_by(df,Product) %>% arrange(Date) df=mutate(df,term=1+unclass((Date-min(Date)))%/%4)

group by product and terms and calculate sum of quantity:

df=group_by(df,Product,term) summarise(df,sum=sum(Quantity))

Maksim Gayduk
  • 1,051
  • 6
  • 13
0

Here's a base R way:

df$groups <- ave(as.numeric(df$Date), df$Product, FUN=function(x) {
  intrvl <- findInterval(x, seq(min(x), max(x),4))
  as.numeric(factor(intrvl))
  })
df
#    Product Quantity       Date groups
# 1        A        1 2015-12-03      1
# 2        A        2 2015-12-04      1
# 3        A        1 2015-12-05      1
# 4        A        1 2015-12-08      2
# 5        A        1 2016-12-17      3
# 6        A        1 2016-12-18      3
# 7        B        1 2015-12-19      2
# 8        B        2 2015-05-10      1
# 9        B        2 2015-05-11      1
# 10       C        1 2015-06-01      1
# 11       C        1 2015-06-02      1
# 12       C        1 2015-06-12      2

The dates should be converted to one of the date classes. I chose as.Date. When it converts to numeric, the output will be the number of days from a specified date. From there, we are able to group by 4 day increments.

Data

df$Date <- as.Date(df$Date, format="%d/%m/%y")
Pierre L
  • 28,203
  • 6
  • 47
  • 69