3

Sample Data:

 product_id <- c("1000","1000","1000","1000","1000","1000", "1002","1002","1002","1002","1002","1002")
    qty_ordered <- c(1,2,1,1,1,1,1,2,1,2,1,1)
    price <- c(2.49,2.49,2.49,1.743,2.49,2.49,  2.093,2.093,2.11,2.11,2.11, 2.97)
    date <- c("2/23/15","2/23/15",  '3/16/15','3/16/15','5/16/15',  "6/18/15",  "2/19/15","3/19/15","3/19/15","3/19/15","3/19/15","4/19/15")
    sampleData <- data.frame(product_id,    qty_ordered,    price,  date)

I would like to identify every time when a change in a price occurred. Also, I would like to sum() the total qty_ordered between those two price change dates. For example, For product_id == "1000", price changed occurred on 3/16/15 from $2.49 to $1.743. The total qty_ordered is 1+2+1=4; the difference between those two earliest date of price change is from 2/23/15 to 3/16/15 which is 21 days.

So the New Data Frame should be:

product_id sum_qty_ordered price    date_diff 
1000           4          2.490             21 
1000           1           1.743            61 
1000           2           2.490            33 

Here are what I have tried:

**NOTE: for this case, a simple "dplyr::group_by" will not work since it will ignore the date effect.

1) I found this code from Determine when columns of a data.frame change value and return indices of the change: This is to identify every time when the price changed, which identify the first date when the price changed for each product.

IndexedChanged <- c(1,which(rowSums(sapply(sampleData[,3],diff))!=0)+1)
sampleData[IndexedChanged,]

However, I am not sure how to calculate the sum(qty_ordered) and the date difference for each of those entries if I use that code.

2) I tried to write a WHILE loop to temporarily store each batch of product_id, price, range of dates (e.g. a subset of data frame with one product_id, one price, and all entries ranged from the earliest date of price change till the last date of price before it changed), and then, summarise that subset to get sum(sum_qty_ordered) and the date diff. However, I think I always am confused by WHILE and FOR, so my code has some problems in it. Here is my code:

create an empty data frame for later data storage

 NewData_Ready <- data.frame(
                     product_id = character(),
                     price = double(),
                     early_date = as.Date(character()),
                     last_date=as.Date(character()),
                     total_qty_demanded = double(),                          
                     stringsAsFactors=FALSE) 

create a temp table to store the batch price order entries

 temp_dataset <- data.frame(
                     product_id = character(),
                     qty_ordered = double(),
                     price = double(),
                     date=as.Date(character()),                                  
                     stringsAsFactors=FALSE) 

loop: This is messy...and probably not make sense, so I do really help on this.

for ( i in unique(sampleData$product_id)){
    #for each unique product_id in the dataset, we are gonna loop through it based on product_id
    #for first product_id which is "1000"
    temp_table <- sampleData[sampleData$product_id == "i", ] #subset dataset by ONE single product_id
    #this dataset only has product of "1000" entries

    #starting a new for loop to loop through the entire entries for this product
    for ( p in 1:length(temp_table$product_id)){

        current_price <- temp_table$price[p] #assign current_price to the first price value
        #assign $2.49 to current price. 
        min_date <- temp_table$date[p] #assign the first date when the first price change
        #assign 2015-2-23 to min_date which is the earliest date when price is $2.49

        while (current_price == temp_table$price[p+1]){
        #while the next price is the same as the first price 
        #that is, if the second price is $2.49 is the same as the first price of $2.49, which is TRUE
        #then execute the following statement

            temp_dataset <- rbind(temp_dataset, temp_table[p,])
            #if the WHILE loop is TRUE, means every 2 entries have the same price
            #then combine each entry when price is the same in temp_table with the temp_dataset

            #if the WHILE loop is FALSE, means one entry's price is different from the next one
            #then stop the statement at the above, but do the following
            current_price <- temp_table$price[p+1]
            #this will reassign the current_price to the next price, and restart the WHILE loop

            by_idPrice <- dplyr::group_by(temp_dataset, product_id, price)
            NewRow <- dplyr::summarise(
                                early_date = min(date),
                                last_date = max(date),
                                total_qty_demanded = sum(qty_ordered))
            NewData_Ready <- rbind(NewData_Ready, NewRow)



        }
    }

}

I have searched a lot on related questions but I have not found anything that are related to this problem yet. If you have some suggestions, please let me know. Also, please provide some suggestions on the solution to my questions. I would greatly appreciate your time and help!

Here is my R version:
platform       x86_64-apple-darwin13.4.0   
arch           x86_64                      
os             darwin13.4.0                
system         x86_64, darwin13.4.0        
status                                     
major          3                           
minor          3.1                         
year           2016                        
month          06                          
day            21                          
svn rev        70800                       
language       R                           
version.string R version 3.3.1 (2016-06-21)
nickname       Bug in Your Hair      
Community
  • 1
  • 1
lemonC
  • 78
  • 10
  • Can you explain the date difference more? `21 60 94` doesn't seem correct for the example product id. – Pierre L Aug 29 '16 at 18:47
  • @PierreLafortune For product_id "1000" at price of $2.490, which is a date ranged from 2/23/15 (earliest) till 3/16/2015 (latest date before the price changed to $1.743). So the date between 2/23/15-3/16/15 is about 21 days. Then move to next price for product_id "1000", and same logic to calculate the date and got 61 days (Oops typo above) between 3/16/15-5/16/15 and so on. Please let me know if this makes sense. Thanks for your help!! – lemonC Aug 29 '16 at 19:05
  • But there was no price change for `6/18/15`. Why was `94` coded as a price change gap as in the others? – Pierre L Aug 29 '16 at 19:17
  • @PierreLafortune You are right. My fault. It should be 33 days between 5/16/15-6/18/15 price change from 1.743-2.490 assuming the product's price remains the same since 3/16/15 as $1.743. So 5/16/2015 will the first date when price changed. – lemonC Aug 29 '16 at 20:01

1 Answers1

4

Using data.table:

library(data.table)
setDT(sampleData)

Some Preprocessing:

sampleData[, firstdate := as.Date(date, "%m/%d/%y")]

Based on how you calculate date diff, we are better off creating a range of dates for each row:

sampleData[, lastdate := shift(firstdate,type = "lead"), by = product_id]
sampleData[is.na(lastdate), lastdate := firstdate]
# Arun's one step: sampleData[, lastdate := shift(firstdate, type="lead", fill=firstdate[.N]), by = product_id]

Then create a new ID for every change in price:

sampleData[, price_id := cumsum(c(0,diff(price) != 0)), by = product_id]

Then calculate your groupwise functions, by product and price run:

sampleData[,
           .(
             price = unique(price),
             sum_qty = sum(qty_ordered),
             date_diff = max(lastdate) − min(firstdate) 
           ),
           by = .(
             product_id,
             price_id
           )
           ]

   product_id price_id price sum_qty date_diff
1:       1000        0 2.490       4   21 days
2:       1000        1 1.743       1   61 days
3:       1000        2 2.490       2   33 days
4:       1002        0 2.093       3   28 days
5:       1002        1 2.110       4   31 days
6:       1002        2 2.970       1    0 days

I think the last price change for 1000 is only 33 days, and the preceding one is 61 (not 60). If you include the first day it is 22, 62 and 34, and the line should read date_diff = max(lastdate) − min(firstdate) + 1

Chris
  • 6,302
  • 1
  • 27
  • 54
  • thank you so much! This is perfect for what I want! One question, would you mind explaining this code `cumsum(c(0,diff(price) != 0))` a bit of more please? I know cumsum() but not quite understand the diff() with not equal to 0. – lemonC Aug 29 '16 at 20:04
  • @lemonC Sure: first we take `diff(price)`, which calculates the difference between all values in a vector. With `diff(price) != 0`, I convert this to a vector of `T/F` with `TRUE` if the price is different (i.e. the start of a new group). I then concatenate zero to the start of this, as diff gave `n - 1` values in the return. This step also converts the `T/F`s to `1/0`. Now I have a vector with 1 at every position of a price change. Taking the `cumsum` of this creates an autoincrement id for every group. – Chris Aug 29 '16 at 20:26
  • 3
    `shift(firstdate,type = "lead")` can be changed to `shift(firstdate, type="lead", fill=firstdate[.N])` so that the next step can be avoided, i.e., fill the NA entry with the last value directly. – Arun Aug 29 '16 at 20:27
  • @Chris wow, I never thought about this way. This is amazing! Thank you so much Chris! Very helpful! Really appreciate it! – lemonC Aug 29 '16 at 23:35
  • @Arun I tried the code! It's so convenient this way. Would you mind explaining more what does the [.N] means? Especially the dot (.) in front of the N please? – lemonC Aug 29 '16 at 23:37
  • @lemonC, glad! Have a look at the [vignettes](https://github.com/Rdatatable/data.table/wiki/Getting-started). Particularly the first (Introduction to data.table). Your specific question is answered under `1f`, but reading through the entire thing should take about 10-15 min.. – Arun Aug 29 '16 at 23:43
  • 1
    @Arun Thanks for the information! I will definetly look through the document and probably take the Data.Table class on Data Camp. Thanks for your work on this amazing package in R! You rock! – lemonC Aug 30 '16 at 15:57
  • Hi @Arun, I have a follow up question on identify when a value in a column changed. Above example is about date changes/integer changes. What about identify when a character value changes/remain the same? Is there any function is Data.Table for this process? My thought was, find a function which returns T/F when evaluate whether character value are the same or not, and replace it with diff(price). – lemonC Jan 17 '17 at 06:58