1

Suppose I have an ordered data frame that looks like this:

df <- data.frame(customer = c('cust1','cust1','cust2','cust3','cust3'),
start_month = as.Date(c('2016-03-01','2017-08-01','2016-03-01','2017-07-01','2017-10-01')),
price = c(29,29,59,99,59),
end_month = as.Date(c('2017-08-01',NA,'2017-09-01','2017-09-01',NA)));

How can I write a script in R with the following business rule: if a customer ended and started in the same month, and the price did not change, remove the latest transaction. Else, keep the transaction. The resulting data frame would look like this:

new_df <- data.frame( customer = c('cust1','cust2','cust3','cust3'),
start_date = as.Date(c('2016-03-01','2016-03-01','2017-07-01','2017-10-01')),
price = c(29,59,99,59),
end_date = as.Date(c(NA,'2017-09-01','2017-09-01',NA)));

In this example, cust1's 2017-08-01 is ignored and filtered out because the price is the same as their previous transaction. However, cust3's transaction is kept because the price is different.

How can I do this in R?

Saul Feliz
  • 668
  • 3
  • 11
  • 20

1 Answers1

1
library(dplyr)
df <- df %>% group_by(customer) %>% mutate(change = lag(price) - price)

> df
# A tibble: 5 x 5
# Groups:   customer [3]
  customer start_month price  end_month change
    <fctr>      <date> <dbl>     <date>  <dbl>
1    cust1  2016-03-01    29 2017-08-01     NA
2    cust1  2017-08-01    29         NA      0
3    cust2  2016-03-01    59 2017-09-01     NA
4    cust3  2017-07-01    99 2017-09-01     NA
5    cust3  2017-10-01    59         NA     40

The first entry for a customer is always NA, and we will keep those. And we will remove the rows where the price did not change:

df <- df %>% filter(is.na(change) | change != 0)

> df
# A tibble: 4 x 5
# Groups:   customer [3]
  customer start_month price  end_month change
    <fctr>      <date> <dbl>     <date>  <dbl>
1    cust1  2016-03-01    29 2017-08-01     NA
2    cust2  2016-03-01    59 2017-09-01     NA
3    cust3  2017-07-01    99 2017-09-01     NA
4    cust3  2017-10-01    59         NA     40

All in one command:

library(dplyr)
df <-
  df %>% 
  group_by(customer) %>% 
  mutate(change = lag(price) - price) %>% 
  filter(is.na(change) | change != 0)

EDIT:

I forgot to check if the date changes:

library(dplyr)
df <-
  df %>% 
  group_by(customer) %>% 
  mutate(change = lag(price) - price) %>%
  mutate(date_change = lag(end_month) - start_month) %>%
  filter((is.na(change) | change != 0) | (is.na(date_change) | date_change != 0))

this will keep every first entry and remove rows where the starting date is the same as the previous ending date AND the price did not change.

f.lechleitner
  • 3,554
  • 1
  • 17
  • 35
  • ahh yes. I see. This makes sense. I see I need to use dplyr as a 'do this, then do this, then do this' type of thinking. Would you say that's correct? – Saul Feliz Jan 03 '18 at 16:08
  • what if I want to remove every transaction associated with that customer, instead of keeping one? Filter seems to be like SELECT in SQL where it keeps. How do I remove something that meets the criteria I want, besides using the opposite logical operators? – Saul Feliz Jan 03 '18 at 21:35
  • the type of thinking is correct :) I'm not aware of a way to remove certain rows besides using filter and telling it which rows you want to keep. In this question they've used `anti_join()`, maybe that's worth a look: https://stackoverflow.com/questions/45661377/delete-rows-based-on-multiple-conditions-with-dplyr. If you wanted to remove every transaction of customer 3 you could just do this: `df %>% filter(customer != "cust3")` – f.lechleitner Jan 04 '18 at 08:49