1
customer_id  transaction_id    month  year 
          1    3                7     2014
          1    4                7     2014
          2    5                7     2014
          2    6                8     2014
          1    7                8     2014
          3    8                9     2015
          1    9                9     2015
          4    10               9     2015
          5    11               9     2015
          2    12               9     2015

I am well familiar with R basics. Any help will be appreciated.

the expected output should look like following:

month   year  number_unique_customers_added
 7      2014     2
 8      2014     0
 9      2015     3

In the month 7 and year 2014, only customers_id 1 and 2 are present, so number of customers added is two. In the month 8 and year 2014, no new customer ids are added. So there should be zero customers added in this period. Finally in year 2015 and month 9, customer_ids 3,4 and 5 are the new ones added. So new number of customers added in this period is 3.

Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • `library(dplyr); df %>% group_by(month, year) %>% summarise(new_cus = n_distinct(customer_id)) ` – Ronak Shah Oct 20 '18 at 09:17
  • @RonakShah the solution provided here only gives number of unique customers in a month of the given year. However, there could be overlap of the customers between any two months. So this solution doesn't answer my question. right ? – Kumar Ashutosh Oct 24 '18 at 09:03
  • yes, maybe. Can you update your post with your expected output ? If the answer provided by me doesn't solve your question I will reopen it. – Ronak Shah Oct 24 '18 at 09:07
  • @RonakShah pls have a look at updated question. – Kumar Ashutosh Oct 24 '18 at 09:39

2 Answers2

0

Using data.table:

require(data.table)

dt[, .SD[1,], by = customer_id][, uniqueN(customer_id), by = .(year, month)]

Explanation: We first remove all subsequent transactions of each customer (we're interested in the first one, when she is a "new customer"), and then count unique customers by each combination of year and month.

PavoDive
  • 6,322
  • 2
  • 29
  • 55
0

Using dplyr we can first create a column which indicates if a customer is duplicate or not and then we group_by month and year to count the new customers in each group.

library(dplyr)
df %>%
  mutate(unique_customers = !duplicated(customer_id)) %>%
  group_by(month, year) %>%
  summarise(unique_customers = sum(unique_customers))

#  month  year unique_customers
#  <int> <int>            <int>
#1     7  2014                2
#2     8  2014                0
#3     9  2015                3
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213