5

I am trying to add a column to an existing dataframe, such that the column defines the number of different products, each user has bought. A toy example is

Customer    Product
1           Chocolate
1           Candy
1           Soda
2           Chocolate
2           Chocolate
2           Chocolate
3           Insulin
3           Candy

Where the output should be

Customer    Product     #Products
1           Chocolate   3
1           Candy       3
1           Soda        3
2           Chocolate   1
2           Chocolate   1
2           Chocolate   1
3           Insulin     2
3           Candy       2

I would like to do this without a for loop, since I have millions of rows, and it would take forever. I have used data.table and other methods in order to just get the number of products for each customer, but I don't know how to easily add this as a column to the existing dataframe.

Thanks in advance!

tkoz_dk
  • 239
  • 1
  • 4
  • 13
  • 1
    Start with the [HTML vignettes](https://github.com/Rdatatable/data.table/wiki/Getting-started) to get started with data.table.. – Arun Jun 02 '15 at 11:34
  • Thank you @Arun - I have been looking for something like that! – tkoz_dk Jun 02 '15 at 12:04

3 Answers3

3

In base R I would suggest ave:

within(mydf, {
    count = ave(Product, Customer, FUN = function(x) length(unique(x)))
})
##   Customer   Product count
## 1        1 Chocolate     3
## 2        1     Candy     3
## 3        1      Soda     3
## 4        2 Chocolate     1
## 5        2 Chocolate     1
## 6        2 Chocolate     1
## 7        3   Insulin     2
## 8        3     Candy     2

You could also try the "data.table" package:

library(data.table)
as.data.table(mydf)[, count := length(unique(Product)), by = Customer][]
##    Customer   Product count
## 1:        1 Chocolate     3
## 2:        1     Candy     3
## 3:        1      Soda     3
## 4:        2 Chocolate     1
## 5:        2 Chocolate     1
## 6:        2 Chocolate     1
## 7:        3   Insulin     2
## 8:        3     Candy     2
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
  • Beautiful Ananda! I had been trying to make it work with the data.table (which is completely new to me), since it is so fast, but I could not understand the syntax to get all the columns of variables out as well as the result. Could you explain the syntax, you have used? – tkoz_dk Jun 02 '15 at 10:49
  • @tkoz_dk, I just edited the code slightly. Does it make more sense now? The `:=` is to assign by reference, and the `[]` at the end is just to print the output. – A5C1D2H2I1M1N2O1R2T1 Jun 02 '15 at 10:51
  • Yes, now I understand it. Thank you very much for your help! – tkoz_dk Jun 02 '15 at 10:54
  • @AnandaMahto which is the fastest and memory efficient – The6thSense Jun 02 '15 at 11:44
  • 1
    @VigneshKalai, "data.table". – A5C1D2H2I1M1N2O1R2T1 Jun 02 '15 at 13:33
  • Thanks @AnandaMahto := assign by reference I can't seem to understand it and can you explain the ave function a little – The6thSense Jun 02 '15 at 13:47
  • 1
    @VigneshKalai, please take a look at `??":="` to get an understanding of assignment by reference, and `?ave` to get an understanding of the `ave` function. Basically `ave` would run a function over a set of values grouped by another set of values. – A5C1D2H2I1M1N2O1R2T1 Jun 02 '15 at 13:51
  • Thanks for your precious time @AnandaMahto understood ave – The6thSense Jun 02 '15 at 13:53
  • @AnandaMahto Can the data.table also be used to create different numbers in the new column? For instance if I want a column that indicates a switch between products? Customer 1 would then have values "First.prod, switch, switch" and customer 2 would have values "First.prod, same, same" etc. Is this possible? I.e. a function that looks at a previous entry like match does. – tkoz_dk Jun 08 '15 at 09:07
1

You should be good with something like that (suppose df is your data):

df.agr=aggregate(Product~Customer,data=df, FUN=function(x) length(unique(x)))
df=cbind(df, Count=apply(df, MARGIN=1, FUN=function(r) df.agr$Product[match(r[1],df.agr$Customer)]))

It won't be blazing fast, but definitely faster than for.

cyberj0g
  • 3,707
  • 1
  • 19
  • 34
  • Thanks for the answer. It's almost there, I think - I have not formulated it clear enough in the initial question: The code gives me a count of the number of items a customer has bought, and i would like a count of the number of different items he/she has bought - see the output example for clarification. – tkoz_dk Jun 02 '15 at 10:17
  • And it works like a charm! Thanks! – tkoz_dk Jun 02 '15 at 10:25
1

I use plyr for anything that involves split-apply-combine. In this case, we split the data by Customer and apply the length-unique function on Product, then combine the results

require(plyr)
ddply(df, .(Customer), transform, num.products = length(unique(Product)))

  Customer   Product num.products
1        1 Chocolate            3
2        1     Candy            3
3        1      Soda            3
4        2 Chocolate            1
5        2 Chocolate            1
6        2 Chocolate            1
7        3   Insulin            2
8        3     Candy            2

Bonus in case you want a smaller summary dataframe out of this.

ddply(df, .(Customer), summarize, num.products = length(unique(Product)))

  Customer num.products
1        1            3
2        2            1
3        3            2
selwyth
  • 2,417
  • 16
  • 19