
Some time ago I asked a question about creating market basket data. Now I would like to create a similar data.frame, but based on a third variable. Unfortunately, I am running into problems trying to do so. Previous question: Efficient way to create market basket matrix in R
@shadow and @SimonO101 gave me good answers, but I was not able to alter their answers correctly. I have the following data:

Customer <- as.factor(c(1000001,1000001,1000001,1000001,1000001,1000001,1000002,1000002,1000002,1000003,1000003,1000003))
Product <- as.factor(c(100001,100001,100001,100004,100004,100002,100003,100003,100003,100002,100003,100008))
input <- data.frame(Customer,Product)

I can now create a contingency table in the following way:

input_df <- as.data.frame.matrix(table(input))
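
For reference, with the sample data above this gives a data.frame of counts per Customer/Product combination (output sketched here; row and column order follow the default factor level ordering):

input_df
#         100001 100002 100003 100004 100008
# 1000001      3      1      0      2      0
# 1000002      0      0      3      0      0
# 1000003      0      1      1      0      1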

However, I have a third (numeric) variable that I want as the values in the table.

Number <- c(3,1,-4,1,1,1,1,1,1,1,1,1) 
input <- data.frame(Customer,Product,Number)

Now the code above no longer works (of course, since there are now three variables). The result I am looking for has the unique Customer values as row names and the unique Product values as column names, and has Number as the cell value (or 0 if that combination is not present). This number could be calculated by:

input_agg <- aggregate( Number ~ Customer + Product, data = input, sum)
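
For the sample data this produces the long-format sums shown below (note that 3 + 1 - 4 = 0 for customer 1000001 and product 100001; the exact row order may differ):

input_agg
#   Customer Product Number
# 1  1000001  100001      0
# 2  1000001  100002      1
# 3  1000003  100002      1
# 4  1000002  100003      3
# 5  1000003  100003      1
# 6  1000001  100004      2
# 7  1000003  100008      1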

I hope my question is clear; please comment if something is not.

– Freddy

2 Answers


You can use `xtabs` for that:

R> xtabs(Number~Customer+Product, data=input)

         Product
Customer  100001 100002 100003 100004 100008
  1000001      0      1      0      2      0
  1000002      0      0      3      0      0
  1000003      0      1      1      0      1
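
The result is a table object, not a data.frame. If you need a plain data.frame, as in the question, the same coercion the question already uses should work (a sketch; untested on very large data):

input_df <- as.data.frame.matrix(xtabs(Number ~ Customer + Product, data = input))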
– juba
  • @SimonO101 In fact, I just discovered it :-) – juba Oct 22 '13 at 14:31
  • I think it's not a question of efficiency but of how much RAM you have on your system. Whatever method you use to compute it, you will always end up with a 90000x2000 table to store in memory... – juba Oct 22 '13 at 15:09
  • @juba, That is correct indeed, I should have known that it wouldn't fit. – Freddy Oct 22 '13 at 15:18

This class of problem is exactly what reshape2::dcast is designed for...

require( reshape2 )
#  For very large data, see the data.table version below.
dcast( input , Customer ~ Product , fun = sum , value.var = "Number" )
#  Customer 100001 100002 100003 100004 100008
#1  1000001      0      1      0      2      0
#2  1000002      0      0      3      0      0
#3  1000003      0      1      1      0      1

Recently, a method for using dcast with data.table objects was implemented by @Arun in response to FR #2627. Great stuff. You will have to use the development version, 1.8.11. Also, at the moment it has to be called as dcast.data.table, because dcast is not yet an S3 generic in the reshape2 package. That is, you can do:

require(reshape2)
require(data.table)
input <- data.table(input)   
dcast.data.table(input , Customer ~ Product , fun = sum , value.var = "Number")
#    Customer 100001 100002 100003 100004 100008
# 1:  1000001      0      1      0      2      0
# 2:  1000002      0      0      3      0      0
# 3:  1000003      0      1      1      0      1

This should work quite well on bigger data and should be much faster than reshape2::dcast as well.
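
If you want to verify the speed difference on your own machine, a minimal sketch along these lines could be used (the sizes and the variable names big, bigDT, res1, res2 are made up for illustration; they are not from the original post):

# simulate larger basket data; these sizes are made up for illustration
set.seed(1)
n <- 1e6
big <- data.frame(Customer = factor(sample(5e3, n, replace = TRUE)),
                  Product  = factor(sample(500, n, replace = TRUE)),
                  Number   = rnorm(n))
# reshape2 version on a data.frame
system.time(res1 <- dcast(big, Customer ~ Product, fun = sum, value.var = "Number"))
# data.table version (needs the development version 1.8.11)
bigDT <- data.table(big)
system.time(res2 <- dcast.data.table(bigDT, Customer ~ Product, fun = sum, value.var = "Number"))
# assigning the results (rather than printing them) avoids flooding the console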


Alternatively, you can try the reshape::cast version, which may or may not crash... Try it!!

require(reshape)
input <- data.table( input )
cast( input , Customer ~ Product , fun = sum , value = "Number" )
– Simon O'Hanlon (edited by Arun)
  • Fast response, this works indeed. However, it makes my R session crash instantly. I updated my question. – Freddy Oct 22 '13 at 14:49
  • @Freddy see update. Use a `data.table`. dcast methods have been written for these. Should be fast and avoid crashing your computer! – Simon O'Hanlon Oct 22 '13 at 14:54
  • @Freddy but with a `data.table` rather than a `data.frame`? That is an important difference! Also, if you don't assign the output to a variable it will try to print to console and it will probably crash, so try and copy/paste this... `input <- data.table( input ); output <- dcast( input , Customer ~ Product , fun = sum , value.var = "Number" )` – Simon O'Hanlon Oct 22 '13 at 15:05
  • Hmmmm, why don't you try with a subset of the data, e.g. the first `100k` rows? e.g. `dcast( input[ 1:1e5 , ] ... )` see if that works and build up from there? – Simon O'Hanlon Oct 22 '13 at 15:10
  • @Freddy perhaps you are seeing [**this problem**](http://stackoverflow.com/a/16136390/1478381)... too many groups. – Simon O'Hanlon Oct 22 '13 at 15:19
  • yes that looks like the same problem. A problem I should have seen coming, as juba mentioned. Thanks for your effort, I accepted juba's answer for now as it only uses base R. I will check the performance of both methods and accept the fastest :) – Freddy Oct 22 '13 at 15:21
  • Yes, that works! It gives an "Error: cannot allocate vector of size 630.8 Mb", which was to be expected now. But it does not crash my R anymore. – Freddy Oct 22 '13 at 15:28
  • Alternatively, if they can get to the `input_agg` step from their question, why not just use `dcast` on that? `dcast(input_agg, Customer ~ Product, value.var="Number", fill = 0)` – A5C1D2H2I1M1N2O1R2T1 Oct 22 '13 at 17:24
  • @AnandaMahto they can't. Check the edits. I believe that crashes. – Simon O'Hanlon Oct 22 '13 at 17:26
  • @SimonO101, made some edits, cleared up some things better. Hope it's alright. – Arun Oct 22 '13 at 19:23
  • @Arun thanks very much for clearing that up. Hopefully this solution will work now... – Simon O'Hanlon Oct 22 '13 at 21:23
  • @Freddy see Arun's edit... use `dcast.data.table` not `dcast`. This will hopefully work now. – Simon O'Hanlon Oct 22 '13 at 21:23