
I am working on data frame transformations and was working with Arun and Ricardo on this in a previous post:

Previous Post

Arun suggested a brilliant solution (matrix multiplication) to achieve what I was trying to do.

That solution worked for a small data set like the one in my example. Now I am running the same solution on a data frame with the following sizes:

Total rows: 143345
Total Persons: 98461
Total Items: 30

Now, when I run the following command:

 A <- acast(Person~Item+BorS,data=df,fun.aggregate=length,drop=FALSE)

I get this error:

Error: segfault from C stack overflow

Is this because I don't have enough processing power or memory? My machine (a MacBook) has 4 GB of RAM and a 2.8 GHz i7 processor. How do we handle cases like this?

user2171177

1 Answer

A data.table solution. This works by aggregating first, then creating the new data.table and filling it in by reference.

library(data.table)

# some sample data
DT <- data.table(
  Person = sample(98461, 144000, replace = TRUE),
  item   = sample(c(letters, LETTERS[1:4]), 144000, replace = TRUE),
  BorS   = sample(c('B', 'S'), 144000, replace = TRUE)
)
# aggregate to get the number of rows in each subgroup by Person, item and BorS
# (.N is the `length` of each subgroup)
DTl <- DT[,.N , by = list(Person, item, BorS)]
# the columns you want to create
newn <- sort(DT[, do.call(paste0,do.call(expand.grid,list(unique(item),unique(BorS) )))])
# create a column which has this id combination in DTl
DTl[, comnb := paste0(item, BorS)]
# set the key so we can join / subset easily
setkey(DTl, comnb)
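# create a data.table that has 1 row for each person, and has columns for all the combinations
# of item and BorS; unique() is needed because DTl has one row per Person/item/BorS group,
# so a Person can appear several times
DTb <- unique(DTl[, list(Person)])[, c(newn) := 0L]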
# set the key so we can join / subset easily
setkey(DTb, Person)
# this bit could be far quicker, but I think
# would require a feature request to data.table
for(nn in newn){
  # for each of the combinations, extract which persons have
  # this combination > 0
  pp <- DTl[list(nn), list(Person,N)]
  # for the people who have N > 0
  # assign the correct numbers in the correct column in DTb
  DTb[list(pp[['Person']]), c(nn) := pp[['N']]]
}

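To complete your initial problem, you can extract the appropriate columns from DTb as a matrix:

# drop the Person column and convert to a matrix, since crossprod() needs a numeric matrix
A <- as.matrix(DTb[, -1, with = FALSE])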

results <- crossprod(A)
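
As a small sanity check of that last step, here is a sketch on a made-up 3-person matrix (the name `A_toy` and its values are hypothetical, purely for illustration). It shows that `crossprod(A)` is the same as `t(A) %*% A`, so each entry sums, over persons, the product of their counts in two item/BorS columns.

```r
# Hypothetical toy counts: 3 persons (rows) x 3 item/BorS combinations (columns)
A_toy <- matrix(c(1L, 0L, 2L,
                  0L, 1L, 1L,
                  1L, 1L, 0L),
                nrow = 3, byrow = TRUE,
                dimnames = list(NULL, c("aB", "aS", "bB")))

# crossprod(A_toy) is t(A_toy) %*% A_toy, computed without forming the transpose
co <- crossprod(A_toy)
stopifnot(all(co == t(A_toy) %*% A_toy))

# Entry ["aB", "bB"] is the sum over persons of count(aB) * count(bB):
# 1*2 + 0*1 + 1*0
co["aB", "bB"]
```

The result has one row and one column per item/BorS combination, so its size depends on the 30 items and not on the number of persons.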
mnel
  • Can you please explain your solution, starting with the second line? – user2171177 Mar 15 '13 at 03:27
  • @user2171177 -- I hope that makes a bit more sense. – mnel Mar 15 '13 at 03:32
  • @user2171177 -- No need to apologise. The answer is far better with the explanation there – mnel Mar 15 '13 at 03:54
  • What is .N in DTl <- DT[, .N, by = list(Person, item, BorS)]? Is it the number of occurrences? – user2171177 Mar 15 '13 at 04:04
  • It is a `data.table` special value which is the number of rows in the subset of the data.table (see `?data.table`, there are a few others too.) – mnel Mar 15 '13 at 04:05
  • My requirement is to generate a matrix of every combination of item and BorS. The above solution seems to produce a matrix of only those combinations that occur in the dataset. For example, if there is no entry for Item: a and BorS: B, I do not see aB anywhere in the matrix. – user2171177 Mar 15 '13 at 04:34
  • Well, if you know the columns you want to create (all possible combinations), then recreate `newn` to be a character vector of those names. – mnel Mar 15 '13 at 04:41
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/26213/discussion-between-user2171177-and-mnel) – user2171177 Mar 15 '13 at 04:46
  • Adding DTb <- unique(DTb) before the crossprod actually helped achieve what I was looking for. – user2171177 Mar 15 '13 at 16:00
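
Two points from the comments above can be sketched on a tiny example: what `.N` returns per group, and how `newn` could be rebuilt from a known, complete set of items. The names `toy`, `all_items`, `all_bors` and `newn_all` are hypothetical, and the full item set is an assumption you would replace with your own list.

```r
library(data.table)

# Hypothetical toy data: item "c" never occurs at all
toy <- data.table(Person = c(1L, 1L, 2L, 2L),
                  item   = c("a", "a", "a", "b"),
                  BorS   = c("B", "B", "S", "S"))

# .N is the number of rows in each (Person, item, BorS) group
counts <- toy[, .N, by = list(Person, item, BorS)]
print(counts)

# Assumed complete sets of values, including ones absent from the data
all_items <- c("a", "b", "c")
all_bors  <- c("B", "S")

# Every item/BorS combination as a column name, whether observed or not
newn_all <- sort(as.vector(outer(all_items, all_bors, paste0)))
print(newn_all)
```

Building the names from the known sets rather than from `unique(item)` gives columns for items that never appear in the data, which then simply stay at 0.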