
I have a data frame called `predictors` with the columns `session_id` and `item_id`.

I want to calculate the counts (in the whole data frame) for all items that belong to one particular session.

I have used the `aggregate` method like this:

popularity <- aggregate(predictors$item_id, 
                        FUN = function(items) {(table(predictors$item_id[predictors$item_id %in% items]))}, 
                        by = list(predictors$session_id))

This calculates, for each session, the list of counts (throughout `predictors`) of all items that belong to that session.

For example, if there are two records, `session1 - item1` and `session1 - item2`, I would like to get the list of counts (in the whole `predictors` data frame) of `item1` and `item2` against `session1` (something like `session1 - (10, 20)` when `item1` appears 10 times in the dataset, and so on).

I am getting this with the `aggregate` call above, but I want to make it faster using `data.table`.
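To make the desired behaviour concrete, here is a tiny reproducible example of the `aggregate` approach (hypothetical data; the real `predictors` is much larger):

```r
predictors <- data.frame(
  session_id = c("s1", "s1", "s2", "s2"),
  item_id    = c("i1", "i2", "i1", "i1")
)

# For each session, tabulate how often that session's items occur
# in the WHOLE data frame, not just within the session.
popularity <- aggregate(predictors$item_id,
                        FUN = function(items) {
                          table(predictors$item_id[predictors$item_id %in% items])
                        },
                        by = list(predictors$session_id))
# Session s1 contains i1 and i2, so its entry holds the global counts
# of both items (i1 occurs 3 times overall, i2 once); session s2
# contains only i1, so its entry holds only i1's global count.
```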

So far I have tried the following with `data.table`:

predictors_data.table <- data.table(predictors)
popularity <- predictors_data.table[ , list(p = table(predictors_data.table$item_id[items_list %in% item_id])), 
                                       by = c('session_id')]

but I am only getting the count for the first item, not for all items in a particular session.

exAres
Please show some small example data, and the desired result. Also I'd recommend starting with the new [Introduction to data.table](https://github.com/Rdatatable/data.table/wiki/Getting-started) HTML vignette. It should only take about 10 minutes... – Arun May 09 '15 at 09:39

2 Answers


Here is a simple way of achieving this using `dplyr`:

# devtools::install_github("trinker/wakefield")
library(wakefield)

wakefield::r_data_frame(n = 1000,
  session_id = r_sample(x = 1:10),
  item_id = r_sample(x = 1:10)
) %>%
  dplyr::count(item_id, session_id)

This gives the output:

Source: local data frame [100 x 3]
Groups: item_id

   item_id session_id  n
1        1          1  7
2        1          2 12
3        1          3 14
4        1          4  6
5        1          5 14
6        1          6  9
7        1          7  8
8        1          8  4
9        1          9  9
10       1         10  6
..     ...        ... ..
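If you specifically want the whole-data-frame count of each item, listed against every session that contains it (as described in the question), one way is to count the items globally and join those counts back per session. A sketch with made-up data (not part of the `wakefield` example above):

```r
library(dplyr)

predictors <- data.frame(
  session_id = c("s1", "s1", "s2"),
  item_id    = c("i1", "i2", "i1")
)

# Global count of each item across the whole data frame
item_counts <- predictors %>% dplyr::count(item_id)

# Attach those global counts to each distinct (session, item) pair
popularity <- predictors %>%
  dplyr::distinct(session_id, item_id) %>%
  dplyr::left_join(item_counts, by = "item_id")
```

Here `i1` occurs twice in the whole data frame, so both `s1` and `s2` get `n = 2` for it, while `s1` gets `n = 1` for `i2`.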
tchakravarty
  • thanks for the answer. As I am working on a large dataset, I am thinking of a way to do this using `data.table` – exAres May 09 '15 at 09:34
  • 1
    @Sangram This would work on a `data.table`. You might want to read [this](http://stackoverflow.com/questions/27511604/dplyr-on-data-table-am-i-really-using-data-table) though. – tchakravarty May 09 '15 at 09:37

Here's the `data.table` analogue of the `table` function:

predictors_data.table[,.N,by=c("session_id","item_id")]
#    session_id item_id   N
# 1:          1       1 106
# 2:          1       2  99
# 3:          1       3 115
# 4:          2       1 121
# 5:          2       2 110
# 6:          2       3 115
# 7:          3       1 122
# 8:          3       2 103
# 9:          3       3 109

However, `table` is a lot better visually; don't you want to see the margins?

with(predictors,table(session_id,item_id))
# or...
with(predictors_data.table,table(session_id,item_id))
#           item_id
# session_id   1   2   3
#          1 106  99 115
#          2 121 110 115
#          3 122 103 109

If you're just running this code once, I see no reason to prefer `.N` to `table`. If you want to store the counts, though, `predictors_data.table[, count := .N, by = c("session_id", "item_id")]` can be handy.
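Note that `.N` grouped by `c("session_id", "item_id")` counts occurrences *within* each session. If what you actually need is each item's count over the whole table, attached to every session containing it, a sketch along the same lines (self-contained toy data, since the original `predictors` isn't shown):

```r
library(data.table)

predictors_data.table <- data.table(
  session_id = c("s1", "s1", "s2"),
  item_id    = c("i1", "i2", "i1")
)

# Global count of each item over the whole table
item_counts <- predictors_data.table[, .(item_count = .N), by = item_id]

# One row per distinct (session, item) pair, carrying the item's global count
popularity <- merge(unique(predictors_data.table), item_counts, by = "item_id")
```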


Example data, copying @fgnu:

 require(wakefield)
 set.seed(1)
 predictors <- wakefield::r_data_frame(
   n = 1000,
   session_id = r_sample(x = 1:3),
   item_id = r_sample(x = 1:3)
 )
Frank