2

This question probably needs solving with data.table or dplyr. I have a dataset (data frame) which looks like this:

summary(mooc_events)
 signup_id         time                source               event        
 Min.   :     1   Min.   :2013-10-27   browser:3869940   access    :3112191  
 1st Qu.: 18721   1st Qu.:2013-12-19   server :4287337   discussion: 649259  
 Median : 48331   Median :2014-05-30                     navigate  :1009309  
 Mean   : 63476   Mean   :2014-04-05                     page_close:1237883  
 3rd Qu.:110375   3rd Qu.:2014-06-15                     problem   :1261170  
 Max.   :200905   Max.   :2014-08-01                     video     : 796958  
                                                         wiki      :  90507  
    artefact_sha         
 Length:8157277    
 Class :character  
 Mode  :character  

One signup_id have multiple events, so there are many rows which starts with the same signup_id.

What I'm trying to achieve is to get an aggregated dataset (data.table or data frame) with as much columns as there are distinct values per particular column, all grouped by signup_id, so for this data it would look like this:

signup_id, source_browser, source_server, event_access, event_discussion, ... , event_wiki, artefact_sha_{first_element_in_whole_dataset}, ..., artefact_sha_{last_element_in_whole_dataset}

1, 23, 37, 9, 0, ..., 3, 7, ..., 1
2, 2, 7, 2, 2, ..., 1, 0, ..., 0

in other words, it's counting occurrences for given set of columns, grouped by single column signup_id, I'm not interesting in grouping by e.g. both signup_id and source.

Column naming is not strict (_ can be substituted by anything which makes sense).

(let's skip the time column at the moment)

Best regards and thanks upfront.

oski86
  • 855
  • 1
  • 13
  • 35
  • I can't think of a problem that `dplyr` solves that you _couldn't_ solve in base R, unless it's performance-related. That does not, however, mean that you _should_ use base R to solve it. – shadowtalker Jun 13 '15 at 23:10
  • 1
    Also, [this answer](http://stackoverflow.com/a/25812054/2954547) combined with [`group_by`](http://www.rdocumentation.org/packages/dplyr/functions/group_by) should give you what you need if you want to go the `dplyr` route. If you manage to figure it out on your own, I'd encourage you to post the solution as an answer to your own question. It will help future users. – shadowtalker Jun 13 '15 at 23:14
  • I mentioned data.frame and dplyr because of both performance and code readability. I realize that base R is very often enough. I just saw lot of solutions in those packages and I thought it's worth mentioning names of packages.. – oski86 Jun 13 '15 at 23:14
  • I'll check it out, thanks for providing link to http://stackoverflow.com/questions/25811756/summarizing-counts-of-a-factor-with-dplyr/25812054#25812054 – oski86 Jun 13 '15 at 23:16
  • is `artefact_sha` unique for every row? – Rorschach Jun 13 '15 at 23:53
  • there are actually around 5 thousand unique `artefact_sha`s in the dataset so I think I will exclude it as well and simply make one column with an aggregate - count unique `artefact_sha` per `signup_id`. – oski86 Jun 14 '15 at 13:32

2 Answers2

3

It's more like a reshaping problem which could be solved using the tidyr and reshape2 libraries.

Reshaping with tidyr and counting occurences with reshape2 :

My example doesn't include artefact_sha because I didn't understand what you want to do with it.

library(dplyr) # Or library(magrittr) for the pipe syntax
library(tidyr)
library(reshape2)

set.seed(42)
mooc_events <- data.frame(signup_id = rep(1:3, each = 5), 
                    time = Sys.Date(), 
                    source = sample(c("browser", "server"), 15, rep = TRUE), 
                    event = sample(c("access", "discussion", "navigate"), 15, rep = TRUE), 
                    stringsAsFactors = FALSE)

mooc_events.m <- 
  mooc_events %>% 
  gather(key, value, -c(signup_id, time)) %>% 
  unite(var, key, value, sep = "_")

myTable <- dcast(mooc_events.m, signup_id ~ var, fun.aggregate = length)

> myTable
  signup_id event_access event_discussion event_navigate source_browser source_server
1         1            1                2              2              1             4
2         2            2                0              3              1             4
3         3            0                3              2              3             2
Julien Navarre
  • 7,653
  • 3
  • 42
  • 69
  • Sweet. This is what I was looking for, and for sure I need to spend some time studying this feature. After piping gather and unite I get `Warning message: attributes are not identical across measure variables; they will be dropped `. Should I be worried about it ? And after dcast: `Using var as value column: use value.var to override.` Result data looks really fine, but I don't get the warnings. – oski86 Jun 14 '15 at 13:25
  • 1
    The 1st warning result from the using of factors see : http://stackoverflow.com/questions/25688897/reshape2-melt-warning-message, be carefull with this one you maybe can experience strange behaviours. And the second message is OK, (it's not a warning), though you can hide it using `value.var = "var"` in `dcast` – Julien Navarre Jun 14 '15 at 14:06
1

Perhaps this will work. It's a combination of dplyr and reshape2. This only generates some of the variables you have. To include the rest of the variable you want to count across, just add them into the group_by call and the dcast, ie. dcast(tst, signup_id ~ source+event+...)

library(dplyr)
library(reshape2)

## First get counts for groupings of variables
tst <- mooc_events %>% group_by(signup_id, source, event) %>%
  dplyr::summarise(count=n())

## Then reshape data long -> wide
dcast(tst, signup_id ~ source+event)

#    signup_id browser_access browser_navigate browser_video browser_wiki
# 1          1              2               NA             1            2
# 2          2             NA               NA             2           NA
# 3          3              3                1            NA            3
# ...
#    server_access server_navigate server_video server_wiki
# 1             NA               1            3           1
# 2              3               2           NA           1
# 3             NA               4           NA           5

## Some sample data
mooc_events <- data.frame(
    signup_id=sample(1:10, 100, replace=T), 
    source=factor(sample(c("browser", "server"), 100, replace=T)),
    event=factor(sample(c("access","navigate","video","wiki"), 100, replace=T))
)
head(mooc_events)

#   signup_id  source    event
# 1         5 browser     wiki
# 2         4  server navigate
# 3         1 browser navigate
# 4         7 browser   access
# 5         8  server   access
# 6         5 browser     wiki
Rorschach
  • 31,301
  • 5
  • 78
  • 129