This question probably needs solving with data.table or dplyr. I have a dataset (a data frame) which looks like this:
summary(mooc_events)
   signup_id            time               source             event
 Min.   :     1   Min.   :2013-10-27   browser:3869940   access    :3112191
 1st Qu.: 18721   1st Qu.:2013-12-19   server :4287337   discussion: 649259
 Median : 48331   Median :2014-05-30                     navigate  :1009309
 Mean   : 63476   Mean   :2014-04-05                     page_close:1237883
 3rd Qu.:110375   3rd Qu.:2014-06-15                     problem   :1261170
 Max.   :200905   Max.   :2014-08-01                     video     : 796958
                                                         wiki      :  90507
 artefact_sha
 Length:8157277
 Class :character
 Mode  :character
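
For anyone who wants to experiment, here is a tiny made-up sample in the same shape (the values are invented; only the column names, types, and some of the factor levels follow the real data):

# miniature, made-up stand-in for mooc_events, just for trying things out
mooc_events <- data.frame(
  signup_id    = c(1L, 1L, 1L, 2L, 2L),
  source       = factor(c("browser", "server", "server", "browser", "server")),
  event        = factor(c("access", "access", "wiki", "discussion", "problem")),
  artefact_sha = c("a1b2", "a1b2", "c3d4", "a1b2", "e5f6"),
  stringsAsFactors = FALSE  # keep artefact_sha as character, as in the summary
)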
One signup_id has multiple events, so there are many rows which start with the same signup_id.
What I'm trying to achieve is an aggregated dataset (data.table or data frame) with as many columns as there are distinct values in each particular column, all grouped by signup_id, so for this data it would look like this:
signup_id, source_browser, source_server, event_access, event_discussion, ... , event_wiki, artefact_sha_{first_element_in_whole_dataset}, ..., artefact_sha_{last_element_in_whole_dataset}
1, 23, 37, 9, 0, ..., 3, 7, ..., 1
2, 2, 7, 2, 2, ..., 1, 0, ..., 0
In other words, it's counting occurrences for a given set of columns, grouped by the single column signup_id; I'm not interested in grouping by e.g. both signup_id and source.
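
A minimal sketch of the kind of call I imagine might work, using data.table's melt and dcast (I'm not sure this is the idiomatic way, or whether it scales to 8M rows):

library(data.table)
setDT(mooc_events)

# Melt the three categorical columns into long (variable, value) pairs,
# then count each (variable, value) combination per signup_id.
# dcast's default "_" separator yields names like source_browser, event_access.
long <- melt(mooc_events, id.vars = "signup_id",
             measure.vars = c("source", "event", "artefact_sha"))
wide <- dcast(long, signup_id ~ variable + value,
              value.var = "value", fun.aggregate = length)

The appeal of a single melt/dcast pass is that it covers all three columns at once, instead of one dcast per column merged back on signup_id, but maybe there is a cleaner way.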
Column naming is not strict (the _ separator can be substituted with anything that makes sense).
(Let's skip the time column for the moment.)
Best regards and thanks in advance.