
I am trying to obtain counts of each combination of levels of two variables, "week" and "id". I'd like the result to have "id" as rows, "week" as columns, and the counts as the values.

Example of what I've tried so far (I've also tried a bunch of other things, including adding a dummy variable = 1 and then using fun.aggregate = sum over it; see the sketch after the example data below):

library(plyr)
ddply(data, .(id), dcast, id ~ week, value_var = "id", 
        fun.aggregate = length, fill = 0, .parallel = TRUE)

However, I must be doing something wrong because this function is not finishing. Is there a better way to do this?

Input:

id      week
1       1
1       2
1       3
1       1
2       3

Output:

  1  2  3
1 2  1  1
2 0  0  1
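
For reference, the dummy-variable idea mentioned above could look something like this (a rough sketch using reshape2::dcast on the example data; the column name dummy is made up for illustration):

library(reshape2)

# add a constant column and sum it within each id/week cell
data$dummy <- 1
dcast(data, id ~ week, value.var = "dummy", fun.aggregate = sum, fill = 0)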
user592419

4 Answers


You could just use the table command:

table(data$id, data$week)

    1 2 3
  1 2 1 1
  2 0 0 1

If "id" and "week" are the only columns in your data frame, you can simply use:

table(data)
#    week
# id  1 2 3
#   1 2 1 1
#   2 0 0 1
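
One possible follow-up: if you need a plain data frame rather than a table object, the contingency table can be converted with base R's as.data.frame.matrix, e.g.:

as.data.frame.matrix(table(data$id, data$week))
#   1 2 3
# 1 2 1 1
# 2 0 0 1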
Joshua Ulrich
  • +1 Blast. You have a knack of making my solutions look totally long-winded, roundabout and pedestrian. – Andrie Nov 18 '11 at 17:17
  • 3
    If you have a lot of data and operations that can't be simplified so much, then the 'data.table' package may help you. – Patrick Burns Nov 18 '11 at 18:15

You don't need ddply for this; dcast from reshape2 is sufficient:

dat <- data.frame(
    id = c(rep(1, 4), 2),
    week = c(1:3, 1, 3)
)

library(reshape2)
dcast(dat, id~week, fun.aggregate=length)

  id 1 2 3
1  1 2 1 1
2  2 0 0 1

Edit: For a base R solution (other than table, as posted by Joshua Ulrich), try xtabs:

xtabs(~id+week, data=dat)

   week
id  1 2 3
  1 2 1 1
  2 0 0 1
Andrie

The reason ddply is taking so long is that the splitting by group is not run in parallel (only the computations on each split are), so with a large number of groups it will be slow, and .parallel = TRUE will not help.
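
For completeness, one plyr-based formulation that does finish (a sketch only, and still slower than the data.table approach below) is to count in long format first and then reshape:

library(plyr)
library(reshape2)

# count rows per id/week combination (adds a 'freq' column), then cast wide
counts <- count(data, vars = c("id", "week"))
dcast(counts, id ~ week, value.var = "freq", fill = 0)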

An approach using data.table::dcast (data.table version >= 1.9.2) should be extremely efficient in time and memory. In this case, we can rely on default argument values and simply use:

library(data.table) 
dcast(setDT(data), id ~ week)
# Using 'week' as value column. Use 'value.var' to override
# Aggregate function missing, defaulting to 'length'
#    id 1 2 3
# 1:  1 2 1 1
# 2:  2 0 0 1

Or setting the arguments explicitly:

dcast(setDT(data), id ~ week, value.var = "week", fun = length)
#    id 1 2 3
# 1:  1 2 1 1
# 2:  2 0 0 1

For pre-data.table 1.9.2 alternatives, see edits.

mnel

A tidyverse option could be:

library(dplyr)
library(tidyr)

df %>%
  count(id, week) %>%
  pivot_wider(names_from = week, values_from = n, values_fill = list(n = 0))
  # spread(week, n, fill = 0)  # in older versions of tidyr

#     id   `1`   `2`   `3`
#   <dbl> <dbl> <dbl> <dbl>
#1     1     2     1     1
#2     2     0     0     1

Using only pivot_wider:

tidyr::pivot_wider(df, names_from = week, 
                   values_from = week, values_fn = length, values_fill = 0)

Or using tabyl from janitor:

janitor::tabyl(df, id, week)
# id 1 2 3
#  1 2 1 1
#  2 0 0 1

data

df <- structure(list(id = c(1L, 1L, 1L, 1L, 2L), week = c(1L, 2L, 3L, 
1L, 3L)), class = "data.frame", row.names = c(NA, -5L))
Ronak Shah