6

Short version

How to do the operation

df1 %>% spread(groupid, value, fill = 0) %>% gather(groupid, value, one, two)

in a more natural way?

Long version

Given a data frame

df1 <- data.frame(groupid = c("one","one","one","two","two","two", "one"),
                  value = c(3,2,1,2,3,1,22),
                  itemid = c(1:6, 6))

for many itemid and groupid pairs we have a value, for some itemids there are groupids where there is no value. I want to add a default value for those cases. E.g. for the itemid 1 and groupid "two" there is no value, I want to add a row where this gets a default value.

The following tidyr code achieves this, but it feels like a strange way to do it (the default value added here is 0).

df1 %>% spread(groupid, value, fill = 0) %>% gather(groupid, value, one, two)

I am looking for suggestions on how to do this in a more natural way.

Since in some weeks looking at the above code I would likely be confused about its effect I wrote a function wrapping it:

#' Add default values for missing groups
#' 
#' Given data about items where each item is identified by an id, and every
#' item can have a value in every group; add a default value for all groups
#' where an item doesn't have a value yet.
add_default_value <- function(data, id, group, value, default) {
  id = as.character(substitute(id))
  group = as.character(substitute(group))
  value = as.character(substitute(value))
  groups <- unique(as.character(data[[group]]))

  # spread checks that the columns outside of group and value uniquely
  # determine the row.  Here we check that that already is the case within
  # each group using only id.  I.e. there is no repeated (id, group).
  id_group_cts <- data %>% group_by_(id, group) %>% do(data.frame(.ct = nrow(.)))
  if (any(id_group_cts$.ct > 1)) {
    badline <- id_group_cts %>% filter(.ct > 1) %>% top_n(1, .ct)
    stop("There is at least one (", id, ", ", group, ")",
         " combination with two members: (",
         as.character(badline[[id]]), ", ", as.character(badline[[group]]), ")")
  }

  gather_(spread_(data, group, value, fill = default), group, value, groups)
}

Last note: reason for wanting this is, my groups are ordered (week1, week2, ...) and I am looking to have every id have a value in every group so that after sorting the groups per id I can use cumsum to get a weekly running total that is also shown in the weeks where the running total didn't increase.

kasterma
  • 4,259
  • 1
  • 20
  • 27
  • 2
    You could use `left_join` with `expand.grid` and then replace NA with 0. I don't know whether that would be more natural than the `spread/gather` approach `left_join(expand.grid(groupid=unique(df1$groupid), itemid=unique(df1$itemid)), df1)` or `library(data.table); setkey(setDT(df1), groupid, itemid)[CJ(groupid=unique(groupid), itemid=unique(itemid))][is.na(value), value:=0][]` – akrun May 20 '15 at 17:51
  • `expand.grid` was also the solution suggested in this [related post](http://stackoverflow.com/questions/22059308/how-to-expand-a-large-dataframe-in-r) – shadow May 21 '15 at 09:28
  • 2
    This question seems fairly similar to [a question I asked](https://stackoverflow.com/questions/23451372/add-rows-to-grouped-data-with-dplyr) some time ago. The best answer there was to use `xtabs` - for this example: `df1 %>% xtabs(formula = value ~ itemid + groupid) %>% data.frame` would work, or without pipes: `as.data.frame(xtabs(value ~ itemid + groupid, data = df1))` – talat May 21 '15 at 10:53
  • xtabs looks good for this. I'll have to think a little more, but it almost looks like my question is about an implementation of xtabs. – kasterma May 22 '15 at 08:05

2 Answers2

3

There is a new function complete in the development version of tidyr that does this.

df1 %>% complete(itemid, groupid, fill = list(value = 0))
##    itemid groupid value
## 1       1     one     3
## 2       1     two     0
## 3       2     one     2
## 4       2     two     0
## 5       3     one     1
## 6       3     two     0
## 7       4     one     0
## 8       4     two     2
## 9       5     one     0
## 10      5     two     3
## 11      6     one    22
## 12      6     two     1
shadow
  • 21,823
  • 4
  • 63
  • 77
2

One possibility is to use expand from tidyr. This approach is very similar to the expand.grid idea of @akrun (it actually uses expand.grid internally). I used the dplyr package for the housekeeping after joining the expanded data with the original data.

This approach is longer than the spread/gather approach. Personally I find it a bit more clear what is going on. In my (rather small) benchmark test, spread/gather performed slightly better than expand/join.

# expand first
expand(df1, itemid, groupid) %>% 
  # then join back to data
  left_join(df1, by = c("itemid", "groupid")) %>%
  # because there is no fill argument in join
  mutate(value = ifelse(is.na(value), 0, value)) %>%
  # rearange
  arrange(groupid, itemid)
shadow
  • 21,823
  • 4
  • 63
  • 77
  • I did not know about expand, thanks. Unfortunately the expand, left_join, mutate combination still needs a wrapper in a function (like e.g. add_default_value in the question) to be comprehensible to me in a week or two. – kasterma May 22 '15 at 07:54