tidyr spread function generates sparse matrix when compact vector expected

Question

I'm learning dplyr, having come from plyr, and I want to generate (per group) columns (per interaction) from the output of xtabs.

Short summary: I'm getting

A    B
1    NA
NA   2

when I wanted

A    B
1    2

xtabs data looks like this:

> xtabs(data=data.frame(P=c(F,T,F,T,F),A=c(F,F,T,T,T)))
       A
P       FALSE TRUE
  FALSE     1    2
  TRUE      1    1

Now do() wants its data in data frames, like this:

> xtabs(data=data.frame(P=c(F,T,F,T,F),A=c(F,F,T,T,T))) %>% as.data.frame
      P     A Freq
1 FALSE FALSE    1
2  TRUE FALSE    1
3 FALSE  TRUE    2
4  TRUE  TRUE    1

Now I want a single row output with columns being the interaction of levels. Here's what I'm looking for:

FALSE_FALSE TRUE_TRUE FALSE_TRUE TRUE_FALSE
          1         1          2          1

But instead I get

> xtabs(data=data.frame(P=c(F,T,F,T,F),A=c(F,F,T,T,T))) %>% 
    as.data.frame %>% 
    unite(S,A,P) %>% 
    spread(S,Freq)
  FALSE_FALSE FALSE_TRUE TRUE_FALSE TRUE_TRUE
1           1         NA         NA        NA
2          NA          1         NA        NA
3          NA         NA          2        NA
4          NA         NA         NA         1

I'm clearly misunderstanding something here. I'm looking for the equivalent of reshape2's code here (using magrittr pipes for consistency):

> xtabs(data=data.frame(P=c(F,T,F,T,F),A=c(F,F,T,T,T))) %>% 
    as.data.frame %>% # can be omitted. (safely??)
    melt %>% 
    mutate(S=interaction(P,A),value=value) %>% 
    dcast(NA~S)
Using P, A as id variables
  NA FALSE.FALSE TRUE.FALSE FALSE.TRUE TRUE.TRUE
1 NA           1          1          2         1

(note NA is used here because I don't have a grouping variable in this simplified example)

Update: interestingly, adding a single grouping column seems to fix this. Why does it synthesise a grouping column (presumably from the row names) without my telling it to?

> xtabs(data=data.frame(h="foo",P=c(F,T,F,T,F),A=c(F,F,T,T,T))) %>% 
  as.data.frame %>% 
  unite(S,A,P) %>% 
  spread(S,Freq)
    h FALSE_FALSE FALSE_TRUE TRUE_FALSE TRUE_TRUE
1 foo           1          1          2         1

This seems like a partial solution.

[**This**](https://github.com/hadley/tidyr/issues/41) seems like the same issue. — Henrik, Dec 16 '14 at 10:15
[This](http://stackoverflow.com/q/25960394/937932) is the same issue in reverse, with an explanatory comment by Hadley. As you discovered in your update, both outputs make sense in the right context. When the context is only implicit, `spread()` has to guess. — nacnudus, Dec 16 '14 at 10:53
@nacnudus: Thanks for your helpful pointer. I disagree in this case: I didn't discover that the expanded case makes sense, just that it existed. Where there are NO arguments/columns from which to guess, my expectation is that it will assume that there is a single global identity. Can you explain why this might not be true? — Alex Brown, Dec 16 '14 at 11:00

score 6 · Accepted Answer · edited Mar 31 '15 at 14:46

The key here is that spread doesn't aggregate the data.

Hence, if you hadn't already used xtabs to aggregate first, you would be doing this:

a <- data.frame(P=c(F,T,F,T,F),A=c(F,F,T,T,T), Freq = 1) %>% 
    unite(S,A,P)
a
##             S Freq
## 1 FALSE_FALSE    1
## 2  FALSE_TRUE    1
## 3  TRUE_FALSE    1
## 4   TRUE_TRUE    1
## 5  TRUE_FALSE    1

a %>% spread(S, Freq)
##   FALSE_FALSE FALSE_TRUE TRUE_FALSE TRUE_TRUE
## 1           1         NA         NA        NA
## 2          NA          1         NA        NA
## 3          NA         NA          1        NA
## 4          NA         NA         NA         1
## 5          NA         NA          1        NA

Without aggregation, no other result would make sense: rows 3 and 5 both have S = TRUE_FALSE, so they cannot share a single output row.

This is predictable based on the help file for the fill parameter:

If there isn't a value for every combination of the other variables and the key column, this value will be substituted.

In your case, there aren't any other variables to combine with the key column. Had there been, then...

b <- data.frame(P=c(F,T,F,T,F),A=c(F,F,T,T,T), Freq = 1,
                h = rep(c("foo", "bar"), length.out = 5)) %>% 
    unite(S,A,P)
b
##             S Freq   h
## 1 FALSE_FALSE    1 foo
## 2  FALSE_TRUE    1 bar
## 3  TRUE_FALSE    1 foo
## 4   TRUE_TRUE    1 bar
## 5  TRUE_FALSE    1 foo

b %>% spread(S, Freq)
## Error: Duplicate identifiers for rows (3, 5)

...it would fail, because it can't aggregate rows 3 and 5 (because it isn't designed to).

The tidyr/dplyr way to do it would be group_by and summarize instead of xtabs, because summarize preserves the grouping column, hence spread can tell which observations belong in the same row:

b %>% group_by(h, S) %>%
    summarize(Freq = sum(Freq))
## Source: local data frame [4 x 3]
## Groups: h
## 
##     h           S Freq
## 1 bar  FALSE_TRUE    1
## 2 bar   TRUE_TRUE    1
## 3 foo FALSE_FALSE    1
## 4 foo  TRUE_FALSE    2

b %>% group_by(h, S) %>%
    summarize(Freq = sum(Freq)) %>%
    spread(S, Freq)
## Source: local data frame [2 x 5]
## 
##     h FALSE_FALSE FALSE_TRUE TRUE_FALSE TRUE_TRUE
## 1 bar          NA          1         NA         1
## 2 foo           1         NA          2        NA
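
For the question's original case, where there is no grouping column at all, a constant column can stand in for one, so that spread has something to identify the single output row by. This is a minimal sketch rather than the only way to do it. It assumes a dplyr version in which count() accepts name=, and the column name id is a placeholder of mine, not something tidyr requires:

```r
library(dplyr)
library(tidyr)

counts <- data.frame(P = c(FALSE, TRUE, FALSE, TRUE, FALSE),
                     A = c(FALSE, FALSE, TRUE, TRUE, TRUE)) %>%
  count(A, P, name = "Freq") %>%  # aggregate first: spread will not do it
  unite(S, A, P) %>%              # key column of the form "A_P"
  mutate(id = 1L) %>%             # hypothetical constant column: one global row identity
  spread(S, Freq) %>%
  select(-id)                     # drop the placeholder again

counts
```

The result should be a one-row data frame with one column per A_P combination. The id column exists only to tell spread that all four values belong in the same row.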

But when used in combination with dplyr groups, grouping by the maximal identity set gives an implied aggregation, which IIRC does not in fact operate correctly. — Alex Brown, Jan 17 '15 at 04:04
What is a maximal identity set? I don't think there's any alternative to supplying a dummy grouping variable. You can do it in the original data frame, or you could do `group_by(1)` before `spread` and then ``select(-`1`)`` afterwards. — nacnudus, Jan 17 '15 at 10:38
I mean once all variables other than key and value have been consumed as 'enumerators' in group_by operations. Thanks anyway — Alex Brown, Jan 18 '15 at 01:31
@nacnudus I just wanted to say your comment of `group_by(1)` really helped me. — Alex, Feb 25 '15 at 04:16
