I have grouped data, ordered within each group, where each row contains a list of values. Within each group, I'd like to count the new list values that each row contributes to the running union of the lists in that group.

Here is an example:

library(dplyr)
content <- list(c("A", "B"), c("A", "B", "C"), c("D", "E"), c("A", "B"), c("A", "B"), c("A", "B", "C"))
id <- c("a", "a", "a", "b", "b", "b")
order <- c(5, 7, 3, 1, 9, 4)
testdf <- data.frame(id, order, cbind(content))
testdf
#   id order content
# 1  a     5    A, B
# 2  a     7 A, B, C
# 3  a     3    D, E
# 4  b     1    A, B
# 5  b     9    A, B
# 6  b     4 A, B, C

My desired output (after sorting by order descending within each group) would be like:

#   id order content cc
# 1  a     7 A, B, C 3
# 2  a     5    A, B 3
# 3  a     3    D, E 5
# 4  b     9    A, B 2
# 5  b     4 A, B, C 3
# 6  b     1    A, B 3

cn (cumulative new) would really be preferable to cc (cumulative count), but the above maps to my attempt below, and cn is easily calculated from cc afterwards. Here is my attempted solution, which doesn't work:

res <- testdf %>% 
  arrange(id, desc(order)) %>% 
  mutate(n=row_number()) %>%
  group_by(id) %>%
  mutate(n1=first(n)) %>%
  rowwise() %>%
  bind_cols(do(.,data.frame(vars=length(unique(unlist(testdf$content[.$n1:.$n])))))) %>%
  data.frame

I actually obtained most of that solution from here: Cumulatively paste (concatenate) values grouped by another variable (thanks akrun). The values generated seem to be correct but they are not associated with the correct rows from the source data frame:

res
#   id order content n n1 vars
# 1  a     7 A, B, C 1  1    2
# 2  a     5    A, B 2  1    3
# 3  a     3    D, E 3  1    5
# 4  b     9    A, B 4  4    2
# 5  b     4 A, B, C 5  4    2
# 6  b     1    A, B 6  4    3

As you can see (looking at the vars column, which is equivalent to cc above), the values are attached to the wrong rows: in group 'a' the first row gets 2 where 3 is expected, and in group 'b' the second row gets 2 where 3 is expected.

Actually, I worked out what is wrong above: testdf$content is (obviously) not ordered the same way as the dplyr'd data frame. Originally I'd had .$content instead of testdf$content, and that had produced even odder output. So I tried doing it in two stages:

res <- testdf %>% 
    arrange(id, desc(order)) %>% 
    mutate(n=row_number()) %>%
    group_by(id) %>%
    mutate(n1=first(n))
res <- res %>% 
    rowwise() %>%
    bind_cols(do(.,data.frame(vars=length(unique(unlist(res$content[.$n1:.$n])))))) %>%
    data.frame

and this produces what I expect:

#   id order content n n1 vars
# 1  a     7 A, B, C 1  1    3
# 2  a     5    A, B 2  1    3
# 3  a     3    D, E 3  1    5
# 4  b     9    A, B 4  4    2
# 5  b     4 A, B, C 5  4    3
# 6  b     1    A, B 6  4    3

So my question now is: is there a better way to refer to the whole dplyr-modified data frame inside the do(), so that content is ordered correctly? I think . is just the current row, isn't it? Being able to do so would save me from having to create the ordered data frame separately before the do().

Many thanks

Tim

  • I'm a bit confused with all the steps, but assuming you have ordered and grouped your data appropriately, you could use `cumsum(!duplicated(unlist(x)))[cumsum(lengths(x))]` to count cumulatively, where `x` is the ordered "content" -- e.g. `list(c("A", "B", "C"), c("A", "B"), c("D", "E"))` for the ordered "content" in group "a" and `list(c("A", "B"), c("A", "B", "C"), c("A", "B"))` in group "b". – alexis_laz Oct 05 '16 at 21:12
  • Thanks for your reply - I had a quick go but I'm not sure where to try that, should it replace the whole `rowwise()` and `bind_cols(do())`? I naively tried `res %>% cumsum(!duplicated(unlist(content)))[cumsum(lengths(content))]` which gave NA's? – Tim Oct 05 '16 at 23:38
  • Following your code, I had something like `testdf %>% arrange(id, desc(order)) %>% group_by(id) %>% mutate(cumsum(!duplicated(unlist(content)))[cumsum(lengths(content))])` in mind – alexis_laz Oct 06 '16 at 08:54
  • Ok thanks - that worked for me although I'm not quite sure how the selection from the list using the cumsum(lengths(content)) works. I think @Psidom's solution is perhaps easier to understand and so I'll accept that as the solution. Thanks again for contributing. – Tim Oct 06 '16 at 23:23
  • By the way - I like your use of methods with "cumulative" naming for this purpose although the double use is a bit confusing. – Tim Oct 06 '16 at 23:29
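
The indexing step asked about in the comments above can be unpacked in base R. This is only a sketch on the ordered "content" of group "a"; the intermediate names `flat`, `running` and `ends` are made up here for illustration and do not appear in the original one-liner.

```r
# Ordered "content" for group "a" (order descending: 7, 5, 3)
x <- list(c("A", "B", "C"), c("A", "B"), c("D", "E"))

# Flatten every row's values into one vector, keeping row order
flat <- unlist(x)                     # "A" "B" "C" "A" "B" "D" "E"

# Running count of distinct values seen so far, one entry per flattened value
running <- cumsum(!duplicated(flat))  # 1 2 3 3 3 4 5

# Position in `flat` where each row's values end
ends <- cumsum(lengths(x))            # 3 5 7

# Reading the running count at each row's last position gives one value per row
cc <- running[ends]                   # 3 3 5
cc
```

So the first `cumsum` counts distinct values and the second one only locates row boundaries. One caveat worth checking on real data: an empty first list element would yield an index of 0, which R silently drops.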

1 Answer

You can use the Reduce function in accumulate mode to build the cumulative set of distinct elements, and then use the lengths function to return the cumulative distinct counts. This avoids the rowwise() operation:

library(dplyr)
testdf %>% 
  arrange(desc(order)) %>% 
  group_by(id) %>% 
  # accumulate = TRUE keeps every intermediate union; lengths() counts each one
  mutate(cc = lengths(Reduce(function(x, y) unique(c(x, y)), content, accumulate = TRUE))) %>% 
  arrange(id)

#Source: local data frame [6 x 4]
#Groups: id [2]

#      id order   content    cc
#  <fctr> <dbl>    <list> <int>
#1      a     7 <chr [3]>     3
#2      a     5 <chr [2]>     3
#3      a     3 <chr [2]>     5
#4      b     9 <chr [2]>     2
#5      b     4 <chr [3]>     3
#6      b     1 <chr [2]>     3
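
To see what the accumulating Reduce produces, here is a small base R sketch on the ordered content of group "a". It also derives the "cumulative new" count (cn) mentioned in the question by differencing cc. The names `seen`, `acc` and `new` are illustrative only, and the sketch assumes the first row's list has no repeated values, because Reduce passes the first element through without calling unique() on it.

```r
# Ordered "content" for group "a" (order descending: 7, 5, 3)
x <- list(c("A", "B", "C"), c("A", "B"), c("D", "E"))

# accumulate = TRUE returns every intermediate union, not just the final one
seen <- Reduce(function(acc, new) unique(c(acc, new)), x, accumulate = TRUE)
# seen[[1]]: "A" "B" "C"
# seen[[2]]: "A" "B" "C"
# seen[[3]]: "A" "B" "C" "D" "E"

cc <- lengths(seen)    # cumulative distinct count per row: 3 3 5
cn <- diff(c(0L, cc))  # new values contributed by each row: 3 0 2
cn
```

Inside a grouped mutate() the same diff() call could be applied to the cc column to get cn per group.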
Psidom
  • Thanks, that's a great solution! Is there a rule of thumb for when rowwise is required vs being able to use a vectorised solution? – Tim Oct 06 '16 at 23:28
  • I am not sure if there is one, but my rule of thumb would be to avoid rowwise operations whenever you can vectorise instead, since rowwise operations are usually expensive. – Psidom Oct 06 '16 at 23:42
  • Can I ask if the order of the first arrange and group_by above is significant? One might think that arrange after group_by would arrange within the groups, but I'm not sure if that works as expected. Thanks. – Tim Oct 11 '16 at 18:55
  • The order of `group_by` and `arrange` shouldn't affect the result, as far as I know. I can't find strong support for this, but you may find this useful: https://blog.rstudio.org/2016/06/27/dplyr-0-5-0/ – Psidom Oct 11 '16 at 19:02
  • Ah, yes interesting, so it used to make a difference but not now as arrange ignores grouping after 0.5.0, perhaps I've seen some old posts on that. Thanks. – Tim Oct 11 '16 at 23:40
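
The grouping behaviour discussed in these comments can be checked directly. The sketch below assumes a reasonably recent dplyr in which arrange() accepts a `.by_group` argument (added after this thread was written); the data frame `df` and the result names are made up for illustration.

```r
library(dplyr)

df <- data.frame(id = c("a", "a", "a", "b", "b", "b"),
                 order = c(5, 7, 3, 1, 9, 4))

# arrange() ignores the grouping by default: rows are sorted across all groups
ignores_groups <- df %>% group_by(id) %>% arrange(desc(order))
ignores_groups$order  # 9 7 5 4 3 1

# .by_group = TRUE sorts by the grouping variable first, then by order within it
within_groups <- df %>% group_by(id) %>% arrange(desc(order), .by_group = TRUE)
within_groups$order   # 7 5 3 9 4 1
```

Either ordering works for the cumulative count, because mutate() on a grouped data frame only sees the rows of one group at a time, in their current relative order.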