Multiple uses of setdiff() on consecutive groups without for looping

Question

I would like to apply setdiff() between consecutive groups without a for loop, ideally with data.table or a function from the apply family.

Data frame df:

   id group
1  L1     1
2  L2     1
3  L1     2
4  L3     2
5  L4     2
6  L3     3
7  L5     3
8  L6     3
9  L1     4
10 L4     4
11 L2     5

I want to know how many new ids there are between consecutive groups. For example, comparing groups 1 and 2, there are two new ids, L3 and L4, so it returns 2 (using length() on the setdiff() result rather than setdiff() alone). Comparing groups 2 and 3, L5 and L6 are the new ids, so it returns 2, and so on.

Expected results :

new_id
  2
  2
  2
  1

Data :

df <- structure(list(id = structure(c(1L, 2L, 1L, 3L, 4L, 3L, 5L, 6L, 
1L, 4L, 2L), .Label = c("L1", "L2", "L3", "L4", "L5", "L6"), class = "factor"), 
    group = c(1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5)), class = "data.frame", row.names = c(NA, 
-11L), .Names = c("id", "group"))

You could, also, build something off of [this post](http://stackoverflow.com/questions/19891278/r-table-of-interactions-case-with-pets-and-houses) -- e.g. `tab = table(df) > 0; (colSums(tab) - crossprod(tab))[cbind(2:5, 1:4)]` (and adjust the hardcoding in subsetting accordingly) — alexis_laz, Apr 06 '17 at 16:09
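
The hardcoded `cbind(2:5, 1:4)` in that comment can be derived from the number of groups. A minimal sketch of the same cross-product idea, assuming the question's data rebuilt with character ids; the names `m`, `n`, `not_shared`, and `new_counts` are illustrative and not from the comment:

```r
df <- data.frame(
  id    = c("L1", "L2", "L1", "L3", "L4", "L3", "L5", "L6", "L1", "L4", "L2"),
  group = c(1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5),
  stringsAsFactors = FALSE
)

# 0/1 incidence matrix: one row per id, one column per group
m <- 1 * (unclass(table(df$id, df$group)) > 0)
n <- ncol(m)

# crossprod(m)[i, j] counts the ids shared by groups i and j, so
# not_shared[i, j] counts the ids of group i that are absent from group j
not_shared <- colSums(m) - crossprod(m)

# Pick the entries [i, i - 1]: ids of each group that are new versus the previous one
new_counts <- not_shared[cbind(seq_len(n - 1) + 1, seq_len(n - 1))]
new_counts
```

This relies on `table()` ordering the group columns in increasing order, so that neighbouring columns are consecutive groups.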

score 3 · Accepted Answer · answered Apr 06 '17 at 15:00

Here is an option with mapply:

lst <- with(df, split(id, group))   
mapply(function(x, y) length(setdiff(y, x)), head(lst, -1), tail(lst, -1))

#1 2 3 4 
#2 2 2 1
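
To see why this works: `head(lst, -1)` drops the last group and `tail(lst, -1)` drops the first, so `mapply` walks the two lists in step and each call sees a group together with its successor. A self-contained sketch, assuming the question's data rebuilt with character ids (the names `prev`, `curr`, and `count_new` are only for illustration):

```r
df <- data.frame(
  id    = c("L1", "L2", "L1", "L3", "L4", "L3", "L5", "L6", "L1", "L4", "L2"),
  group = c(1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5),
  stringsAsFactors = FALSE
)

lst  <- split(df$id, df$group)
prev <- head(lst, -1)   # groups 1..4
curr <- tail(lst, -1)   # groups 2..5

# For each (previous, current) pair, count ids present only in the current group
count_new <- mapply(function(p, cur) length(setdiff(cur, p)), prev, curr)

# Label each count with the pair of groups it compares, e.g. "2-1"
names(count_new) <- paste(names(curr), names(prev), sep = "-")
count_new
```

Without the relabelling step, `mapply` takes its names from the first list, which is why the output above is labelled 1 to 4 rather than 2 to 5.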

Psidom


mt1022 · Answer 2 · 2017-04-06T15:23:22.173

score 2

Here is a data.table way with merge. Suppose the original data.frame is named dt:

library(data.table)

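setDT(dt)
# Shift every row forward by one group, so group g in dt2 holds the ids of group g - 1
dt2 <- copy(dt)[, group := group + 1]

# After the merge, id.x is the current group's id and id.y the previous group's
merge(
    dt, dt2, by = 'group', allow.cartesian = TRUE
)[, .(n = length(setdiff(id.x, id.y))), by = group]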

#    group n
# 1:     2 2
# 2:     3 2
# 3:     4 2
# 4:     5 1

edited Apr 06 '17 at 15:23

answered Apr 06 '17 at 15:15

mt1022


This can be simplified using an anti join and .N instead of length-setdiff: `d[!.(group = group + 1L, id = id), on=.(group, id), .N, by=group]` – Frank Apr 06 '17 at 15:46

@Frank, using a not-join is much more concise than my answer. I always learn a lot from your answers related to data.table. – mt1022 Apr 06 '17 at 15:57
@Frank Could you explain this part of the code a bit: `[!.(group = group + 1L, id = id), on=.(group, id)]`? – Mbr Mbr Apr 07 '17 at 08:13
@MbrMbr It is an anti-join against the table containing tuples of the form group+1, id, where group and id are drawn from the table itself. There is not a vignette on joins yet, but my notes here may help with the intuition: http://franknarf1.github.io/r-tutorial/_book/tables.html#dt-joins – Frank Apr 07 '17 at 13:06
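
The anti-join in these comments can be unpacked into separate steps. The following is a sketch under assumptions rather than Frank's exact code: it rebuilds the question's data with character ids, assumes each id appears at most once per group, and uses the illustrative names `prev`, `new_rows`, and `new_counts`:

```r
library(data.table)

d <- data.table(
  id    = c("L1", "L2", "L1", "L3", "L4", "L3", "L5", "L6", "L1", "L4", "L2"),
  group = c(1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5)
)

# Shift every (group, id) pair forward by one group: a row (g, id) in prev
# means "id was present in group g - 1"
prev <- d[, .(group = group + 1, id = id)]

# Anti-join: keep the rows of d whose (group, id) pair has no match in prev,
# i.e. ids that were not in the previous group
new_rows <- d[!prev, on = .(group, id)]

# Count per group, skipping the first group, which has no predecessor
new_counts <- new_rows[group > min(d$group), .N, by = group]
new_counts
```

Two things to keep in mind with this approach: without the `group > min(d$group)` filter the first group is reported too, because none of its rows can match, and a group with no new ids is simply missing from the result instead of appearing with a count of 0.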

d.b · Answer 3 · 2017-04-06T14:59:19.357

score 1

L = split(d, d$group) # split the data ('d') by group into a list

# use sapply to walk over consecutive groups and count the ids (column 1) that are new in group i
sapply(2:length(L), function(i)
     setNames(length(setdiff(L[[i]][,1], L[[i-1]][,1])),
     nm = paste(names(L)[i], names(L)[i-1], sep = "-")))
#2-1 3-2 4-3 5-4 
#  2   2   2   1

edited Apr 06 '17 at 14:59

answered Apr 06 '17 at 14:43

d.b


score 1 · Answer 4 · answered Apr 06 '17 at 14:58

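You could use Reduce to run a comparison function on pairwise elements in a list. For example:

xx <- Reduce(function(a, b) {
    # a is the previous step's result (or the first group), b is the current group
    x <- setdiff(b$id, a$id)
    list(id = b$id, new = x, newcount = length(x))
  }, split(df, df$group),
  accumulate = TRUE)[-1]  # drop the first element, which is just group 1 itself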

Then you can get the counts of new elements out with

sapply(xx, '[[', "newcount")

and you can get the new values with

sapply(xx, '[[', "new")
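
For readers unfamiliar with accumulating reductions: `Reduce` with `accumulate = TRUE` returns every intermediate result rather than only the last one, and its first element is the untouched first input, which is why the answer drops it with `[-1]`. A minimal sketch on the question's data, splitting only the ids so that each step carries forward just the current group's ids (the names `ids`, `step`, `steps`, `counts`, and `new_ids` are illustrative):

```r
df <- data.frame(
  id    = c("L1", "L2", "L1", "L3", "L4", "L3", "L5", "L6", "L1", "L4", "L2"),
  group = c(1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5),
  stringsAsFactors = FALSE
)

ids <- split(df$id, df$group)

step <- function(a, b) {
  # 'a' is the first group's ids (a plain vector) on the first call,
  # and the list returned by the previous call afterwards
  prev_ids <- if (is.list(a)) a$id else a
  new <- setdiff(b, prev_ids)
  list(id = b, new = new, newcount = length(new))
}

steps <- Reduce(step, ids, accumulate = TRUE)[-1]

# vapply states the expected type up front; lapply always returns a list
counts  <- vapply(steps, function(s) s$newcount, integer(1))
new_ids <- lapply(steps, `[[`, "new")
counts
```

Using `lapply` for the new ids keeps the result a list regardless of the data, whereas `sapply` returns a list only when the groups contribute different numbers of new ids and a matrix or vector otherwise.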
