Flagging groups in which all members fulfill a certain requirement in R

Question

Suppose the data below:

GroupId <-          c(1,1,1,1,2,2,2,3,3)
IndId <-            c(1,1,2,2,3,4,4,5,5)
IndGroupProperty <- c(1,2,1,2,3,3,4,5,6)
PropertyType <-     c(1,2,1,2,2,2,1,2,2)

df <- data.frame(GroupId, IndId, IndGroupProperty, PropertyType)
df

These are multi-level data, where each group GroupId consists of one or multiple individuals IndId having access to one or more properties IndGroupProperty, which are unique to their respective group (i.e. property 1 belongs to group 1 and no other group). These properties each belong to a type PropertyType.

The task is to flag each row with a dummy variable that equals 1 if every individual in the row's group has access to at least one type-1 property, and 0 otherwise.

For our sample data, this simply is:

ValidGroup <-       c(1,1,1,1,0,0,0,0,0)
df <- data.frame(df, ValidGroup)
df

The first four rows are flagged with a 1, because each individual (1, 2) of group (1) has access to a type-1 property (1). The three subsequent rows belong to group (2), in which only individual (4) has access to a type-1 property (4). Thus these are not flagged (0). The last two rows also receive no flag: group (3) consists only of a single individual (5) with access to two type-2 properties (5, 6).

I have looked into several commands: levels seems to lack group support; getGroups in the nlme package does not like the input of my real data; I guess that there might be something useful in doBy, but summaryBy does not seem to take levels as a function.

Solution EDIT: dplyr solution by Henrik wrapped into a function:

foobar <- function(object, group, ind, type){
  groupvar <- deparse(substitute(group))
  indvar <- deparse(substitute(ind))
  typevar <- deparse(substitute(type))
  eval(substitute(
    object[, c(groupvar, indvar, typevar)] %.%
      group_by(group, ind) %.%
      mutate(type1 = any(type == 1)) %.%
      group_by(group, add = FALSE) %.%
      mutate(ValidGroup = all(type1) * 1) %.%
      select(-type1)
  ))
}

A nicely posed question like this (with reproducible code, and showing what you already looked at) makes me want to help you a lot more than many less-nicely posed questions around here... — Stephan Kolassa, Apr 04 '14 at 08:41
@StephanKolassa, Thank you! I am new to R and have a lot of questions, so I quickly learned how to pose them to get the answer I'm looking for :) — iraserd, Apr 04 '14 at 10:11

Henrik · Accepted Answer · 2014-04-12T11:58:53.060

2

You could also try ave:

# for each individual within group, calculate number of 1s in PropertyType
v1 <- with(df, ave(PropertyType, list(GroupId, IndId), FUN = function(x) sum(x == 1)))

# within each group, check that every individual has at least one type-1 property (v1 > 0).
# The boolean result is coerced to 1 and 0 by ave.
df$ValidGroup <- ave(v1, df$GroupId, FUN = function(x) all(x > 0))

#   GroupId IndId IndGroupProperty PropertyType ValidGroup
# 1       1     1                1            1          1
# 2       1     1                2            2          1
# 3       1     2                1            1          1
# 4       1     2                2            2          1
# 5       2     3                3            2          0
# 6       2     4                3            2          0
# 7       2     4                4            1          0
# 8       3     5                5            2          0
# 9       3     5                6            2          0

Edit: Added a dplyr alternative and benchmarks for data sets of different sizes: the original data, and data that are 10 and 100 times larger than the original.

First wrap up the alternatives in functions:

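fun_ave <- function(df){
  # number of type-1 properties per individual within group
  v1 <- with(df, ave(PropertyType, list(GroupId, IndId), FUN = function(x) sum(x == 1)))
  # a group is valid if every individual has at least one type-1 property
  df$ValidGroup <- ave(v1, list(df$GroupId), FUN = function(x) all(x > 0))
  df
}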

library(dplyr)
fun_dp <- function(df){
  df %.%
    group_by(GroupId, IndId) %.%
    mutate(type1 = any(PropertyType == 1)) %.%
    group_by(GroupId, add = FALSE) %.%
    mutate(ValidGroup = all(type1) * 1) %.%
    select(-type1)
}


fun_by <- function(df){
  bar <- by(data=df,INDICES=df$GroupId,FUN=function(xx){
    foo <- table(xx$IndId,xx$PropertyType)
    if ( !("1" %in% colnames(foo)) ) {
      return(FALSE)   # no PropertyType=1 at all in this group
    } else {
      return(all(foo[,"1"]>0))    # TRUE if every IndId has at least one type-1 entry
    }})
  cbind(df,ValidGroup = as.integer(bar[as.character(df$GroupId)]))
}

Benchmarks

Original data:

library(microbenchmark)
microbenchmark(
  fun_ave(df),
  fun_dp(df),
  fun_by(df))

# Unit: microseconds
#        expr      min        lq    median        uq       max neval
# fun_ave(df)  497.964  519.8215  538.8275  563.5355   651.535   100
#  fun_dp(df)  851.861  870.6765  931.1170  968.5590  1760.360   100
#  fun_by(df) 1343.743 1412.5455 1464.6225 1581.8915 12588.607   100

On a tiny data set, ave is about 1.7 times as fast as dplyr and about 2.7 times as fast as by (comparing medians).

Generate some larger data; 10 times the number of groups and individuals

GroupId <- sample(1:30, 100, replace = TRUE)
IndId <- sample(1:50, 100, replace = TRUE)
PropertyType <- sample(1:2, 100, replace = TRUE)
df2 <- data.frame(GroupId, IndId, PropertyType)

microbenchmark(
  fun_ave(df2),
  fun_dp(df2),
  fun_by(df2))
# Unit: milliseconds
#          expr      min       lq    median        uq       max neval
#  fun_ave(df2) 2.928865 3.185259  3.270978  3.435002  5.151457   100
#   fun_dp(df2) 1.079176 1.231226  1.273610  1.352866  2.717896   100
#   fun_by(df2) 9.464359 9.855317 10.137180 10.484994 12.445680   100

dplyr is now about 2.5 times faster than ave and about 8 times faster than by (comparing medians).

100 times the number of groups and individuals

GroupId <- sample(1:300, 1000, replace = TRUE)
IndId <- sample(1:500, 1000, replace = TRUE)
PropertyType <- sample(1:2, 1000, replace = TRUE)
df2 <- data.frame(GroupId, IndId, PropertyType)

microbenchmark(
  fun_ave(df2),
  fun_dp(df2),
  fun_by(df2))

# Unit: milliseconds
#          expr        min         lq    median        uq      max neval
#  fun_ave(df2) 337.889895 392.983915 413.37554 441.58179 549.5516   100
#   fun_dp(df2)   3.253872   3.477195   3.58173   3.73378  75.8730   100
#   fun_by(df2)  92.248791 102.122733 104.09577 109.99285 186.6829   100

ave is really losing ground now. dplyr is nearly 30 times faster than by, and more than 100 times faster than ave.

edited Apr 12 '14 at 11:58

answered Apr 04 '14 at 10:26

Henrik


While conceptually this should work, using your version with `ave` leads my R to try to allocate 25 GB of RAM, which is way over my 8 GB... :/ – iraserd Apr 04 '14 at 10:48
But it's a nice illustration, +1. I originally tried something similar, but I'm more fluent in `by()` than in `ave()`, so I'll meditate on this solution a bit more to learn. – Stephan Kolassa Apr 04 '14 at 11:18
@iraserd, recently I faced a similar setting with my own data and I needed a solution faster than `ave` and `by`, so I tried `dplyr`. I have updated my answer with a `dplyr` alternative and benchmarks for data of different sizes. `dplyr` is considerably faster than both `ave` and `by` when data sets are getting larger. – Henrik Apr 12 '14 at 11:47
@Henrik alright, it works and indeed produces the same result as the `by` version in only 0.68 seconds, while `by` takes 9.61 minutes! However, some caveats: first, your `dplyr` version behaves oddly with other variables in larger data frames (I got an error for some dummy variable). This is easily circumvented by changing line 2 of your function definition to `df[, c("GroupId", "IndId", "PropertyType")] %.%` and then appending the result column to the original data frame. Second, I cannot manage to wrap `dplyr` into a function in the same format as Stephan Kolassa above. Ideas? – iraserd Apr 14 '14 at 10:30
@Henrik - details of my function wrapping: `Error in mutate_impl(.data, named_dots(...), environment()) : attempt to use zero-length variable name`. Code is: `foobar <- function(object, group, ind, type){ object[, c(group, ind, type)] %.% group_by(object[,group], object[,ind]) %.% mutate( type1 = any(object[,type] == 1)) %.% group_by(object[,group], add = FALSE) %.% mutate( ValidGroup = all(type1) * 1) %.% select(-type1) } ` – iraserd Apr 14 '14 at 10:38
I might have ideas, but it is hard to tell without a reproducible example. My function worked for the data you provided in OP. You need to be clearer than "behaves oddly" and "got an error". Regarding benchmarks, please note `"Unit: milliseconds"`. – Henrik Apr 14 '14 at 10:39
Regarding the benchmark units, that was a silly misconception on my behalf. Regarding proper formulation of the behaviour for my real dataset: `Error in eval(expr, envir, enclos) : column 'DateDiff' has unsupported type` in `traceback()` line nr 27 `27: stop(list(message = "column 'DateDiff' has unsupported type", call = eval(expr, envir, enclos), cppstack = NULL))` Where `DateDiff` is a variable of class "difftime" generated by subtracting two class "Date" variables from each other (which is obviously not a dummy - another mistake on my behalf). – iraserd Apr 14 '14 at 10:48
For "wrap dply into a function in the same format as Stephan Kolassa above", these posts may be relevant: [**here**](http://stackoverflow.com/questions/22005419/dplyr-without-hard-coding-the-variable-names) and [**here**](https://groups.google.com/forum/#!topic/manipulatr/cr9PzNEtz6w) (with answer by dplyr-Hadley) – Henrik Apr 14 '14 at 10:51
So you don't use the `DateDiff` variable in the actual `dplyr` call, it is 'just' an additional variable? – Henrik Apr 14 '14 at 10:57
I'm having a look at your links to see if I can construct the appropriate call. Correct, the `DateDiff` variable is just a covariate in the data frame (which has about 50 variables in total) and no action should be performed on it. – iraserd Apr 14 '14 at 11:02
I added a variable of `Class 'difftime'` to your toy data and got the same error. One possible work-around is to convert the `difftime` variable to `numeric`, something like `as.numeric(difftime(df$date2, df$date1, units = "days"))`. – Henrik Apr 14 '14 at 11:42
That would work, or selecting only the needed columns and cbinding the column after manipulation, which needs no manipulation of the other variables. However, I still can't manage to wrap the thing into a function. Trying to use substitute but I think I don't understand how the variables I define in `function` are handled later by `substitute(mutate())`. I imagine that by calling `foobar`, the arguments get substituted by the strings and object supplied for the arguments (`foobar(object, string, string, string)`). But I think I'm completely wrong here... – iraserd Apr 14 '14 at 11:50
`foobar <- function(object, group, ind, type){ object[, c(group, ind, type)] %.% group_by(object[,group], object[,ind]) %.% substitute(mutate(type1 = any(typevar == 1)), list(typevar = as.name(type))) %.% group_by(object[,group], add = FALSE) %.% mutate( ValidGroup = all(type1) * 1) %.% select(-type1) }` Where I added the `substitute()` – iraserd Apr 14 '14 at 11:51
I haven't used the methods described in the links I provided, so I'm afraid I can't help you much further with this particular issue. Please feel free to post a new question with a self-contained minimal example together with the code you have tried so far, posting explicit error messages, possibly referring to relevant earlier posts. Good luck! – Henrik Apr 14 '14 at 13:32
@Henrik - with some help [link](http://stackoverflow.com/questions/23062314/using-paste-and-substitute-in-combination-with-quotation-marks-in-r) I finally managed to wrap it into a function. I will edit it into the OP for later reference. – iraserd Apr 14 '14 at 14:26
@iraserd, FYI: I submitted an [issue on `dplyr` github on the problem with the `difftime` variable](https://github.com/hadley/dplyr/issues/390) – Henrik Apr 15 '14 at 14:04
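
The two workarounds for the `difftime` problem discussed in the comments above can be sketched as follows. This is only an illustration: the columns `date1`, `date2`, `DateDiff` and `DateDiffDays` are hypothetical stand-ins for the real data, and the grouped computation itself is left out.

```r
# Toy data with a hypothetical difftime covariate
df <- data.frame(
  GroupId      = c(1, 1, 2),
  IndId        = c(1, 2, 3),
  PropertyType = c(1, 1, 2),
  date1        = as.Date(c("2014-04-01", "2014-04-02", "2014-04-03")),
  date2        = as.Date(c("2014-04-04", "2014-04-04", "2014-04-04"))
)
df$DateDiff <- df$date2 - df$date1   # subtracting Dates gives class "difftime"

# Workaround 1: store the difference as plain numbers (days)
df$DateDiffDays <- as.numeric(difftime(df$date2, df$date1, units = "days"))

# Workaround 2: pass only the columns the grouped computation needs;
# the result column can then be bound back onto the full data frame
needed <- df[, c("GroupId", "IndId", "PropertyType")]

print(class(df$DateDiff))
print(class(df$DateDiffDays))
```

The second variant leaves the other covariates untouched, which may be preferable when the data frame has many columns of mixed types.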

Stephan Kolassa · Answer 2 · 2014-04-04T11:16:27.083

1

Try this:

bar <- by(data=df,INDICES=df$GroupId,FUN=function(xx){
    foo <- table(xx$IndId,xx$PropertyType)
    if ( !("1" %in% colnames(foo)) ) {
        return(FALSE)   # no PropertyType=1 at all in this group
    } else {
        return(all(foo[,"1"]>0))    # TRUE if every IndId has at least one type-1 entry
    }})
cbind(df,bar[as.character(df$GroupId)])

The key is using by() to apply a function by a grouping variable, here your df$GroupId. The function to apply is an anonymous function. For each chunk (defined by the grouping variable), it creates a table of the IndId and PropertyType entries. It then checks whether "1" appears at all in the PropertyType. If not, it returns FALSE; if so, it checks whether every IndId has at least one "1" entry (i.e., whether all entries in the "1" column of the table are >0).
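
To make the intermediate step concrete, here is a minimal sketch of the table that the anonymous function builds for group 2 of the sample data. The object names `g2` and `group2_valid` are only for illustration.

```r
# Sample data from the question, restricted to the columns used here
df <- data.frame(
  GroupId      = c(1, 1, 1, 1, 2, 2, 2, 3, 3),
  IndId        = c(1, 1, 2, 2, 3, 4, 4, 5, 5),
  PropertyType = c(1, 2, 1, 2, 2, 2, 1, 2, 2)
)

# The chunk that by() would hand to the function for group 2
g2 <- df[df$GroupId == 2, ]

# Rows are individuals, columns are property types, cells are counts
foo <- table(g2$IndId, g2$PropertyType)
print(foo)

# Individual 3 has a zero in the "1" column, so the group is not valid
group2_valid <- all(foo[, "1"] > 0)
print(group2_valid)
```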

We store the result of the by() call in a structure bar, which is named according to the levels in the grouping variable. This in turn allows us to roll the result back out to the original data.frame. Note how I am using as.character() here to make sure the integers are interpreted as entry names, not entry numbers. Bad Things often happen when things have names that can be interpreted as numbers.
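
The difference between indexing by name and indexing by position can be seen in a small sketch. The vector `res` is a hypothetical stand-in for the result of by(), with group labels chosen so that names and positions disagree.

```r
# Hypothetical per-group results, named by group label (10, 20, 30)
res <- c("10" = TRUE, "20" = FALSE, "30" = TRUE)
group_ids <- c(10, 10, 20, 30)

# Character indices look up the entries called "10", "20", "30"
by_name <- unname(res[as.character(group_ids)])

# Numeric indices look up the 10th, 20th and 30th element, which do not exist
by_position <- unname(res[group_ids])

print(by_name)
print(by_position)
```

With group labels 1, 2, 3 both forms happen to agree, which is why the problem tends to show up only later on real data.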

If you really want a 0-1 result instead of TRUE-FALSE, just add an as.numeric().

EDIT. Let's turn this into a function.

foobar <- function(object, group, ind, type) {
    bar <- by(data=object,INDICES=object[,group],FUN=function(xx){
        foo <- table(xx[,ind],xx[,type])
        if ( !("1" %in% colnames(foo)) ) {
            return(FALSE)   # no PropertyType=1 at all in this group
        } else {
            return(all(foo[,"1"]>0))    # TRUE if every IndId has at least one type-1 entry
        }})
    cbind(object,bar[as.character(object[,group])])
}

foobar(df,"GroupId","IndId","PropertyType")

This still requires that the target be exactly "1", but of course this could also be included in the function definition as a parameter. Just be sure to keep column names and variables that contain column names straight.
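
A minimal sketch of that extension, assuming a hypothetical extra argument `target` (default 1), a hypothetical function name `foobar_target`, and an output column named `ValidGroup`; none of these are part of the function above.

```r
df <- data.frame(
  GroupId      = c(1, 1, 1, 1, 2, 2, 2, 3, 3),
  IndId        = c(1, 1, 2, 2, 3, 4, 4, 5, 5),
  PropertyType = c(1, 2, 1, 2, 2, 2, 1, 2, 2)
)

# group, ind and type are column names given as strings;
# target is the property type every individual must have at least once
foobar_target <- function(object, group, ind, type, target = 1) {
  col <- as.character(target)   # table() labels its columns with strings
  bar <- by(data = object, INDICES = object[, group], FUN = function(xx) {
    foo <- table(xx[, ind], xx[, type])
    if (!(col %in% colnames(foo))) {
      return(FALSE)             # target type absent from this group
    }
    all(foo[, col] > 0)         # every individual has the target type
  })
  # roll the per-group result back out to the rows, as 0/1
  flag <- as.integer(bar[as.character(object[, group])])
  cbind(object, ValidGroup = flag)
}

res1 <- foobar_target(df, "GroupId", "IndId", "PropertyType")
res2 <- foobar_target(df, "GroupId", "IndId", "PropertyType", target = 2)
print(res1)
print(res2)
```

Here `target` is a value, while `group`, `ind` and `type` are column names, which is the distinction the paragraph above warns about.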

edited Apr 04 '14 at 11:16

answered Apr 04 '14 at 08:40

Stephan Kolassa


Bonus question: how do I wrap this into a new function that takes as input the dataframe, GroupID, IndID and PropertyType? `foobar <- function(object,group,ind,type) { ... }` and in the curly braces replace the respective values in the function? When I then call `foobar(df, GroupId, IndId, PropertyType)` I get `Error in by.data.frame(data = object, INDICES = object$group, FUN = function(xx) { : 'names' attribute [1] must be the same length as the vector [0]` – iraserd Apr 04 '14 at 10:16
`foobar <- function(object, group, ind, type) { bar <- by(data=object,INDICES=object$group,FUN=function(xx){ foo <- table(xx$ind,xx$type) if ( !("1" %in% colnames(foo)) ) { return(FALSE) # no type=1 at all in this group } else { return(all(foo[,"1"]>0)) # return whether all ind have an 1 entry }}) cbind(object,bar[as.character(object$group)]) } foobar(df, GroupId, IndId, PropertyType) ` – iraserd Apr 04 '14 at 10:22
I edited my answer to turn it into a function. Your homework will be to do the same with @Henrik's `ave`-based approach ;-) – Stephan Kolassa Apr 04 '14 at 11:17
Great, would give another +1! But what the h*, why do I suddenly need `[,]` instead of `$`! Homework done: `foobar2 <- function(object, group, ind, type) { v1 <- with(object, ave(object[,type], list(object[,group],object[,ind]), FUN = function(x) sum(x ==1 ))) object$ValidGroup <- ave(v1, object[,group], FUN = function(x) sum(x == 0) == 0) } foobar2(df, "GroupId", "IndId", "PropertyType")` – iraserd Apr 04 '14 at 11:28
Writing R functions can be really confusing to me as I come from a Stata background where in functions I simply refer to the arguments of the function via local macros during the call, where the macros are literally replaced by the arguments provided... – iraserd Apr 04 '14 at 11:30
`foo$bar` accesses the component (column of a `data.frame` or entry in a `list`) called "bar" in `foo`. `foo[,bar]` accesses the column(s) in `foo` whose name(s) match *the value* of the variable `bar`. Confused now? ;-) – Stephan Kolassa Apr 04 '14 at 11:31
And in defining the `foobar <- function(foo, bar)` and calling it with `foobar(df, "GroupId")`, "GroupId" becomes _the value_ of `bar` and thus needs to be referred to with `foo[,bar]`. – iraserd Apr 04 '14 at 11:41
What is strange is that I simply added `object <- cbind ...` to automatically bind the column at the end of the function, but when I check after the function runs, the column is not there. – iraserd Apr 04 '14 at 12:06
Do you return `object` at the end of the function? Either by simply writing `object` on a line by itself, or using the (in my eyes, clearer) `return(object)`? – Stephan Kolassa Apr 04 '14 at 12:07
Tried that now. When I include `return(object)` in the function as the last line before `}`, it prints the data frame including the new column. When I then type `df` to view the data frame, the column is no longer there. – iraserd Apr 04 '14 at 12:10
... `object <- cbind(object,bar[as.numeric(object[,group])]) return(object) } foobar(df, "GroupId", "IndId", "PropertyType") df ` – iraserd Apr 04 '14 at 12:10
R calls by value, not by reference. Your function won't modify the *global* variable `df`, only the local variable `object`. Do `df <- foobar(df, ...)` (although I would rather assign the result to a new object; overwriting objects with function results makes debugging hard). – Stephan Kolassa Apr 04 '14 at 12:13
So simple but so hard to see - something tells me I should have taken some courses in computer science ... :) – iraserd Apr 04 '14 at 12:17
changed correct answer to Henrik's because of increased efficiency - you might want to have a look! – iraserd Apr 14 '14 at 10:33
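
Two points from the comment thread above can be illustrated with a short sketch: `$` versus `[, ]` inside a function, and the fact that a function works on a copy of its argument. The function `add_flag` and the column `Flag` are hypothetical and only serve as an example.

```r
df <- data.frame(GroupId = c(1, 1, 2), IndId = c(1, 2, 3))

# group holds a column name as a string
add_flag <- function(object, group) {
  by_dollar  <- object$group      # looks for a column literally named "group"
  by_bracket <- object[, group]   # looks for the column named by the value of group
  stopifnot(is.null(by_dollar), length(by_bracket) == nrow(object))

  object$Flag <- as.integer(by_bracket == 1)  # changes the local copy only
  object                                      # return the modified copy
}

res <- add_flag(df, "GroupId")

print(names(df))    # the data frame in the calling environment is unchanged
print(names(res))   # the returned copy carries the extra column
```

Assigning the return value, as in `res <- add_flag(df, "GroupId")`, is what makes the new column available outside the function.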
