subset data based on condition in r

Question

I want to select those household where all the member's age is greater than 20 in r.

 household Members_age
   100          75
   100          74
   100          30
   101          20
   101          50
   101          60
   102          35
   102          40
   102           5

Here two household satisfy the condition. Household 100 and 101. How to do it in r?

what I did is following but it's not working.

sqldf("select household,Members_age from data group by household having Members_age > 20")
household Members_age
   100          75
   102          35

Please suggest. Here is the sample dataset

library(dplyr)
library(sqldf)
data <- data.frame(household = c(100,100,100,101,101,101,102,102,102),
               Members_age = c(75,74,30,20,50,60,35,40,5))

Household 101 *doesn't* match the condition of all members being greater than 20, it matches all members being greater than or equal to 20 — camille, Sep 13 '21 at 15:27

score 2 · Accepted Answer · answered Sep 13 '21 at 14:26

2

You can use ave.

data[ave(data$Members_age, data$household, FUN=min) > 20,]
#  household Members_age
#1       100          75
#2       100          74
#3       100          30

or only the households.

unique(data$household[ave(data$Members_age, data$household, FUN=min) > 20])
#[1] 100

answered Sep 13 '21 at 14:26

GKi

37,245
2
26
48

@ GKi Household 101's member's are also greater than 20. – joy_1379 Sep 13 '21 at 14:33
Household 101 has one member with age=20. I assumed that 20 is not greater than 20. But maybe you want to use `>=` instead of `>`. – GKi Sep 13 '21 at 14:57

score 2 · Answer 2 · answered Sep 13 '21 at 14:57

I understand SQL's HAVING clause, but your request "all member's age is greater than 20" does not match your sqldf output. This is because HAVING is really only looking at the first row for each household, which is why we see 102 (and shouldn't) and we don't see 101 (shouldn't as well).

I suggest to implement your logic, you would change your sqldf code to the following:

sqldf("select household,Members_age from data group by household having min(Members_age) > 20")
#   household Members_age
# 1       100          30

which is effectively the SQL analog of GKi's ave answer.

An alternative:

library(dplyr)
data %>%
  group_by(household) %>%
  filter(all(Members_age > 20)) %>%
  ungroup()
# # A tibble: 3 x 2
#   household Members_age
#       <dbl>       <dbl>
# 1       100          75
# 2       100          74
# 3       100          30

and if you just need one row per household, then add %>% distinct(household) or perhaps %>% distinct(household, .keep_all = TRUE).

But for base R, I think nothing is likely to be better than GKi's use of ave.

subset data based on condition in r

2 Answers2

Linked