how to subset data frame if all the conditions are met?

Question

i have two message ids say 197, 198 . I want to subset the data frame for those users who have got the messages from these ids. I only want those rows which contains both these message ids.

The data frame is m

I have used the code

a = c(197,198)
n = subset(m$userid,m$mid %in% a)

I also tried

n = m[m$mid == 197 & m$mid == 198]

both of these codes are creating OR output whereas I want AND output.

here is the sample dataframe:

mid userid opened

197 1022    Y
197 1036    N
197 1100    Y
198 1000    Y
198 1022    N
198 1036    Y

I want output as records containing userid for both mid 197 &198

mid userid opened
197 1022    Y
197 1036    N
198 1022    N
198 1036    Y

If `m` is a data.frame you need a comma `m[m$mid == 197 | m$mid == 198, ]` or `subset(m, mid %in% 197:198))` The logical statement of 'mid' having both 197 and 198 at the same row is not possible. Please show a small reproducible example and expected output for others to help you better — akrun, Feb 14 '18 at 07:15
Please use `dput` and share part of `m`. It seems a separate different question and could stand on its own. — MKR, Feb 14 '18 at 07:28
A possible solution using `sqldf` could be: `sqldf("SELECT * FROM m WHERE userid in (SELECT userid FROM m WHERE mid == 197) AND userid in (SELECT userid FROM m WHERE mid == 198)")` — MKR, Feb 14 '18 at 07:40
`df %>% group_by(userid) %>% filter(all(c(197, 198) %in% mid)) ` or in base R `df[ave(df$mid, df$userid, FUN = function(x) all(c(197, 198) %in% x)) == 1, ]` — Ronak Shah, Feb 14 '18 at 09:17

score 1 · Answer 1 · answered Feb 14 '18 at 09:34

1

Using sqldf one solution could be achieved as:

# data
m <- read.table(text = "mid userid opened
197 1022    Y
197 1036    N
197 1100    Y
198 1000    Y
198 1022    N
198 1036    Y", header = T, stringsAsFactors = F)

library(sqldf)

result <- sqldf("SELECT * FROM m 
                WHERE userid in (SELECT userid FROM m WHERE mid == 197) AND
                userid in (SELECT userid FROM m WHERE mid == 198)")

result 

#     mid userid opened
#   1 197   1022      Y
#   2 197   1036      N
#   3 198   1022      N
#   4 198   1036      Y

answered Feb 14 '18 at 09:34

MKR

19,739
4
23
33

1

Can also express using WITH clause: `sqldf("WITH userid197 AS (SELECT userid FROM m WHERE mid == 197), userid198 AS (SELECT userid FROM m WHERE mid == 198) SELECT * FROM m WHERE userid IN userid197 AND userid IN userid198")` – G. Grothendieck Feb 14 '18 at 10:41
How can you scale this approach if there are more than 2 elements in `a`? – Uwe Feb 14 '18 at 11:17

score 1 · Answer 2 · answered Feb 14 '18 at 09:37

Using duplicated :

m[duplicated(m$userid) | duplicated(m$userid,fromLast = T), ]

#   mid userid opened
# 1 197   1022      Y
# 2 197   1036      N
# 5 198   1022      N
# 6 198   1036      Y

With your real data you may need first : m2 <- subset(m,mid %in% a) to make sure you have only mid from a in your table before applying my solution.

score 0 · Answer 3 · edited Jun 20 '20 at 09:12

For the sake of completeness, here are two data.table approaches. Both are capable of handling a of arbitrary length, i.e., with more than just 2 selected mid.

Join

library(data.table)
setDT(m)[m[mid %in% a][, uniqueN(mid), by = .(userid)][V1 == uniqueN(a)],
         on = "userid"]

   mid userid opened V1
1: 197   1022      Y  2
2: 198   1022      N  2
3: 197   1036      N  2
4: 198   1036      Y  2

The expression

m[mid %in% a][, uniqueN(mid), by = .(userid)][V1 == uniqueN(a)]

   userid V1
1:   1022  2
2:   1036  2

filters m, then counts the number of unique mid by userid and returns those userid which have matches with all entries in a. (Instead of uniqueN(a), length(a) can be used but the former is safer).

Subsetting by row indices

There is an alternative approach which returns the row id's of m which are then used for subsetting:

m[mid %in% a][, .I[uniqueN(mid) == uniqueN(a)], by = .(userid)]

   userid V1
1:   1022  1
2:   1022  5
3:   1036  2
4:   1036  6

m[m[mid %in% a][, .I[uniqueN(mid) == uniqueN(a)], by = .(userid)]$V1]

   mid userid opened
1: 197   1022      Y
2: 198   1022      N
3: 197   1036      N
4: 198   1036      Y

how to subset data frame if all the conditions are met?

3 Answers3

Join

Subsetting by row indices