R data.table group by where

Question

I have the following in an R data.table:

  id | status
=============
   1 |      A
   1 |      B
   2 |      A
   2 |      B
   3 |      A
   4 |      A
   5 |      B

I want to show only the rows grouped by id that have an A, but not a B. So a result would be something like this:

  id | status
=============
   3 |      A
   4 |      A

So far I have this:

dt[, sum(status == "A") > 0 && sum(status == "B") == 0, by = id]

which gets me:

  id | status
=============
   1 |  FALSE
   2 |  FALSE
   3 |   TRUE
   4 |   TRUE
   5 |  FALSE

which I think is on the right track, but I don't know how to just get the rows that I want. Am I on the right track, or am I completely thinking about it the wrong way?

SymbolixAU · Answer 1 · 2017-12-12T04:30:32.780

3

In this (I assume simplified) example you can filter the rows that have a B, then anti-join those back onto the original data

dt[ !dt[ status == "B" ], on = "id" ]

#    id status
# 1:  3      A
# 2:  4      A

This works because

dt[ status == "B"]

gives the 'id's that have a B. These are the ones you want to exclude from your result set, which can be achieved through the anti-join (the dt[ !dt[ ] on = "" ] notation)

If your data is more complex, you can add some extra filters into this logic to first filter the values you want, e.g.

dt[status == "A"][ !dt[ status == "B" ], on = "id" ]

will ensure your returned result will include the IDs that have an A and not a B

edited Dec 12 '17 at 04:30

answered Dec 12 '17 at 04:21

SymbolixAU

25,502
4
67
139

For the last point, to get ids with an A, might need a self semi join https://stackoverflow.com/questions/18969420/perform-a-semi-join-with-data-table like `dt[dt[status=="A", .(id)], on="id"]` or something. – Frank Dec 12 '17 at 17:00
@Frank that won't exclude those `A` ids that also have a `B` value ? – SymbolixAU Dec 12 '17 at 21:17
Yes, I meant just the first part of the `dt[...][...]` chain. So with `DT = data.table(id = 1, status = LETTERS[c(1,3)])`, I'd want to see all rows for the `id`, not just the first one, so `DT[.(DT[status == "A", unique(id)]), on="id"][!.(DT[status == "B", unique(id)]), on="id"]` not `DT[status == "A"][!.(DT[status == "B", unique(id)]), on="id"]` – Frank Dec 12 '17 at 21:28

R data.table group by where

1 Answers1