For a sample dataframe:

df <- structure(list(id = 1:19, region.1 = structure(c(1L, 1L, 1L, 
                                                       1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 5L, 5L, 5L
), .Label = c("AT1", "AT2", "AT3", "AT4", "AT5"), class = "factor"), 
PoorHealth = c(0L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 
               0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L)), .Names = c("id", "region.1", 
                                                            "PoorHealth"), class = "data.frame", row.names = c(NA, -19L))

I want to subset this dataframe by region using the `by` command, and hope somebody can help me.

I want to INCLUDE regions (`region.1`) in df that satisfy this condition:

  1. Less than (or equal to) 3 occurrences of '1' in the variable 'PoorHealth'

OR this condition:

  1. Where N (i.e. the respondents in each region) is less than or equal to 6.

If anyone has any ideas to help me, I should be very grateful.

KT_1
  • What did you try so far? – nrussell Feb 02 '16 at 14:08
  • `aggregate(cbind(PoorHealth, resp=1) ~ region.1, FUN=sum, data=df)` and then subsetting. http://stackoverflow.com/questions/3505701/r-grouping-functions-sapply-vs-lapply-vs-apply-vs-tapply-vs-by-vs-aggrega – jogo Feb 02 '16 at 14:16
  • `library(data.table);df[!df$region.1 %in% setDT(df)[,.(.N<=6 || sum(PoorHealth)<=3), by = region.1][,region.1],]` – Colonel Beauvel Feb 02 '16 at 14:20
  • I think none of the regions in your dataset have (1) more than 3 PoorHealth == 1 and (2) more than 6 obs. – Pekka Feb 02 '16 at 14:20
  • @Colonel Beauvel - 1 and 2 are OR, not AND (i.e. AT1 is kept as it has >=3 1s, AT2 is kept as N is >6, AT3 is deleted as N=3 and there are no 1s, AT4 and AT5 are deleted as N is <6 and there are not enough 1s). – KT_1 Feb 02 '16 at 14:32
  • @KT_1, my condition is OR, as you stated, and respecting the non strict inequality. Or there is a typo in what you state in your question. – Colonel Beauvel Feb 02 '16 at 14:38
  • @KT_1, Colonel is right. You have to fix your logic. Maybe easier to formulate it like "include regions which have THIS OR THIS condition" if that's what you want. – Pekka Feb 02 '16 at 14:52
  • @KT_1 I suggest that you still review the condition because it conflicts with your comments above. – Pekka Feb 02 '16 at 15:12
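
For reference, the `aggregate` suggestion from the comments can be sketched in base R as below. This is only a sketch, assuming the INCLUDE / OR reading of the two conditions; `counts`, `keep` and `result` are illustrative names that do not appear in the question, and `df` is rebuilt in compact form so the snippet runs on its own.

```r
# Compact rebuild of the sample data from the question
df <- data.frame(
  id = 1:19,
  region.1 = factor(rep(c("AT1", "AT2", "AT3", "AT4", "AT5"),
                        times = c(4, 7, 3, 2, 3))),
  PoorHealth = c(0L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 1L,
                 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L)
)

# One row per region: PoorHealth = number of 1s, resp = number of respondents
counts <- aggregate(cbind(PoorHealth, resp = 1) ~ region.1, FUN = sum, data = df)

# Regions with at most 3 ones OR at most 6 respondents
keep <- counts$region.1[counts$PoorHealth <= 3 | counts$resp <= 6]

# Keep only the rows that belong to those regions
result <- df[df$region.1 %in% keep, ]
```

With the sample data every region meets at least one of the two conditions, so `result` contains all 19 rows.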

1 Answer

This should work. I don't know if there is a cleaner way:

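library(data.table)

# Convert df to a data.table by reference (no copy is made)
setDT(df)

# which() returns an empty result for regions where the condition is FALSE,
# so only the regions that satisfy it remain in the grouped output
qualified_regions <- df[, which(sum(PoorHealth == 1) <= 3 | .N <= 6), by = region.1][, region.1]
# Keep the rows that belong to a qualifying region
df[region.1 %in% qualified_regions, ]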

Edit: I removed the `!` because the OP changed "EXCLUDE" to "INCLUDE" in the original question.
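
If data.table is not available, the same filter can be sketched in base R with `ave()`. This is an alternative under the same INCLUDE / OR reading, not the method used above; `keep_regions`, `max_poor` and `max_n` are hypothetical names introduced only for this example.

```r
# Compact rebuild of the sample data from the question
df <- data.frame(
  id = 1:19,
  region.1 = factor(rep(c("AT1", "AT2", "AT3", "AT4", "AT5"),
                        times = c(4, 7, 3, 2, 3))),
  PoorHealth = c(0L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 1L,
                 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L)
)

# Hypothetical helper: keep rows whose region has at most `max_poor` ones
# in PoorHealth OR at most `max_n` respondents
keep_regions <- function(data, max_poor = 3, max_n = 6) {
  # ave() repeats each per-region summary on every row of that region
  n_poor <- ave(data$PoorHealth, data$region.1, FUN = sum)
  n_resp <- ave(data$PoorHealth, data$region.1, FUN = length)
  data[n_poor <= max_poor | n_resp <= max_n, ]
}

result <- keep_regions(df)                            # thresholds from the question
looser <- keep_regions(df, max_poor = 1, max_n = 3)   # looser settings, as in the comments
```

With the thresholds from the question every region qualifies, which makes it hard to tell a working filter from a broken one; the looser call keeps only AT3, AT4 and AT5.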

Pekka
  • what is the interest of copy/pasting my answer in comment? – Colonel Beauvel Feb 02 '16 at 14:37
  • @ColonelBeauvel I didn't copy it. I think your answer in the comment is not working. Could you compare it and my answer with some looser settings, for example `!(sum(PoorHealth==1) <=1 | .N <= 3)`? My answer returns rows but your comment doesn't. If you feel your answer was exactly the same apart from a small typo, then please fix it and make it a proper answer, and I will delete mine. – Pekka Feb 02 '16 at 14:47
  • have you at least tried your answer and mine? Apparently not, since "I think it's not working" is very subjective ... you should realize the two answers give the same result: an empty dataframe. So it's better to find out from the OP whether this is what he really wants instead of posting kind of duplicate answers. – Colonel Beauvel Feb 02 '16 at 14:51
  • @ColonelBeauvel I tried it but it is definitely not working because it doesn't return anything even in looser setting. Please refer to my comment above. It also looks different to mine. – Pekka Feb 02 '16 at 14:54
  • @ColonelBeauvel Problem with your answer is that this statement `setDT(df)[,.(.N<=6 || sum(PoorHealth)<=3), by = region.1]` returns a data.table where there is a row for each region and then TRUE/FALSE column. When you proceed to take `[,region.1]` column from that data.table you get all the regions no matter if the condition is fulfilled or not. This is not intended here. Your answer is never returning any rows no matter what is the condition. I am sure that this is not intended here. – Pekka Feb 02 '16 at 15:55
  • @Laterow phew.. well thanks for seeing it through. I was questioning my sanity already – Pekka Feb 02 '16 at 16:08
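
The difference discussed in these comments can be checked with a small sketch. It assumes data.table is installed, uses the looser thresholds proposed above (at most 1 one, at most 3 respondents), and the names `dt`, `flags` and `keep` are made up for illustration.

```r
library(data.table)

# Compact rebuild of the sample data from the question
dt <- data.table(
  id = 1:19,
  region.1 = factor(rep(c("AT1", "AT2", "AT3", "AT4", "AT5"),
                        times = c(4, 7, 3, 2, 3))),
  PoorHealth = c(0L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 1L,
                 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L)
)

# Grouped expression that returns a logical: one row per region,
# whether or not the condition holds for that region
flags <- dt[, .(keep = .N <= 3 || sum(PoorHealth) <= 1), by = region.1]

# Taking the region column directly ignores the flag: all five regions come back
all_regions <- flags[, region.1]

# Filtering on the flag first gives only the regions that satisfy the condition
qualified <- flags[keep == TRUE, region.1]
```

Here `all_regions` lists every region while `qualified` lists only AT3, AT4 and AT5, which is why negating `%in%` against the unfiltered column drops every row regardless of the thresholds.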