1

I would like to re-create this Stata command in R

 by Area Sex Age: keep if (Infected==1) | ((_n<=1*ncases) & (Infected==0))

This is for a matched case control study

My dataframe contains 193 cases and a variable number of controls per group (Area Sex and Age). I am trying to match 1 random control to each case based on a grouping of Area Sex and Age.

ncases is an integer in my dataframe signifying the number of cases in each group (Area Sex Age)

The command line above works fine in Stata.

But, the R code I have written only works for the first group:

dat5 <- subset(dat4,by=list(Area,Sex,Age),(Infected=1 | 
                                        ((seq(dim(dat4)[1]))<=1*ncases & Infected==0)))

This is my dataframe dat4: Infected=1 is a case, infected=0 is a control.

        Area  Sex Age  CensusNo   Animals Infected ncases 
18825   1     1   23   1023224    0       0        1 
18826   1     1   23   1024109    1       0        1 
18827   1     1   23   1024163    0       1        1 
41428   7     2   50   1047107    1       0        1 
41429   7     2   50   1047029    1       0        1 
41430   7     2   50   1046901    1       1        1 
41439   5     1   36   1047037    1       0        2 
41440   5     1   36   1047127    1       0        2 
41441   5     1   36   1047125    1       0        2 
41442   5     1   36   1047005    1       0        2 
41443   5     1   36   1046994    0       1        2 
41444   5     1   36   1046972    0       1        2 
Nick Cox
  • 35,529
  • 6
  • 31
  • 47
  • 4
    Welcome to Stack Overflow. Have you tried writing any R code yet? If you tried and did not succeed, that's fine but it's good to show us what you have already written. Also try editing this post to include the output of `str(yourdataframe)` or `head(yourdataframe)` for the benefit of those trying to help. [This](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) post is a useful resource for neophyte questioners on Stack Overflow. – SlowLearner Apr 03 '13 at 13:50
  • If `Infected == 0` is part of the answer, `Infected == 1` should be another part. – Nick Cox Apr 03 '13 at 16:12

2 Answers2

1

There is no by parameter in the subset function. I would construct an index vector that would be TRUE for the cases and for a sample of the controls within each category:

chosen <- by(dat4, INDICES= list(Area,Sex,Age), 
          FUN=function(d) {
                 idx <- d[['Infected']]==1 |
                 (d[['Infected']]==0 & sample( (1:NROW(d)) <= d[["ncases"]][1] ))
                 return( d[idx,]}
chosen <- do.call(rbind, chosen)

That last part feels a bit kludgy to me. I am constructing a vector of logical values and then permuting it with the sample function. The by function returns a list with an entry for each category. In this case you need rbind to "stack them up". I suspect there is a more expressive way to do the sampling and that those neural pathways are just in need of more caffeine. (Also untested until you provide a sample dataset for that purpose.)

IRTFM
  • 258,963
  • 21
  • 364
  • 487
  • Many thanks for the response, this script works if i remove by=list(Area,Sex,Age) and replace with list(Area,Sex,Age). I have provided the data.frame as I am sure I am going about the whole process in a very round-about way. – daniel hungerford Apr 04 '13 at 15:23
  • You are exactly right! The second argument to `By` is `INDICES`. Edited my answer. (I'm not a big fan of positional matching but in this case success would occur with that route.) – IRTFM Apr 05 '13 at 01:43
1

A data.table solution.

library(data.table)
ddd <- data.table(dat4)
ddd[, {
  # coerce integer Infected to logical
  # not really necessary,  but for robustness
  ii <- as.logical(Infected)
  # if Infected == 1, then ii == TRUE
  if(ii){ # if TRUE, keep all cases
    .SD
    } else {
        # alternatively keep a sample of
        # the 
    .SD[sample.int(.N, size = ncases)]
    }
  } ,  by=list(Area,Sex,Age,ncases, Infected)]
mnel
  • 113,303
  • 27
  • 265
  • 254