
I have the following dataset that contains id, sex, and a numeric variable, xvar.

id <- c(1,1,1,1,2,2,3,3,4,4,4,5,5)
sex <- c(1,1,1,1,2,2,2,2,1,1,1,2,2)
xvar <- c(10,11,10,12,9,9.1,10,10.4,3,2.9,4,11,11.1)
df <- data.frame(id,sex,xvar)

For each id, I want to check the min and max of xvar. If 1.05*min(xvar) >= max(xvar), I need to keep all records for that id; otherwise, I need to delete them.

For example, for id 1, min(xvar) = 10 and max(xvar) = 12. Since 1.05*10 = 10.5 < 12, the records for id 1 should be deleted.

Another example is id 5: min(xvar) = 11 and max(xvar) = 11.1. Since 1.05*11 = 11.55 > 11.1, the records for id 5 should be kept.
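
To make the rule concrete for every id, here is a minimal base R sketch that tabulates the per-id minimum and maximum and evaluates the condition. The names `mins`, `maxs`, `check`, and `keep_ids` are introduced only for this illustration and are not part of the original question.

```r
id <- c(1,1,1,1,2,2,3,3,4,4,4,5,5)
sex <- c(1,1,1,1,2,2,2,2,1,1,1,2,2)
xvar <- c(10,11,10,12,9,9.1,10,10.4,3,2.9,4,11,11.1)
df <- data.frame(id, sex, xvar)

# Per-id minimum and maximum of xvar (tapply returns one value per id)
mins <- tapply(df$xvar, df$id, min)
maxs <- tapply(df$xvar, df$id, max)

# One row per id, with a flag saying whether the condition holds
check <- data.frame(id  = as.numeric(names(mins)),
                    min = as.vector(mins),
                    max = as.vector(maxs))
check$keep <- 1.05 * check$min >= check$max

# ids whose records should be kept
keep_ids <- check$id[check$keep]
print(check)
print(keep_ids)  # 2 3 5
```

With this data, ids 1 and 4 fail the condition, so the expected result contains only the six rows belonging to ids 2, 3, and 5.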

Erik Gillespie
user9292
  • `df[as.logical(with(df, ave(xvar, id, FUN = function(x) 1.05*min(x) >= max(x)))), ]` – rawr Nov 17 '15 at 15:15
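
The one-liner in the comment above can be unpacked as follows. This is a sketch of how it appears to work, not an authoritative account: `ave` returns a vector as long as `xvar` in which each group's single result is repeated for every row of that group, and because `xvar` is numeric the logical result is stored as 0/1, hence the `as.logical`. The names `keep_flag` and `kept` are introduced here for illustration.

```r
id <- c(1,1,1,1,2,2,3,3,4,4,4,5,5)
sex <- c(1,1,1,1,2,2,2,2,1,1,1,2,2)
xvar <- c(10,11,10,12,9,9.1,10,10.4,3,2.9,4,11,11.1)
df <- data.frame(id, sex, xvar)

# One 0/1 flag per row: 1 if the row's id satisfies the condition, else 0
keep_flag <- ave(df$xvar, df$id, FUN = function(x) 1.05 * min(x) >= max(x))

# Convert the flags back to logical and use them to subset the rows
kept <- df[as.logical(keep_flag), ]
print(kept)
```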

2 Answers


This can be done with data.table as:

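library(data.table)
setDT(df)  # convert df to a data.table by reference (no copy is made)
# For each id, return that group's rows (.SD) only if the condition holds
output <- df[ , if (1.05 * min(xvar) >= max(xvar)) .SD, by = id]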

by = id (invisibly) partitions the table into length(unique(id)) data.tables, one for each value of id. Within each of these, we compare the minimum and maximum of xvar and return the entire sub-table (i.e., .SD) only if your condition is met; when it is not, the if returns NULL and that id contributes no rows to the output.

Some more about .SD:

First, notice that .SD appears in the j argument, which is usually a list of columns or a list of expressions involving columns, so .SD must also be a list. What list is it? Within each group, it is a data.table (and hence a list of columns) holding that group's rows for every column except those named in by.

(See ?data.table for more advanced usage, e.g., the .SDcols argument which allows us to specify a subset of columns to be denoted by .SD)
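
As a rough illustration of the points about .SD and .SDcols, the sketch below first lists which columns .SD contains within each group, then repeats the filter with .SD restricted to xvar. It assumes the data.table package is installed; the names `dt`, `sd_cols`, and `out` are made up for this example.

```r
library(data.table)

dt <- data.table(
  id   = c(1,1,1,1,2,2,3,3,4,4,4,5,5),
  sex  = c(1,1,1,1,2,2,2,2,1,1,1,2,2),
  xvar = c(10,11,10,12,9,9.1,10,10.4,3,2.9,4,11,11.1)
)

# Columns visible through .SD in each group: the by column (id) is not among them
sd_cols <- dt[, .(cols = paste(names(.SD), collapse = ", ")), by = id]
print(sd_cols)

# Same filter as in the answer, but .SD is limited to xvar via .SDcols,
# so the result has only the id and xvar columns
out <- dt[, if (1.05 * min(xvar) >= max(xvar)) .SD, by = id, .SDcols = "xvar"]
print(out)
```

Note that .SDcols only changes what .SD contains; xvar can still be referenced directly in the condition.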

MichaelChirico

You can do this in dplyr, too:

library(dplyr)
df2 <- df %>%
  group_by(id) %>%
  dplyr::filter(1.05 * min(xvar) >= max(xvar))

group_by splits the data into 'blocks', one per id, and the filter condition is then evaluated within each block in turn, so min(xvar) and max(xvar) are computed per id rather than over the whole table.
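
To see the per-block evaluation explicitly, the following sketch first computes the condition once per id with summarise, then applies the same filter as above. It assumes dplyr is installed; `flags` and `kept` are names chosen for this illustration only.

```r
library(dplyr)

df <- data.frame(
  id   = c(1,1,1,1,2,2,3,3,4,4,4,5,5),
  sex  = c(1,1,1,1,2,2,2,2,1,1,1,2,2),
  xvar = c(10,11,10,12,9,9.1,10,10.4,3,2.9,4,11,11.1)
)

# One row per id showing whether that block passes the condition
flags <- df %>%
  group_by(id) %>%
  summarise(keep = 1.05 * min(xvar) >= max(xvar))
print(flags)

# The grouped filter keeps every row of the blocks that pass
kept <- df %>%
  group_by(id) %>%
  dplyr::filter(1.05 * min(xvar) >= max(xvar)) %>%
  ungroup()  # drop the grouping so later operations act on the whole table
print(kept)
```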

Pash101
  • why use `dplyr::filter` (as opposed to just `filter`)? Is it overloaded? – MichaelChirico Nov 17 '15 at 15:55
  • I added this to prevent any confusion with the filter function from the stats package (which would run, but not give the correct solution). I've noted some users have had this issue before (http://stackoverflow.com/questions/26935095/r-dplyr-filter-not-masking-base-filter) – Pash101 Nov 17 '15 at 16:00