Select or delete duplicate records based on a rule

Question

I have the following dataset that contains id, sex, and a numeric variable, xvar.

id <- c(1,1,1,1,2,2,3,3,4,4,4,5,5)
sex <- c(1,1,1,1,2,2,2,2,1,1,1,2,2)
xvar <- c(10,11,10,12,9,9.1,10,10.4,3,2.9,4,11,11.1)
df <- data.frame(id,sex,xvar)

For each id, I want to check the min and max of xvar. If 1.05*min(xvar) >= max(xvar), I need to keep all records for that id. Otherwise, I need to delete them.

For example, for id 1, min(xvar) = 10 and max(xvar) = 12. Since 1.05*10 = 10.5 < 12, the records for id 1 should be deleted.

Another example is id 5. Here min(xvar) = 11 and max(xvar) = 11.1. Since 1.05*11 = 11.55 >= 11.1, the records for id 5 should be kept.
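To make the rule concrete, here is a minimal base R sketch that tabulates the check for every id. The names lo, hi, keep, and rule_check are only illustrative and are not part of the original question.

```r
id <- c(1,1,1,1,2,2,3,3,4,4,4,5,5)
sex <- c(1,1,1,1,2,2,2,2,1,1,1,2,2)
xvar <- c(10,11,10,12,9,9.1,10,10.4,3,2.9,4,11,11.1)
df <- data.frame(id, sex, xvar)

# Per-id minimum and maximum of xvar
lo <- tapply(df$xvar, df$id, min)
hi <- tapply(df$xvar, df$id, max)

# TRUE where the rule 1.05 * min >= max holds for that id
keep <- 1.05 * lo >= hi

# One row per id: the two extremes and whether the id's records are kept
rule_check <- data.frame(
  id   = as.numeric(names(lo)),
  min  = as.vector(lo),
  max  = as.vector(hi),
  keep = as.vector(keep)
)
print(rule_check)
```

With this data the rule holds for ids 2, 3, and 5, so the desired output has six rows.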

`df[as.logical(with(df, ave(xvar, id, FUN = function(x) 1.05*min(x) >= max(x)))), ]` — rawr, Nov 17 '15 at 15:15
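The one-liner in the comment above can be unpacked roughly as follows. This is only a step-by-step sketch of the same base R idea; the names flag and kept are illustrative.

```r
df <- data.frame(
  id   = c(1,1,1,1,2,2,3,3,4,4,4,5,5),
  sex  = c(1,1,1,1,2,2,2,2,1,1,1,2,2),
  xvar = c(10,11,10,12,9,9.1,10,10.4,3,2.9,4,11,11.1)
)

# ave() evaluates FUN within each id and repeats the single result
# for every row of that id. Because xvar is numeric, the logical
# result is stored as 0/1 rather than FALSE/TRUE.
flag <- ave(df$xvar, df$id, FUN = function(x) 1.05 * min(x) >= max(x))

# Convert 0/1 back to logical and use it as a row index
kept <- df[as.logical(flag), ]
print(kept)
```

The as.logical() call is needed because indexing with the raw 0/1 values would be treated as positional indexing rather than as a TRUE/FALSE mask.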

MichaelChirico · Accepted Answer · 2015-11-17T15:16:38.587


This can be done with data.table as:

library(data.table)
setDT(df)
output <- df[ , if (1.05 * min(xvar) >= max(xvar)) .SD, by = id]

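by = id (invisibly) partitions the table into length(unique(id)) data.tables, one for each value of id; within each of these, we compare the minimum and maximum of xvar and return the entire sub-table (i.e., .SD) only if your condition is met. Since the if has no else branch, a group that fails the condition returns NULL and contributes no rows to the output.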

Some more about .SD:

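First, notice that .SD appears in the j argument, which is usually a list of columns or a list of expressions involving columns, so .SD must also be a list. What list is it? Within each group, it holds that group's rows for every column of the data.table except the grouping column(s) named in by (here, id); id still appears in the output because by adds it back.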

(See ?data.table for more advanced usage, e.g., the .SDcols argument, which lets us restrict .SD to a subset of columns.)
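A small sketch of what .SD contains, and of how .SDcols narrows it, might look like this. It assumes the data.table package is installed; the names dt, sd_cols, and kept_x are only illustrative.

```r
library(data.table)

dt <- data.table(
  id   = c(1,1,1,1,2,2,3,3,4,4,4,5,5),
  sex  = c(1,1,1,1,2,2,2,2,1,1,1,2,2),
  xvar = c(10,11,10,12,9,9.1,10,10.4,3,2.9,4,11,11.1)
)

# Which columns does .SD contain inside each id group?
sd_cols <- dt[, .(cols = paste(names(.SD), collapse = ", ")), by = id]
print(sd_cols)

# Same rule as in the answer, but .SD is restricted to xvar via .SDcols,
# so the result carries only id (from by) and xvar (from .SD)
kept_x <- dt[, if (1.05 * min(xvar) >= max(xvar)) .SD, by = id, .SDcols = "xvar"]
print(kept_x)
```

Restricting .SD this way is mostly useful on wide tables, where returning every column for each group is unnecessary.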

edited Nov 17 '15 at 15:16

answered Nov 17 '15 at 15:05


Thank you. by the way, what does .SD do? – user9292 Nov 17 '15 at 15:13
@user9292 see edit. I'm with you that it's a bit nebulous at first; let me know whether I've explained it well enough. – MichaelChirico Nov 17 '15 at 15:17
Thank you, @MichaelChirico. That's helpful. – user9292 Nov 17 '15 at 15:21

score 2 · Answer 2 · answered Nov 17 '15 at 15:11


You can do this in dplyr, too:

library(dplyr)
df2 <- df %>%
  group_by(id) %>%
  dplyr::filter(1.05 * min(xvar) >= max(xvar))

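group_by splits the data into 'blocks', one per id, and the filter condition is then evaluated within each block in turn. Because the condition yields a single TRUE or FALSE per block, all rows of a block are kept or dropped together.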
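To illustrate the 'blocks' idea, here is a rough base R sketch of the same split, test, and recombine logic. It is only meant to show the concept and is not how dplyr works internally; the names blocks, kept_blocks, and df2_base are illustrative.

```r
df <- data.frame(
  id   = c(1,1,1,1,2,2,3,3,4,4,4,5,5),
  sex  = c(1,1,1,1,2,2,2,2,1,1,1,2,2),
  xvar = c(10,11,10,12,9,9.1,10,10.4,3,2.9,4,11,11.1)
)

# Split the data frame into one block per id
blocks <- split(df, df$id)

# Keep only the blocks whose xvar values satisfy the rule
kept_blocks <- Filter(function(b) 1.05 * min(b$xvar) >= max(b$xvar), blocks)

# Stack the surviving blocks back into a single data frame
df2_base <- do.call(rbind, kept_blocks)
rownames(df2_base) <- NULL
print(df2_base)
```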


Pash101


why use `dplyr::filter` (as opposed to just `filter`)? Is it overloaded? – MichaelChirico Nov 17 '15 at 15:55
I added this to prevent any confusion with the filter function from the stats package (which would run, but not give the correct solution). I've noted some users have had this issue before (http://stackoverflow.com/questions/26935095/r-dplyr-filter-not-masking-base-filter) – Pash101 Nov 17 '15 at 16:00
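For context, stats::filter is a linear filter for time series rather than a row subsetter, which is why mixing the two up is easy to miss. A minimal sketch of what it does on a plain numeric vector (the names x, ma, and values are illustrative):

```r
x <- c(1, 2, 3, 4, 5)

# With three equal weights, stats::filter computes a centred
# moving average; the two end positions have no full window
ma <- stats::filter(x, rep(1/3, 3))
print(ma)

# The result is a time series object, not a subset of rows
values <- as.numeric(ma)
print(class(ma))
```

Writing dplyr::filter explicitly, as in the answer, avoids depending on which package happens to mask the other.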
