0

I am wondering in data.table what is the most efficient or cleanest way to select rows based on the occurrence of some column values. For example, in a 7 column data table, with each value being either 1 or 0, I want all rows where there are exactly 2 values of 1 and 5 values of 0 (1 representing "presence" and 0 "absence").

So far, here is what I am doing, assuming the following data.table (much bigger, here is just a sample of it)

                                    name D2A1.var D2B3.var D3A1.var D4A3.var D5B3.var H2A3.var H4A4.var MA_ancestor.var
Chrom_1;10000034;G;A Chrom_1;10000034;G;A        1        1        1        1        1        1        1               1
Chrom_1;10000035;G;A Chrom_1;10000035;G;A        1        1        1        1        1        1        1               1
Chrom_1;10000042;C;A Chrom_1;10000042;C;A        1        1        1        1        1        1        1               1
Chrom_1;10000051;A;G Chrom_1;10000051;A;G        1        1        1        1        1        1        1               1
Chrom_1;10000070;G;A Chrom_1;10000070;G;A        1        1        1        1        1        1        1               1
Chrom_1;10000084;C;T Chrom_1;10000084;C;T        1        1        1        1        1        1        1               1
Chrom_6;9997224;AT;A               Chrom_6;9997224;AT;A        0        0        0        0        0        1        0               1
Chrom_6;9998654;GTGTGTGTT;G Chrom_6;9998654;GTGTGTGTT;G        0        0        0        0        0        0        0               1
Chrom_6;9999553;TTTC;T           Chrom_6;9999553;TTTC;T        0        0        0        0        0        0        0               1

and if I want all rows where I have 7 1 and let's say only 1 in D2A1.var and D3A1.var I am doing the following

ALL = DT[DT$MA_ancestor.var == 1 & DT$D2A1.var == 1 &DT$D2B3.var == 1 & DT$D3A1.var == 1 & DT$D4A3.var == 1 &DT$D5B3.var == 1 & DT$H2A3.var == 1 & DT$H4A4.var == 1,]
TWO = DT[DT$MA_ancestor.var == 0 & DT$D2A1.var == 1 &DT$D2B3.var == 0 & DT$D3A1.var == 1 & DT$D4A3.var == 0 &DT$D5B3.var == 0 & DT$H2A3.var == 0 & DT$H4A4.var == 0,]
DFlist=list(TWO, ALL)
DFlong = rbindlist(DFlist, use.names = TRUE, idcol = FALSE)

This returns the expected result and is fast enough. However when having multiple conditions it's a lot of typing and a lot of data.table creations. Is there a faster, cleaner and more compact way of achieving this?

sorens
  • 4,975
  • 3
  • 29
  • 52
Alessandro
  • 41
  • 6
  • "For example, in a 7 column data table, with each value being either 1 or 0, I want all rows where there are exactly 2 values of 1 and 5 values of 0." I'd just do `Reduce("+", .SD) == 2` and use that for subsetting. – Roland Aug 06 '19 at 14:59

3 Answers3

1

We can make use of the .SDcols by specifying the columns of interest. Loop through the Subset of Data.table (.SD) create a list of logical vector and Reduce it to a single logical vector with &

ALL <- DT[, Reduce(`&`, lapply(.SD, `==`, 1), .SDcols = nm1]
TWO <- DT[, Reduce(`&`, lapply(.SD, `==`, 0), .SDcols = nm1]

where

nm1 <- names(DT)[-1] #or change the names accordingly
akrun
  • 874,273
  • 37
  • 540
  • 662
1

Is there a faster, cleaner and more compact way of achieving this?

Doing separate queries and rbinding as you do is probably simplest.

You can simplify each query by using replace and join syntax:

# make a list of columns initially set to value 0
vec0 = lapply(DT[, .SD, .SDcols=D2A1.var:MA_ancestor.var], function(x) 0)

# helper function for semi join
subit = function(x, d = DT) d[x, on=names(x), nomatch=0]

rbind(
  subit(replace(vec0, names(vec0), 1)),
  subit(replace(vec0, c("D2A1.var", "D3A1.var"), 1))
)

(This code is not tested since OP's data is not easily reproducible.)

You could probably simplify further like...

subitall = function(..., d = DT, v0 = vec0) 
  rbindlist(lapply(..., function(x) subit( replace(v0, names(v0), 1), d = d )))

subitall( names(vec0), c("D2A1.var", "D3A1.var") )

Regarding the function subit for subsetting / semi-join, you could modify it to meet your needs based on answers in Perform a semi-join with data.table


EDIT: Oh right, following @chinsoon's answer, you could also rbind first:

subit(rbindlist(list(
  replace(vec0, names(vec0), 1),
  replace(vec0, c("D2A1.var", "D3A1.var"), 1)      
)))

This would mean joining only once, which is simpler.

Frank
  • 66,179
  • 8
  • 96
  • 180
1

Another option using setkey:

setkeyv(DT, names(DT))

#create desired filtering conditions as lists
cond1 <- setNames(as.list(rep(1, ncol(DT))), names(DT))
cond2 <- list(MA_ancestor.var=0, D2A1.var=1, D2B3.var=0, D3A1.var=1, D4A3.var=0, D5B3.var=0, H2A3.var=0, H4A4.var=0)

#get list of conditions so that one does not have to type it one by one
scond <- grep("^cond", ls(), value=TRUE)
DT[rbindlist(mget(scond, envir=.GlobalEnv), use.names=TRUE)]

If you are worried about picking up spurious variable starting with cond, you can assign them to an environment using list2env and pass the envir into mget.

data:

DT <- fread("D2A1.var D2B3.var D3A1.var D4A3.var D5B3.var H2A3.var H4A4.var MA_ancestor.var
1        1        1        1        1        1        1               1
1        1        1        1        1        1        1               1
1        1        1        1        1        1        1               1
1        1        1        1        1        1        1               1
1        1        1        1        1        1        1               1
1        1        1        1        1        1        1               1
0        0        0        0        0        1        0               1
0        0        0        0        0        0        0               1
0        0        0        0        0        0        0               1")
chinsoon12
  • 25,005
  • 4
  • 25
  • 35