
This post concerns subsetting data using package data.table based on a compound condition including a logical AND operator, in particular differences in results obtained with & vs &&.

Environment: R version 3.2.1 (2015-06-18), x86_64-w64-mingw32/x64 (64-bit), Windows 10 Pro, data.table 1.9.4.

I’m subsetting a data.table used in a regression call; details of the model are suppressed below, but the data clause of the call is reproduced in full.

lm( y ~ u + v + w, data=DT[condo != 1 &<&> apt != 1] ) 

Inclusion of the second & (in angle brackets) gives an alternate form of the expression.

Data.table DT has approx. 25,000 rows. Variables condo and apt are never-null dummies taking values in {0,1}. As it turns out in the instance I’m working on, variable apt is always 0.

Using a single & selects rows of DT as desired, excluding rows where condo == 1. When both ampersands && are used, however, no rows are excluded and the regression is run against all of DT.
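A minimal base-R sketch of the two behaviors, using a hypothetical four-row data frame `df` as a stand-in for DT (the values are made up for illustration, and it assumes data.table recycles a logical `i` the same way base R does). Under R 3.2.1, `&&` on longer vectors silently compared only the first elements; from R 4.3.0 on, the same call is an error, so the sketch spells out the first-element comparison explicitly.

```r
# Hypothetical stand-in for DT; apt is always 0, as in the question.
df <- data.frame(condo = c(0, 1, 0, 1), apt = c(0, 0, 0, 0))

# Elementwise AND: one logical value per row.
keep_vec <- df$condo != 1 & df$apt != 1

# What `&&` effectively evaluated in R 3.2.1: first elements only,
# giving a single logical value.
keep_scalar <- df$condo[1] != 1 && df$apt[1] != 1

nrow(df[keep_vec, ])     # 2: rows with condo == 1 are dropped
nrow(df[keep_scalar, ])  # 4: the single TRUE is recycled over every row
```

Because the first row has condo = 0 and apt = 0, the scalar result is TRUE, which would account for no rows being excluded.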

So my question(s): Why does this happen? How is package data.table processing the i condition against the rows of DT? Does the distinguished behavior of && with respect to condo[1] and apt[1] explain the observed behavior? (In the first row of the data.table, condo = 0 and apt = 0.)

And a bonus question: Under what conditions should a condition such as condo != 1 be written as condo != 1L, given R’s storage of (undeclared) ints as doubles? This isn’t just an idle question; data subsetting based on the values of dummies arises frequently in my work.
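As a small illustration of the `1` vs `1L` point, the sketch below compares both literals against an integer column and a double column. The vectors `condo_int` and `condo_dbl` are hypothetical. It only shows that the comparison results agree; it says nothing about speed.

```r
# Hypothetical dummy columns stored as integer and as double.
condo_int <- c(0L, 1L, 0L)
condo_dbl <- c(0, 1, 0)

typeof(1)   # "double"
typeof(1L)  # "integer"

# The comparison coerces as needed, so both literals give the same result.
same_int <- identical(condo_int != 1, condo_int != 1L)  # TRUE
same_dbl <- identical(condo_dbl != 1, condo_dbl != 1L)  # TRUE

# The literals themselves are still different objects.
identical(1, 1L)  # FALSE
```

For dummies that hold exactly 0 or 1, the choice of literal should therefore not change which rows are selected.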

Jaap
jackw19
  • Sorry -- my question got the wrong title. it should be: Subsetting R Data.Table with Compound Condition. – jackw19 Aug 14 '15 at 18:45
  • as for the `1` vs `1L` - if your `class(condo)` is `integer`, it'll be more efficient to use `1L`, though the time savings are generally minimal except in heavy loops – eddi Aug 14 '15 at 19:19
  • @jackw19 I suppose it's possible that the linked answers might shed light on this question but the behavior of logical tests inside `[.data.table` might not necessarily be as expected. Put in a comment if you want this reopened. It would also be good to post an edit that includes a working example. – IRTFM Aug 14 '15 at 19:25
  • @BondedDust I suppose it's possible that if you read the OP, that might shed light why the linked question answers exactly the issue OP is having ;) – eddi Aug 14 '15 at 19:47
  • Scanning the linked answers indicates this is perhaps a somewhat more complex question than I first thought. I'll supply a simple dataset that illustrates my question. – jackw19 Aug 17 '15 at 03:02
  • I'm wondering how data.table evaluates subsetting conditions that don't include the moral equivalent of SQL aggregation functions. In an SQL select statement, as I understand the data model, the machine / package would grab a row at a time from the from-table, evaluate the where-condition(s) looking at just that current row, output or skip that row, and get on to the next. I don't know if that describes the way data.table processes i-conditions, or if it uses some deeply clever vectorized method. Presumably that would bear on the workings of scalar vs vectorized logical operators. – jackw19 Aug 17 '15 at 03:16
  • @jackw19 the exact same way as pretty much most subset operations in R. Any boolean expression is recycled to have the same length as the entire `data.table`, so a single `TRUE` selects all of the rows, while e.g. `c(T,F)` would select all odd rows. – eddi Aug 17 '15 at 20:51
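The recycling rule described in the last comment can be checked on a plain vector. This is a base-R sketch with a hypothetical vector `x`; it assumes, as the comment states, that a logical `i` in data.table is recycled the same way.

```r
x <- 1:6

all_rows <- x[TRUE]            # a single TRUE is recycled: every element kept
odd_rows <- x[c(TRUE, FALSE)]  # the pattern repeats: positions 1, 3, 5 kept

all_rows  # 1 2 3 4 5 6
odd_rows  # 1 3 5
```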
