1

In R data.table I have this table:

  Company <- c ("A", "A" , "A", "A", "B" , "B", "B", "B")
  TopManger <- c(1, 1, 1, 0, 1, 1, 0, 0)
  Salary <- c(300, 300, 300, NA, 250,250, NA, 100)
  tbl = data.table(company, TopManger, Salary)

that looks like:

    Company TopManger Salary
1:       A         1    300
2:       A         1    300
3:       A         1    300
4:       A         0     NA
5:       B         1    250
6:       B         1    250
7:       B         0     NA
8:       B         0    100

I perform this query:

tbl[, HighPayedComp := as.numeric(Salary[TopManger == 1] > 200), by=Company]

And I expect to get this:

   Company TopManger Salary HighPayedComp
1:       A         1    300             1
2:       A         1    300             1
3:       A         1    300             1
4:       A         0     NA             1
5:       B         1    250             1
6:       B         1    250             1
7:       B         0     NA             1
8:       B         0    100             1

In this minimal example it works but in my big data.table in some arbitrary but fixed situations coerces NA where logically it shouldn't: For example in line 5 I have NA for HighPayedComp.

Amin
  • 57
  • 1
  • 8
  • 1
    It would be helpful if you provided a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) to make it more clear what's going on. Describe precisely the behavior you expect. – MrFlick Sep 07 '16 at 03:24

1 Answers1

2

It could be due to the presence of NA in the 'TopManager' column. The == returns NA whenever there is a NA. It can be avoided by either using %in% or creating another condition with !is.na

tbl[, HighPayedComp:= as.integer(salary[TopManager == 1 & 
                         !is.na(TopManager)] > 200), by = Company]

Or using %in%

tbl[, HighPayedComp:= as.integer(salary[TopManager %in% 1] > 200), by = Company]

This is because of the behavior of ==

v1 <- c(NA, 1, 3)
v1 ==1
#[1]    NA  TRUE FALSE

whereas

!is.na(v1) & v1==1
#[1] FALSE  TRUE FALSE

takes care of the NA elements

If we use %in%, the code is basically doing a match and converting it to logical (> 0L)

`%in%`
#function (x, table) 
#match(x, table, nomatch = 0L) > 0L

In the vector above, the first element is NA and doing the logical comparison with == returns NA.

It is also described in the ?"=="

Missing values (NA) and NaN values are regarded as non-comparable even to themselves, so comparisons involving them will always result in NA. Missing values can also result when character strings are compared and one is not valid in the current collation locale.

data

tbl <- data.table(Company = rep(LETTERS[1:3], each =2), TopManager = c(1, 0, NA,
  1, 1, NA), salary = c(250, 300, 220, 180, 240, 200))
akrun
  • 874,273
  • 37
  • 540
  • 662
  • It works but doesn't make sense to me. In some rows where it coerces unwanted NA, it's not compared to NA. – Amin Sep 07 '16 at 04:20
  • You are ignoring recycling. `x[cond] < y` could have any length between `0` and `.N` (in principle) and yet you are assigning it as if it recycles safely. – Frank Sep 07 '16 at 06:20