1

I have a data.table like so

dt <- data.table(x=as.Date(c("2014-1-1", "2015-1-1", "2016-1-1")), y=as.Date(c(NA, "2015-6-1", NA)))
dt
            x          y
1: 2014-01-01       <NA>
2: 2015-01-01 2015-06-01
3: 2016-01-01       <NA>

I want to add a column z which is equal to y where y is not NA, and x otherwise.

dt[, z:=ifelse(is.na(y), x, y)]
dt
            x          y     z
1: 2014-01-01       <NA> 16071
2: 2015-01-01 2015-06-01 16587
3: 2016-01-01       <NA> 16801

But for some reason the above statement casts z to numeric. If I try to convert it to a date with as.Date I get an error

dt[, z:=as.Date(ifelse(is.na(y), x, y))]
Error in as.Date.numeric(ifelse(is.na(y), x, y)) : 'origin' must be supplied

What gives and how do I accomplish what I'm trying to do?

Uwe
  • 41,420
  • 11
  • 90
  • 134
Ben
  • 20,038
  • 30
  • 112
  • 189
  • Regarding "for some reason [`ifelse`] casts z to numeric", see [How to prevent ifelse() from turning Date objects into numeric objects](https://stackoverflow.com/questions/6668963/how-to-prevent-ifelse-from-turning-date-objects-into-numeric-objects) – Henrik Jul 15 '18 at 19:30

3 Answers3

8

This old question has been viewed over ten thousand times now.

Although it has an accepted answer I feel the question deserves

  • a genuine data.table solution,
  • an explanation why Date fails with ifelse() and
  • why the replace() approach returns the wrong results.

data.table approach

With data.table, ifelse() and replace() can be written as two chained assignment operations where the second one uses subsetting:

dt[, z := y][is.na(z), z := x][]
            x          y          z
1: 2014-01-01       <NA> 2014-01-01
2: 2015-01-01 2015-06-01 2015-06-01
3: 2016-01-01       <NA> 2016-01-01

The first assignment operation creates a new column z by copying the y column. The second assignment operation modifies z in place by copying the contents of the x column only to those rows where z is NA.

Alternatively, we can take a copy of x first and the replace the z values with the non-NA y values:

dt <- copy(dt_orig)   # use a fresh copy of dt
dt[, z := x][!is.na(y), z := y][]

The latter might be more efficient if there are many NA values in y.

Errors in the replace() approaches

Frank has suggested to use replace() instead of ifelse() which was picked up by C8H10N4O2 in an edit of his answer. Unfortunately, both codes not only generate warnings but simply return the wrong result:

dt <- copy(dt_orig)   # use a fresh copy of dt
# C8H10N4O2's version 
dt[, z := replace(y, is.na(y), x)][]

dt <- copy(dt_orig)   # use a fresh copy of dt
# Frank's version
dt[, z := replace(y, which(is.na(y)), x)][]
            x          y          z
1: 2014-01-01       <NA> 2014-01-01
2: 2015-01-01 2015-06-01 2015-06-01
3: 2016-01-01       <NA> 2015-01-01
Warning message:
In NextMethod(.Generic) :
  number of items to replace is not a multiple of replacement length

The value of z in row 3 has been copied from x in row 2 which is wrong. Instead, it should have been copied from row 3.

What has happened here? The help page on replace(x, list, values) says

replace replaces the values in x with indices given in list by those given in values.

In our example, list gets the row indices 1, 3 while values gets 2014-01-01, 2015-01-01, 2016-01-01. The different lengths is the reason for the warning message. And it is obvious that the second index in list which is row 3 is replaced by the second value in values which is 2015-01-01.

The correct use of replace() requires to subset x as well:

dt <- copy(dt_orig)   # use a fresh copy of dt
dt[, z := replace(y, is.na(y), x[is.na(y)])][]

which yields

            x          y          z
1: 2014-01-01       <NA> 2014-01-01
2: 2015-01-01 2015-06-01 2015-06-01
3: 2016-01-01       <NA> 2016-01-01

without any warning.

Why Date fails with ifelse()

The help page on ifelse(test, yes, no) has a long Warning section which starts

The mode of the result may depend on the value of test [...], and the class attribute [...] of the result is taken from test and may be inappropriate for the values selected from yes and no.

Sometimes it is better to use a construction such as

(tmp <- yes; tmp[!test] <- no[!test]; tmp)

Applying this suggestion to our example

dt <- copy(dt_orig)   # use a fresh copy of dt
dt[, z := {tmp <- x; tmp[!is.na(y)] <- y[!is.na(y)]; tmp}][]

we do get

            x          y          z
1: 2014-01-01       <NA> 2014-01-01
2: 2015-01-01 2015-06-01 2015-06-01
3: 2016-01-01       <NA> 2016-01-01

Data

library(data.table)   # version 1.11.4 used
dt_orig <-data.table(x = as.Date(c("2014-1-1", "2015-1-1", "2016-1-1")), 
                y = as.Date(c(NA, "2015-6-1", NA)))

Benchmark

As there are 5 different approaches available now, I was wondering what the fastest method is. The run time may depend on the number of rows but also on the share of NA values in y.

So, the press() function from the bench package is used to investigate the impact of the two parameters on benchmarks results.

bm <- bench::press(
  n_rows = c(100, 1E4, 1E6),
  na_share = c(0.1, 0.5, 0.9),
  {
    dt_bm <- data.table(x = as.Date("1970-01-01") + seq_len(n_rows),
                        y = as.Date("2970-01-01") + seq_len(n_rows))
    set.seed(1L)
    dt_bm[sample(seq_len(n_rows), na_share * n_rows), y := NA]
    bench::mark(
      ifelse = copy(dt_bm)[, z := as.Date(ifelse(is.na(y), x, y), origin="1970-01-01")][],
      replace = copy(dt_bm)[, z := replace(y, is.na(y), x[is.na(y)])][],
      tmp = copy(dt_bm)[, z := {tmp <- x; tmp[!is.na(y)] <- y[!is.na(y)]; tmp}][],
      copy_y = copy(dt_bm)[, z := y][is.na(z), z := x][],
      copy_x = copy(dt_bm)[, z := x][!is.na(y), z := y][]
    )
  }
)

library(ggplot2)
autoplot(bm) + theme_bw()

enter image description here

Uwe
  • 41,420
  • 11
  • 90
  • 134
4

When R looks at dates as integers, its origin is January 1, 1970.

https://stats.idre.ucla.edu/r/faq/how-does-r-handle-date-values/

dt[, z:=as.Date(ifelse(is.na(y), x, y), origin="1970-01-01")]

update: as Frank suggests, this also seems to work and does not seem to require un-coercion: dt[, z:=replace(y, is.na(y), x)]. It throws a warning so use w/ caution.

Uwe
  • 41,420
  • 11
  • 90
  • 134
C8H10N4O2
  • 18,312
  • 8
  • 98
  • 134
  • instead of `ifelse`, `replace` is natural here. Also, with help files, you might want to provide the code to reach them `?dates`, rather than a link. – Frank Jul 07 '15 at 16:58
  • @Frank - thanks for the hint, although I did not find `?dates` terribly illuminating. I am also not sure why `replace` throws `number of items to replace is not a multiple of replacement length` since `is.na(y)` and `x` are both length 3. – C8H10N4O2 Jul 07 '15 at 17:17
  • 1
    Oh, my bad. I misread your url as a ethz one (which just mirrors built-in docs). The replace syntax is `replace(y,which(is.na(y)),x)` ... not sure if that resolves the error you're hitting. – Frank Jul 07 '15 at 17:21
  • Both variants using `replace()` return the wrong result. The warning must not be ignored. See [here](https://stackoverflow.com/a/51349025/3817004) for details. – Uwe Jul 15 '18 at 13:55
2
dt[, z:=as.Date(ifelse(is.na(y), x, y),origin="1970-01-01")]
dt
            x          y          z
1: 2014-01-01       <NA> 2014-01-01
2: 2015-01-01 2015-06-01 2015-06-01
3: 2016-01-01       <NA> 2016-01-01
Alexey Ferapontov
  • 5,029
  • 4
  • 22
  • 39