This old question has been viewed over ten thousand times now.
Although it has an accepted answer I feel the question deserves
- a genuine
data.table
solution,
- an explanation why
Date
fails with ifelse()
and
- why the
replace()
approach returns the wrong results.
data.table
approach
With data.table
, ifelse()
and replace()
can be written as two chained assignment operations where the second one uses subsetting:
dt[, z := y][is.na(z), z := x][]
x y z
1: 2014-01-01 <NA> 2014-01-01
2: 2015-01-01 2015-06-01 2015-06-01
3: 2016-01-01 <NA> 2016-01-01
The first assignment operation creates a new column z
by copying the y
column. The second assignment operation modifies z
in place by copying the contents of the x
column only to those rows where z
is NA
.
Alternatively, we can take a copy of x
first and the replace the z
values with the non-NA
y
values:
dt <- copy(dt_orig) # use a fresh copy of dt
dt[, z := x][!is.na(y), z := y][]
The latter might be more efficient if there are many NA
values in y
.
Errors in the replace()
approaches
Frank has suggested to use replace()
instead of ifelse()
which was picked up by C8H10N4O2 in an edit of his answer. Unfortunately, both codes not only generate warnings but simply return the wrong result:
dt <- copy(dt_orig) # use a fresh copy of dt
# C8H10N4O2's version
dt[, z := replace(y, is.na(y), x)][]
dt <- copy(dt_orig) # use a fresh copy of dt
# Frank's version
dt[, z := replace(y, which(is.na(y)), x)][]
x y z
1: 2014-01-01 <NA> 2014-01-01
2: 2015-01-01 2015-06-01 2015-06-01
3: 2016-01-01 <NA> 2015-01-01
Warning message:
In NextMethod(.Generic) :
number of items to replace is not a multiple of replacement length
The value of z
in row 3 has been copied from x
in row 2 which is wrong. Instead, it should have been copied from row 3.
What has happened here? The help page on replace(x, list, values)
says
replace
replaces the values in x
with indices given in list
by those given in values
.
In our example, list
gets the row indices 1, 3
while values
gets 2014-01-01, 2015-01-01, 2016-01-01
. The different lengths is the reason for the warning message. And it is obvious that the second index in list
which is row 3 is replaced by the second value in values
which is 2015-01-01
.
The correct use of replace()
requires to subset x
as well:
dt <- copy(dt_orig) # use a fresh copy of dt
dt[, z := replace(y, is.na(y), x[is.na(y)])][]
which yields
x y z
1: 2014-01-01 <NA> 2014-01-01
2: 2015-01-01 2015-06-01 2015-06-01
3: 2016-01-01 <NA> 2016-01-01
without any warning.
Why Date
fails with ifelse()
The help page on ifelse(test, yes, no)
has a long Warning section which starts
The mode of the result may depend on the value of test
[...], and the class attribute [...] of the result is
taken from test
and may be inappropriate for the values selected from
yes
and no
.
Sometimes it is better to use a construction such as
(tmp <- yes; tmp[!test] <- no[!test]; tmp)
Applying this suggestion to our example
dt <- copy(dt_orig) # use a fresh copy of dt
dt[, z := {tmp <- x; tmp[!is.na(y)] <- y[!is.na(y)]; tmp}][]
we do get
x y z
1: 2014-01-01 <NA> 2014-01-01
2: 2015-01-01 2015-06-01 2015-06-01
3: 2016-01-01 <NA> 2016-01-01
Data
library(data.table) # version 1.11.4 used
dt_orig <-data.table(x = as.Date(c("2014-1-1", "2015-1-1", "2016-1-1")),
y = as.Date(c(NA, "2015-6-1", NA)))
Benchmark
As there are 5 different approaches available now, I was wondering what the fastest method is. The run time may depend on the number of rows but also on the share of NA
values in y
.
So, the press()
function from the bench
package is used to investigate the impact of the two parameters on benchmarks results.
bm <- bench::press(
n_rows = c(100, 1E4, 1E6),
na_share = c(0.1, 0.5, 0.9),
{
dt_bm <- data.table(x = as.Date("1970-01-01") + seq_len(n_rows),
y = as.Date("2970-01-01") + seq_len(n_rows))
set.seed(1L)
dt_bm[sample(seq_len(n_rows), na_share * n_rows), y := NA]
bench::mark(
ifelse = copy(dt_bm)[, z := as.Date(ifelse(is.na(y), x, y), origin="1970-01-01")][],
replace = copy(dt_bm)[, z := replace(y, is.na(y), x[is.na(y)])][],
tmp = copy(dt_bm)[, z := {tmp <- x; tmp[!is.na(y)] <- y[!is.na(y)]; tmp}][],
copy_y = copy(dt_bm)[, z := y][is.na(z), z := x][],
copy_x = copy(dt_bm)[, z := x][!is.na(y), z := y][]
)
}
)
library(ggplot2)
autoplot(bm) + theme_bw()
