Efficiently removing missing values from the start and end of multiple time series in 1 data frame

Question

Using R, I'm trying to trim NA values from the start and end of a data frame that contains multiple time series. I have achieved my goal using a for loop and the zoo package, but as expected it is extremely inefficient on large data frames.

My data frame look like this and contains 3 columns with each time series identified by it's unique id. In this case AAA, B and CCC.

id   date          value
AAA  2010/01/01    NA
AAA  2010/02/01    34
AAA  2010/03/01    35
AAA  2010/04/01    30
AAA  2010/05/01    NA
AAA  2010/06/01    28
B    2010/01/01    NA
B    2010/02/01    0
B    2010/03/01    1
B    2010/04/01    2
B    2010/05/01    3
B    2010/06/01    NA
B    2010/07/01    NA
B    2010/07/01    NA
CCC  2010/01/01    0
CCC  2010/02/01    400
CCC  2010/03/01    300
CCC  2010/04/01    200
CCC  2010/05/01    NA

I would like to know, how can I efficiently remove the NA values from the start and end of each time series, in this case AAA, B and CCC. So it should look like this.

id   date          value
AAA  2010/02/01    34
AAA  2010/03/01    35
AAA  2010/04/01    30
AAA  2010/05/01    NA
AAA  2010/06/01    28
B    2010/02/01    0
B    2010/03/01    1
B    2010/04/01    2
B    2010/05/01    3
CCC  2010/01/01    0
CCC  2010/02/01    400
CCC  2010/03/01    300
CCC  2010/04/01    200

I have identified the unique id's (60,000 of them) Then used a for loop to loop through them, each time then creating a subset for the code and creating a zoo object Then using zoo packages trim function to strip leading and trailing missing values Then rbinding each to a new data frame that will contain the trimmed time series data in the end. As expected, this is very ineficient. — sizeight, May 30 '12 at 07:11

Matt Dowle · Accepted Answer · 2012-05-30T10:45:08.000

7

I would do it like this, which should be very fast :

require(data.table)
DT = as.data.table(your data)   # please provide something pastable

DT2 = DT[!is.na(value)]
setkey(DT,id,date)
setkey(DT2,id,date)
tokeep = DT2[DT,!is.na(value),rolltolast=TRUE,mult="last"]
DT = DT[tokeep]

This works by rolling forward the prevailing non-NA, but not past the last one, within each group.

The mult="last" is optional. It should speed it up if v1.8.0 (on CRAN) is used. Interested in timings with and without it. By default data.table joins to groups (mult="all"), but in this case we're joining to all columns of the key, and, we know the key is unique; i.e., no dups in key. In v1.8.1 (in dev) there isn't a need to know about this and it looks after you more.

edited May 30 '12 at 10:45

answered May 30 '12 at 09:25

Matt Dowle

58,872
22
166
224

I was preparing an answer with `data.table`, too, but it's good that Matthew beat me to it. However, one alternative would be to use the `na.trim` function from the `zoo` package. Something like `DT[, na.trim(.SD), by = id`, since that function accepts objects other than `zoo` objects. – BenBarnes May 30 '12 at 10:07
@BenBarnes Sounds good, and shorter. Interesting to see which is faster. – Matt Dowle May 30 '12 at 10:26
@Matthew Dowle Works perfectly and extremely fast. Does exactly what I was interested in and I can prove to the die hard SAS coders at my office once again that R is a viable alternative. I think this executed way faster that their SAS alternative. – sizeight May 30 '12 at 11:59
@BenBarnes I'm going to look into your suggestion about using na.trim as well. – sizeight May 30 '12 at 12:00
@sizeight, Matthew's answer is much faster than mine (especially with a large amount of data). Trust the person who made the tools to know how to use them best! – BenBarnes May 30 '12 at 12:56
@Matthew Dowle I upgraded to data.table 1.8.8 today and ran into problems with the above solution. I get an error "Attempting roll join on factor column x.date. Only integer, double or character colums may be roll joined." Any suggestions? – sizeight Jul 16 '13 at 14:38

score 0 · Answer 2 · edited May 23 '17 at 12:08

If your data is in data frame data

fun <- function(x)
{
    x$value[is.na(x$value)] <- "NA"
    tmp <- rle(x$value)
    values <- tmp$values
    lengths <- tmp$lengths
    n <- length(values)

    nr <- nrow(x)
    id <- c()
    if(values[1] == "NA") id <- c(id, 1:lengths[1])
    if(values[n] == "NA") id <- c(id, (nr-lengths[n]+1):nr)
    if(length(id) == 0)return(x)
    x[-id,]
}

do.call(rbind,
        by(data, INDICES=data$id,
           FUN=fun))

Not the most elegant solution I guess. In the mood of this post.

Efficiently removing missing values from the start and end of multiple time series in 1 data frame

2 Answers2

Linked