Simple loop leads to memory overflow (base R)

Question

My machine starts swapping memory when I run a pretty simple loop, and I can't see the problem. I am working on a tool to clean time series with 10-minute time steps. The data may have gaps, duplicated time steps, and time steps that fall outside the regular 10-minute interval. My approach is to generate the "clean" time series first and then match the "good" time steps to it. After that I would like to check for time steps outside the regular 10-minute interval. This is where the problem appears. Sorry for the long code:

Test Data Generation:

rm(list = ls())
Sys.setenv(TZ="Europe/Berlin")
Sys.timezone()
DATE = seq( as.POSIXct("2015-03-28 00:00:00", tz="Europe/Berlin"),
            as.POSIXct("2015-04-26 23:00:00", tz="Europe/Berlin"), by = 600)
V1 = round(2*runif(length(DATE)), 2)
DF <- data.frame(DATE, V1)

Adding some "bad" data:

DF2 <- data.frame(DATE= as.POSIXct(c("2015-04-05 05:00:00", 
                                     "2015-04-05 05:00:00", 
                                     "2015-04-10 10:00:00", 
                                     "2015-04-15 15:15:00", 
                                     "2015-04-20 20:02:00", 
                                     "2015-04-26 23:07:00",
                                     "2015-04-26 23:17:00",
                                     "2015-04-26 23:27:00",
                                     "2015-04-26 23:37:00")),
                  V1 = rep(0.77, 9))  # numeric, so rbind() does not turn V1 into character
DF <- rbind(DF, DF2)
DF <- DF[ order(DF$DATE), ]

Defining some time variables and the final "clean" time series:

START_DATE    <- as.POSIXct("2015-03-28 00:00:00", tz="Europe/Berlin") 
END_DATE      <- as.POSIXct("2015-04-26 23:40:00", tz="Europe/Berlin")
tdiff         <- difftime("2015-03-28 00:10:00", "2015-03-28 00:00:00", 
                   tz="Europe/Berlin", units = "mins")
DT            <- seq( START_DATE, END_DATE, by = 600)
DF_clean      <- DF[match(DT,DF$DATE), ]

So far, so good: as you can see, DF_clean already looks pretty good, but the last 4 rows are NA, since those time steps were outside the regular 10-minute interval. So I need to check whether there is any data between these time steps and shift it to the correct 10-minute interval.

for (var in DT[ which( is.na(DF_clean$DATE))]) {
  has.value <- DF$DATE > as.POSIXct(var, origin="1970-01-01") - tdiff & 
               DF$DATE < as.POSIXct(var, origin="1970-01-01")
  DF_clean[as.POSIXct(var, origin="1970-01-01"), ] <- DF[ has.value, ]
}

If I run the body of the for loop manually with var <- "2015-04-26 23:10:00 CEST", it works. Running the whole loop makes the machine start swapping memory. I think it has something to do with the use of POSIXct within the loop and within the [], but I couldn't figure out how to use the - tdiff otherwise.

I haven't tried any packages yet because I am actually interested in a base R solution; I was advised here earlier to avoid packages as long as I don't really understand base R. ;)

It seems you use `as.POSIXct(var, origin="1970-01-01")` as your row index, which is a very large number and leads to the creation of a very large data frame. I think you want to keep the index of your row instead and write to that. Or even better, vectorize the code. – mpjdem, Dec 12 '16 at 15:33
Well yes, that sounds good, but I don't know how! Can you show me the vectorized version? – Pelle, Dec 12 '16 at 15:42
Sorry, I misunderstood: I thought you were looping over rows, but you were not. Still, I think the problem is the same; your row index is very large when re-assigning. – mpjdem, Dec 12 '16 at 15:45
I actually thought I was looping over 4 rows: `> DT[ which( is.na(DF_clean$DATE))] "2015-04-26 23:10:00 CEST" "2015-04-26 23:20:00 CEST" "2015-04-26 23:30:00 CEST" "2015-04-26 23:40:00 CEST"` And I need to index these time steps within the DF_clean data frame. The problem for me was that they don't have the same row numbers. – Pelle, Dec 12 '16 at 15:49
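
A minimal sketch of the effect described in these comments, using a tiny made-up data frame (`df`, `v` and `secs` are illustrative names, not part of the question's code). It assumes the blow-up comes from two base R behaviours: `for` hands over the bare numbers behind a POSIXct vector, and assigning to a row number beyond `nrow()` makes R create every row up to that number.

```r
# Hypothetical mini example, not the asker's data
DT <- seq(as.POSIXct("2015-04-26 23:10:00", tz = "UTC"), by = 600, length.out = 2)

# for() drops the POSIXct class, so the loop variable is a plain number
for (v in DT) print(class(v))   # prints "numeric" for each element
secs <- as.numeric(DT[1])       # seconds since 1970-01-01, on the order of 1.4e9

# Assigning to a row number beyond nrow() creates all rows up to that number
df <- data.frame(a = 1:3, b = c(0.1, 0.2, 0.3))
df[10, ] <- list(4L, 0.4)
nrow(df)                        # 10: rows 4 to 9 were filled with NA
```

With a row index on the order of 1.4e9 instead of 10, the same mechanism would try to allocate over a billion rows, which would explain the swapping.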

score 2 · Accepted Answer · answered Dec 12 '16 at 15:59

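Is this what you are looking for? The loop runs over the row positions of the missing entries and uses `ind` as the row index instead of the timestamp: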

for (ind in which(is.na(DF_clean$DATE))) {
  has.value <- DF$DATE > as.POSIXct(DT[ind], origin="1970-01-01") - tdiff & 
    DF$DATE < as.POSIXct(DT[ind], origin="1970-01-01")
  DF_clean[ind, ] <- DF[ has.value, ]
}

answered Dec 12 '16 at 15:59

Christoph

Thanks, that works! Would you call that "vectorized" now? And if somebody would be so kind: why doesn't this work when I add it to the loop? `rownames(DF_clean[ind, ]) <- ind` – Pelle Dec 13 '16 at 09:23
@Pelle Well kind of vectorized... Perhaps you should read [this](http://stackoverflow.com/questions/2908822/speed-up-the-loop-operation-in-r) anyways;-) – Christoph Dec 13 '16 at 09:47
@Pelle To your question: ind is just the last value. If you like you can use `length(rownames(DF_clean))` giving `[1] 4313` and then `rownames(DF_clean) <- c(1:4313)`... – Christoph Dec 13 '16 at 09:53
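
For the "vectorized" question, here is one possible loop-free sketch in base R. It is not from the answer above; it is an assumption about how the same idea could be written with `match()`. It uses a small made-up data set in UTC (the values and the helper names `step`, `idx`, `slot`, `off_grid` are hypothetical), rounds each off-grid time stamp up to the next 10-minute mark, and fills only the slots that have no exact match. Unlike the loop, it silently takes the first row if several rows fall into the same interval, and it overwrites DATE with the grid time, which is an assumption about what "shifting" should mean.

```r
# Small stand-in data (hypothetical values), in UTC to keep the example simple
step <- 600  # 10 minutes in seconds
DF <- data.frame(
  DATE = as.POSIXct(c("2015-04-26 23:00:00", "2015-04-26 23:07:00",
                      "2015-04-26 23:17:00", "2015-04-26 23:30:00"), tz = "UTC"),
  V1   = c(0.10, 0.77, 0.78, 0.40)
)
DT <- seq(as.POSIXct("2015-04-26 23:00:00", tz = "UTC"),
          as.POSIXct("2015-04-26 23:30:00", tz = "UTC"), by = step)

secs_df <- as.numeric(DF$DATE)
secs_dt <- as.numeric(DT)

# Step 1: rows that sit exactly on the 10-minute grid
idx <- match(secs_dt, secs_df)

# Step 2: round off-grid rows up to the next grid point and use them
# only for grid slots that are still empty
off_grid <- secs_df %% step != 0
slot     <- ifelse(off_grid, ceiling(secs_df / step) * step, NA)
missing  <- is.na(idx)
idx[missing] <- match(secs_dt[missing], slot)

DF_clean <- DF[idx, ]
DF_clean$DATE <- DT          # assumption: shifted rows get the grid time stamp
rownames(DF_clean) <- NULL   # renumber rows 1..n
```

Slots with neither an exact nor a preceding off-grid row stay NA in `idx`, so they still show up as NA rows in `DF_clean`.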
