How to merge large hourly rainfall data

Question

I have 10 stations with 15 years of hourly rainfall data. Every series has gaps of some hours, or even days, at random points. I want to use every hourly time step over the 15-year period as the index and then combine all the station data into one data frame, giving NA (or even the average of the neighbouring values) wherever a station has a gap. Any suggestions for doing this in R?

For instance I have data frame rainfall_1

date                       station210
1994-01-01 00:00:00 0
1994-01-01 02:00:00 0
1994-01-01 03:00:00 0
1994-01-01 04:00:00 0.6
1994-01-01 06:00:00 2.6
1994-01-01 07:00:00 3.2

and the second station is rainfall_2

date                      station212
1994-01-01 00:00:00 0
1994-01-01 01:00:00 1.8
1994-01-01 02:00:00 1.8
1994-01-01 03:00:00 1.8
1994-01-01 04:00:00 1.4
1994-01-01 06:00:00 1.8

When I try merge, I get:

merge(rainfall_1, rainfall_2, all=TRUE)
date               station_210  station_212
1994-01-01 00:00:00 0.0 0.0
1994-01-01 02:00:00 0.0 1.8
1994-01-01 03:00:00 0.0 1.8
1994-01-01 04:00:00 0.6 1.4
1994-01-01 06:00:00 2.6 1.8

There are two problems: the result is missing the 2nd row of the second data frame (rainfall_2), and it is missing the 5th hour, which is not present in either data frame. I am looking for a solution in which the second time step (the 2nd hour) is included, with the missing value set to NA or the average of its neighbours, and in which the 5th-hour time step is given NA or the average of its neighbours for both stations.

Imagine that you have time series like these with thousands of records for many stations, with values missing at random intervals.

Please help us help you by providing us with a reproducible example (i.e. code and example data), see http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example for details. — Paul Hiemstra, Jun 28 '13 at 06:54
Just to emphasize the need, R's time handling functions are ... let's say, different. Really, really need to see the data. Experience has shown that new users of SO rarely provide such data after polite requests, so am voting to close. But please show that I am wrong and edit your question to prevent closure. — IRTFM, Jun 28 '13 at 07:02
It would be easier for responders if you also show the output from `dput(rainfall_1)` and `dput(rainfall_2)` . — G. Grothendieck, Jun 28 '13 at 13:15

Mark Miller · Answer 1 · 2013-07-06T21:34:53.567

I think this might do what you want. I am not sure why the final merged dataset begins at 3:00PM on Dec 31 instead of midnight Jan 1st. I suspect that has something to do with my computer's clock relative to GMT.

df.1 <- read.table(text = '
date       time     station210
1994-01-01 00:00:00 0
1994-01-01 02:00:00 0
1994-01-01 03:00:00 0
1994-01-01 04:00:00 0.6
1994-01-01 06:00:00 2.6
1994-01-01 07:00:00 3.2
', header = TRUE, stringsAsFactors=FALSE)

df.2 <- read.table(text = '
 date       time    station212
1994-01-01 00:00:00 0
1994-01-01 01:00:00 1.8
1994-01-01 02:00:00 1.8
1994-01-01 03:00:00 1.8
1994-01-01 04:00:00 1.4
1994-01-01 06:00:00 1.8
', header=TRUE, stringsAsFactors=FALSE)

cols <- c( 'date' , 'time' )

df.1$datetime <- apply( df.1[ , cols ] , 1 , paste , collapse = " " )
df.2$datetime <- apply( df.2[ , cols ] , 1 , paste , collapse = " " )

df.1 <- df.1[, c('datetime', 'station210')]
df.2 <- df.2[, c('datetime', 'station212')]

df.3 <- merge(df.1, df.2, by="datetime", all=TRUE)

df.3[order(df.3$datetime),]

df.3$datetime <- format(as.POSIXct(df.3$datetime, format = "%Y-%m-%d %H:%M:%S"),  "%Y-%m-%d %H:%M:%S" )
df.3

hour <- seq(0,60*60*24,by=60*60)

datetime <- as.POSIXlt(hour, origin="1994-01-01")

datetime <-  format( as.POSIXct(hour, origin="1994-01-01"), "%Y-%m-%d %H:%M:%S"  )

newdf <- merge(data.frame(datetime), df.3, all.x=TRUE, by="datetime")
newdf

              datetime station210 station212
1  1993-12-31 15:00:00         NA         NA
2  1993-12-31 16:00:00         NA         NA
3  1993-12-31 17:00:00         NA         NA
4  1993-12-31 18:00:00         NA         NA
5  1993-12-31 19:00:00         NA         NA
6  1993-12-31 20:00:00         NA         NA
7  1993-12-31 21:00:00         NA         NA
8  1993-12-31 22:00:00         NA         NA
9  1993-12-31 23:00:00         NA         NA
10 1994-01-01 00:00:00        0.0        0.0
11 1994-01-01 01:00:00         NA        1.8
12 1994-01-01 02:00:00        0.0        1.8
13 1994-01-01 03:00:00        0.0        1.8
14 1994-01-01 04:00:00        0.6        1.4
15 1994-01-01 05:00:00         NA         NA
16 1994-01-01 06:00:00        2.6        1.8
17 1994-01-01 07:00:00        3.2         NA
18 1994-01-01 08:00:00         NA         NA
19 1994-01-01 09:00:00         NA         NA
20 1994-01-01 10:00:00         NA         NA
21 1994-01-01 11:00:00         NA         NA
22 1994-01-01 12:00:00         NA         NA
23 1994-01-01 13:00:00         NA         NA
24 1994-01-01 14:00:00         NA         NA
25 1994-01-01 15:00:00         NA         NA
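
A note on the 15:00 start: this is most likely a time-zone effect. When `as.POSIXct` is given numeric seconds plus an `origin`, the origin is taken as UTC, while `format` prints in the machine's local zone, so the grid shifts by the local offset. The sketch below builds the hourly grid in one explicit zone instead. The choice of `tz = "UTC"` is an assumption, not something stated in the question; use whatever zone the stations actually record in.

```r
# Build the hourly grid directly as POSIXct in one explicit time zone.
# tz = "UTC" is an assumption; replace it with the stations' real zone.
start <- as.POSIXct("1994-01-01 00:00:00", tz = "UTC")
grid  <- seq(start, by = "hour", length.out = 25)

# Format in the same zone so the strings line up with the station timestamps.
datetime <- format(grid, "%Y-%m-%d %H:%M:%S", tz = "UTC")

head(datetime, 3)
```

For the full 15-year record the same idea applies: give `seq` a `to =` end time (also created with an explicit `tz`) instead of `length.out`.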

EDIT - July 6, 2013

Here is one way to handle more than two data frames.

Here are the data:

df.1 <- read.table(text = '
date       time     station210
1994-01-01 00:00:00 0
1994-01-01 02:00:00 0
1994-01-01 03:00:00 0
1994-01-01 04:00:00 0.6
1994-01-01 06:00:00 2.6
1994-01-01 07:00:00 3.2
', header = TRUE, stringsAsFactors=FALSE)

df.2 <- read.table(text = '
 date       time    station212
1994-01-01 00:00:00 0
1994-01-01 01:00:00 1.8
1994-01-01 02:00:00 1.8
1994-01-01 03:00:00 1.8
1994-01-01 04:00:00 1.4
1994-01-01 06:00:00 1.8
', header=TRUE, stringsAsFactors=FALSE)

df.3 <- read.table(text = '
 date       time    station214
1993-12-31 22:00:00 5.0
1993-12-31 23:00:00 2.0
1994-01-01 02:00:00 1.0
1994-01-01 04:00:00 3.0
1994-01-01 06:00:00 5.0
1994-01-01 08:00:00 4.0
', header=TRUE, stringsAsFactors=FALSE)

Create a list of data frames and create the variable datetime:

my.data <- sapply(paste('df.', seq(1,3,1), sep=''), get, environment(), simplify = FALSE) 

date.time <- function(x) { 
                      cols <- c( 'date' , 'time' )
                      x$datetime <- apply( x[ , cols ] , 1 , paste , collapse = " " )
                      x <- x[, 3:4]
                      return(x)
             }

my.list <- lapply(my.data, function(x) date.time(x))

Merge and sort the data frames in that list:

df.3 <- Reduce(function(...) merge(..., all=TRUE), my.list)
df.3[order(df.3$datetime),]

Add missing dates and times to the merged data frame:

df.3$datetime <- format(as.POSIXct(df.3$datetime, format = "%Y-%m-%d %H:%M:%S"),  "%Y-%m-%d %H:%M:%S" )

hour <- seq(0,60*60*24,by=60*60)

datetime <- as.POSIXlt(hour, origin="1994-01-01")

datetime <-  format( as.POSIXct(hour, origin="1994-01-01"), "%Y-%m-%d %H:%M:%S"  )

newdf <- merge(data.frame(datetime), df.3, all.x=TRUE, by="datetime")
newdf

Here is code to replace missing observations from a station with the mean of the preceding and following observations from that same station. I am using nested for-loops which are likely highly inefficient. If I figure out a more efficient approach I will try to remember to post it here. If your data set is huge, these nested for-loops may take a very long time to run.

newdf2 <- newdf

for(i in 1:nrow(newdf)) {
     for(j in 2:ncol(newdf)) {

          if(i == 1 &                   is.na(newdf[i,j]))  newdf2[i,j] = newdf[i+1,j]
          if(i ==         nrow(newdf) & is.na(newdf[i,j]))  newdf2[i,j] = newdf[i-1,j]
          if(i >  1 & i < nrow(newdf) & is.na(newdf[i,j]))  newdf2[i,j] = mean(c(newdf[i-1,j], newdf[i+1,j]), na.rm=TRUE) 
          if(is.nan(newdf2[i,j]))                           newdf2[i,j] = NA

     }
}

cbind(newdf, newdf2)
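
If the nested loops turn out to be too slow, the same rule can be written per column with shifted copies of the vector. This is only a sketch: `fill_neighbours` is a hypothetical helper name, and the small `newdf` below is made-up sample data in the same shape as the merged frame above. Like the loops, it fills each NA from the original neighbouring values (mean of both, or whichever one exists) and leaves NA where both neighbours are missing.

```r
# Hypothetical helper: replace each NA with the mean of its immediate neighbours.
fill_neighbours <- function(x) {
  prev <- c(NA, x[-length(x)])                   # value one step earlier
  nxt  <- c(x[-1], NA)                           # value one step later
  m <- rowMeans(cbind(prev, nxt), na.rm = TRUE)  # NaN when both neighbours are NA
  m[is.nan(m)] <- NA
  x[is.na(x)] <- m[is.na(x)]                     # only the gaps are overwritten
  x
}

# Small illustrative data set (assumed, mirrors the example stations).
newdf <- data.frame(
  datetime   = sprintf("1994-01-01 %02d:00:00", 0:7),
  station210 = c(0, NA, 0, 0, 0.6, NA, 2.6, 3.2),
  station212 = c(0, 1.8, 1.8, 1.8, 1.4, NA, 1.8, NA)
)

newdf2 <- newdf
newdf2[-1] <- lapply(newdf[-1], fill_neighbours)  # every column except datetime
newdf2
```

Because the work is done on whole columns, the cost grows with the number of stations rather than with rows times stations.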

Hi Mark Miller, thank you, that is working for me. One question, though: is there a way to put the average of the previous and next values of the time series in place of NA? — Wuletawu Abera, Jun 28 '13 at 13:06
Hi Mark Miller, when I do it for more than two stations, I get the error message "Error in fix.by(by.x, x) : 'by' must specify column(s) as numbers, names or logical". What do you suggest to solve this problem? Thanks in advance — Wuletawu Abera, Jun 28 '13 at 17:01
@Wuletawu Abera I now have tried to respond to your questions above. — Mark Miller, Jul 06 '13 at 22:00

score 0 · Answer 2 · answered Jun 28 '13 at 12:14

Assume rainfall_1 and rainfall_2 have POSIXct date columns. Now, convert the data frames to zoo objects and add a third zoo object z3 (equal to z2) to the example just to show it's not limited to two inputs. Merge all three (zoo's merge handles multiway merges) to give zz and then merge the combined time series with a zero-width grid, z0.

library(zoo)

# set up input zoo objects
z1 <- read.zoo(rainfall_1, FUN = identity)
z2 <- read.zoo(rainfall_2, FUN = identity)
z3 <- z2

zz <- merge(z1, z2, z3)
z0 <- zoo(, seq(start(zz), end(zz), by = "hour"))
zout <- merge(zz, z0)

This gives:

> zout
                     z1  z2  z3
1994-01-01 00:00:00 0.0 0.0 0.0
1994-01-01 01:00:00  NA 1.8 1.8
1994-01-01 02:00:00 0.0 1.8 1.8
1994-01-01 03:00:00 0.0 1.8 1.8
1994-01-01 04:00:00 0.6 1.4 1.4
1994-01-01 05:00:00  NA  NA  NA
1994-01-01 06:00:00 2.6 1.8 1.8
1994-01-01 07:00:00 3.2  NA  NA
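
The question also asked about filling gaps with the average of the neighbours. One possible approach, sketched here under assumptions, is zoo's `na.approx`, which interpolates linearly along the time index. On a regular hourly grid a one-hour gap therefore receives the mean of the two adjacent values. The object below is rebuilt by hand (two columns only, UTC assumed) so the example is self-contained.

```r
library(zoo)

# Rebuild a small hourly series like zout above (values copied from the example;
# tz = "UTC" is an assumption).
idx  <- seq(as.POSIXct("1994-01-01 00:00:00", tz = "UTC"), by = "hour", length.out = 8)
zout <- zoo(cbind(z1 = c(0, NA, 0, 0, 0.6, NA, 2.6, 3.2),
                  z2 = c(0, 1.8, 1.8, 1.8, 1.4, NA, 1.8, NA)),
            idx)

# maxgap = 1 fills only single-hour gaps; na.rm = FALSE keeps leading and
# trailing NAs in place instead of dropping those rows.
zfill <- na.approx(zout, maxgap = 1, na.rm = FALSE)
zfill
```

Raising `maxgap` would let longer gaps be interpolated too, which may or may not be sensible for rainfall; that choice is left to the reader.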

You may wish to just leave this in zoo to take advantage of its other facilities too but if you do want to turn it back into a data frame:

library(ggplot2)
dfout <- fortify(zout)
