Let's say we have two time-series data.tables, one sampled by day and the other by hour:

dtByDay
     EURO       TIME    ... and some other columns
    <num>     <POSc>
 1:  0.95 2017-01-20
 2:  0.97 2017-01-21
 3:  0.98 2017-01-22
 ...

dtByHour
                  TIME           TEMP   ... also some other columns
                <POSc>          <num>
1: 2017-01-20 00:00:00           22.45
2: 2017-01-20 01:00:00           23.50
3: 2017-01-20 02:00:00           23.50
...

and we need to merge them so that all columns end up together. What's a nice way of doing it?

Evidently `dtByDay[dtByHour]` does not produce the desired outcome (as one might have hoped): you get `NA` in the "EURO" column ...

h3rm4n
IVIM
  • I think this is where you want `roll = TRUE`. These posts may help: https://gormanalysis.com/r-data-table-rolling-joins/ , https://stackoverflow.com/questions/12030932/rolling-joins-data-table-in-r – emilliman5 Jun 09 '17 at 20:29
  • You are not asking a programming question, are you? You are seeking a statistical method for merging coarser data with another dataframe. – M-- Jun 09 '17 at 20:50
  • You might find [this answer](https://stackoverflow.com/a/42360403/496488) helpful. – eipi10 Jun 09 '17 at 21:16
  • This is a purely programming question. I'm just aggregating data from different sources so that I can start doing data modeling from there. Cheers. – IVIM Jun 12 '17 at 19:01

2 Answers

It seems that roll = TRUE might give you odd behavior if a date is present in one data frame but not the other, so I wanted to post this alternative:

Starting with your original data frames:

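library(lubridate)  # ymd(), ymd_hms(), year(), month(), day()
library(dplyr)      # %>%, rowwise(), mutate(), full_join(), group_by(), summarize()

dtbyday <- data.frame( EURO = c(0.95,0.97,0.98),
                       TIME = c(ymd("2017-01-20"),ymd("2017-01-21"),ymd("2017-01-22")))
dtbyhour <- data.frame( TEMP = c(22.45,23.50,23.40),
                       TIME = c(ymd_hms("2017-01-21 00:00:00"),ymd_hms("2017-01-21 01:00:00"),ymd_hms("2017-01-21 02:00:00")))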

I converted dtbyhour$TIME to the same format as dtbyday$TIME using lubridate functions:

dtbyhour <- dtbyhour %>%
            rowwise() %>%
            mutate( TIME = ymd( paste( year(TIME), month(TIME), day(TIME), sep="-" ) ) )
dtbyhour

# A tibble: 3 x 2
   TEMP       TIME
  <dbl>     <date>
1 22.45 2017-01-20
2 23.50 2017-01-20
3 23.40 2017-01-20

NOTE: The date changed from 2017-01-21 to 2017-01-20 because of time zone issues.
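A date shift of this kind can be reproduced in base R. The sketch below is only an illustration of how the same instant falls on different calendar dates depending on the time zone used for truncation; the zone "America/New_York" is an assumption chosen for the example, not something stated in the answer.

```r
# One instant, stored in UTC (01:00 on 21 January 2017).
x <- as.POSIXct("2017-01-21 01:00:00", tz = "UTC")

# Truncating to a date depends on the time zone passed to as.Date().
date_utc <- as.Date(x, tz = "UTC")               # calendar date in UTC
date_ny  <- as.Date(x, tz = "America/New_York")  # same instant, hypothetical local zone

print(date_utc)
print(date_ny)
```

Passing `tz` explicitly when truncating timestamps keeps the result independent of the machine's locale settings.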

Then use dplyr::full_join to join by TIME, which keeps all records from both tables and fills in values wherever a match exists. You'll need to aggregate the byHour values for each day; I calculated the mean TEMP below.

new.dt <- full_join( dtbyday, dtbyhour, by = c("TIME") ) %>%
          group_by( TIME ) %>%
          summarize( EURO = unique( EURO ),
                     TEMP = mean( TEMP, na.rm = TRUE ) )

# A tibble: 3 x 3
        TIME  EURO     TEMP
      <date> <dbl>    <dbl>
1 2017-01-20  0.95 23.11667
2 2017-01-21  0.97      NaN
3 2017-01-22  0.98      NaN
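If the hourly rows should be kept rather than averaged, one possible variation is to derive a date column from the hourly timestamps and left-join the daily table onto it. This is a minimal sketch under assumptions: the toy values and the names `by_day`, `by_hour`, `DAY` and `merged` are hypothetical and not part of the answer above.

```r
library(dplyr)

# Hypothetical toy data: two daily values and four hourly readings
# that straddle midnight between the two days.
by_day <- data.frame(
  DAY  = as.Date(c("2017-01-20", "2017-01-21")),
  EURO = c(0.95, 0.97)
)
by_hour <- data.frame(
  TIME = as.POSIXct("2017-01-20 22:00:00", tz = "UTC") + 3600 * (0:3),
  TEMP = c(22.45, 23.50, 23.40, 23.10)
)

# Add the calendar date of each hourly row (explicit tz avoids date shifts),
# then attach the matching daily value to every hour of that day.
merged <- by_hour %>%
  mutate(DAY = as.Date(TIME, tz = "UTC")) %>%
  left_join(by_day, by = "DAY")

print(merged)
```

Each hourly row keeps its own TEMP and receives the EURO value of its calendar day, so no aggregation step is needed.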
CPak

Big thanks to the comments above! The solution is as easy as adding the roll=Inf argument when joining:

dtByDay[dtByHour, roll=Inf]

That's exactly what I needed. For each hourly row it takes the most recent dtByDay value, so each day's value is used for all hours of that day. Note that the hourly table goes inside the brackets: x[i] returns one row per row of i. The output (from my application) is shown below.

For other applications, you may also consider roll="nearest". This takes the dtByDay value closest in time, so each midnight value is used for the hours just before and just after that midnight:

dtByDay[dtByHour, roll="nearest"]
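To see how the rolling options behave, here is a small self-contained sketch on toy data modelled on the question. The values are illustrative, and the join column is named explicitly with `on = "TIME"` as an assumption, since the original tables may instead have been keyed. The hourly table is placed inside the brackets because `x[i]` returns one row per row of `i`.

```r
library(data.table)

# Toy data: three daily values and 48 hourly readings (illustrative only).
dtByDay <- data.table(
  TIME = as.POSIXct(c("2017-01-20", "2017-01-21", "2017-01-22"), tz = "UTC"),
  EURO = c(0.95, 0.97, 0.98)
)
dtByHour <- data.table(
  TIME = as.POSIXct("2017-01-20 00:00:00", tz = "UTC") + 3600 * (0:47),
  TEMP = seq(20, 25, length.out = 48)
)

# Plain join: only exact timestamp matches (the midnight rows) get a EURO value.
plain <- dtByDay[dtByHour, on = "TIME"]

# Rolling join: each hour takes the most recent daily value at or before it.
locf <- dtByDay[dtByHour, on = "TIME", roll = Inf]

# Nearest join: each hour takes the daily value closest in time.
nearest <- dtByDay[dtByHour, on = "TIME", roll = "nearest"]

print(head(locf, 3))
```

In this sketch `plain` has a non-missing EURO only at the two midnight rows, `locf` has the first day's value for hours 0 to 23 and the second day's for hours 24 to 47, and `nearest` switches to the next day's value once an hour is closer to the following midnight.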

[Image: output from the application]

IVIM