1

In reality I have a really messy situation. I have 60 single tables which contain data between the dates 2009-01-01 to 2017-09-30. But the values are not continuous for each day. There is data for one day per month in a few tables or in three days intervals. Sometimes there is a value for every day per month.

I want to find out for which date per year the frequency of the data is highest. I need this for interpolations afterwards.

My idea: I build a data frame and in the first column are the dates from 2009-01-01 to 2017-09-30 continuously. Now I want to fill up this data frame with the 60 tables where the data is not continous.

I need a code for matching data to the right date inside the data frame WholeData(see example). And I don't need the date of the single tables anymore, because it's already in the first column.

Example code simplified:

    df1 <- sample(seq(as.Date('2009-01-01'), as.Date('2009-09-30'), by = "day"), 
    12)
    df1 <- sort(df1)
    expenses1 <- sample(180, 12)

    df1 <- data.frame(df1, expenses1)

    df2 <- sample(seq(as.Date('2009-01-01'), as.Date('2009-09-30'), by = "day"), 
    12)
    df2 <- sort(df2)

    expenses2 <- sample(180, 12)

    df2 <- data.frame(df2, expenses2)

    WholeData <- seq(as.Date("2009-01-01"), by = 1, as.Date("2009-09-30"))
    df <- data.frame(WholeData)

df1 and df2 standing for my 60 messy tables. Time interval is reduced, too.

Ekatef
  • 1,061
  • 1
  • 9
  • 12
Hark
  • 115
  • 1
  • 11

1 Answers1

1

First of all, I would recommend organize all your data frames into a list:

data_list <- list(df, df1, df2)

Here is a perfect explanation why it is important and which more advanced (and scalable!) approaches may be used.

Besides, it makes sense to set the same name for all the columns which contains the date values:

for (i in seq(along.with = data_list)) {
    colnames(data_list[[i]])[1] <- "date"
}

The "date" column will be a key column for further joint of the data frames.

Now, when preprocessing is done, you may build the final data frame choosing one of the available methods.

# with base R
res_1 <- Reduce(function(dtf1, dtf2) merge(dtf1, dtf2, by = "date", all.x = TRUE),
        data_list)

#using tidyverse tools
library(tidyverse) 

# with purr package
res_2 <- data_list %>% purrr::reduce(full_join, by = "date")

# with dplyr package
res_3 <- data_list %>%
    Reduce(function(dtf1, dtf2) dplyr::full_join(dtf1, dtf2, by = "date"), .)
Ekatef
  • 1,061
  • 1
  • 9
  • 12