1

I need to check all the months in my weather data set to make sure there are values for each day at each station. If days are missing, or have -9999 then I would like to just use the mean. In my original code I will actually be using the closest weather station, but for testing purposes I'm just using the mean for that month.

I'm looking for a more efficient way to code this problem. How would I make this code more efficient? This seems like such a hack job, but I don't know another way to do it. I would appreciate any suggestions.

dput:

    df <- structure(list(id = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L), .Label = "USC00036918", class = "factor"), 
    element = structure(c(7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 
    7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 
    7L, 7L, 7L, 7L, 7L), .Label = c("DAPR", "MDPR", "PRCP", "SNOW", 
    "SNWD", "TMAX", "TMIN", "TOBS", "WT01", "WT03", "WT04", "WT05", 
    "WT06", "WT07", "WT08", "WT10", "WT11", "WT14", "WT16", "WT18"
    ), class = "factor"), year = c(1900, 1900, 1900, 1900, 1900, 
    1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 
    1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 
    1900, 1900, 1900, 1900), month = c(1, 1, 1, 1, 1, 1, 1, 1, 
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
    1, 1), day = c("01", "02", "03", "04", "05", "06", "07", 
    "08", "09", "10", "11", "12", "13", "14", "15", "16", "17", 
    "18", "19", "21", "22", "23", "24", "25", "26", "28", "29", 
    "30", "31"), date = c("19000101", "19000102", "19000103", 
    "19000104", "19000105", "19000106", "19000107", "19000108", 
    "19000109", "19000110", "19000111", "19000112", "19000113", 
    "19000114", "19000115", "19000116", "19000117", "19000118", 
    "19000119", "19000121", "19000122", "19000123", "19000124", 
    "19000125", "19000126", "19000128", "19000129", "19000130", 
    "19000131"), value = c(19.94, 15.08, 12.92, 28.04, 39.92, 
    46.04, 44.96, 37.94, 50, 53.06, 42.98, 28.04, 33.98, 39.92, 
    -9999, 53.96, 53.06, 39.92, 37.94, 33.08, 30.92, 35.06, 42.08, 
    37.04, 23, 26.06, 12.92, 19.94, 21.92)), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -29L), .Names = c("id", 
"element", "year", "month", "day", "date", "value"))

Code:

# Get number of days in month
c_month <- paste0(df[1,3], "-", df[1,4], "-01" )
n_month <- paste0(df[1,3], "-", as.numeric(1+df[1,4]), "-01")
days <- as.numeric(difftime( as.Date(n_month), as.Date(c_month) ))
days_count <- as.numeric(1:days)


# Check df for number of days
n_days <- nrow(df)
n_days_count <- as.numeric(df$day)

# Find differences
diff <- setdiff(days_count, n_days_count)
miss <- which(df$value == -9999)

# If -9999 exists, replace with mean
if (length(miss >= 1)) {
  check <- filter(df, value != -9999)
  df[miss,7] = mean(check$value)
}

# If days are missing, insert with mean  
if (n_days < days){
  for (i in 1:length(diff)){
    add_df <- df[1,]
    add_df$day <- diff[i]
    add_df$date <- paste0(add_df$year, add_df$month, diff[i])
    add_df$value <- mean(check$value)
    df <- rbind(df, add_df)
  }
}
Vedda
  • 7,066
  • 6
  • 42
  • 77
  • I don't really follow what's happening with your code here, but I usually use `seq` to enumerate all the dates and then merge on that, leaving `NA` on days that were in the `seq` but not in the original data. – Frank Nov 11 '15 at 19:39
  • I am also unsure of exactly what you're doing because the way that you've phrased your question is unclear. But it's a possible duplicate: http://stackoverflow.com/questions/33003819/how-to-insert-missing-observations-on-a-data-frame/ – alexwhitworth Nov 11 '15 at 19:41

1 Answers1

1

Here's a solution using a little dplyr.

First we make a date column, by pasting together your year, month and day columns.

Then we make a dataframe with a sequence from your minimum to maximum date.

Then we merge the two to get the missing days added to your data:

df$date <- as.Date(paste0(df$year, "-", df$month, "-", df$day))
tomerge <- data.frame(date = seq(min(df$date), max(df$date), "days"))
newdat <- merge(tomerge, df, by = "date", all.x = TRUE)

Next we group by year and month, and fill any NAs in with the mean of that month:

library(dplyr)
library(lubridate)

newdat %>% mutate(value = ifelse(value == -9999, NA, value)) %>%
           group_by(year = year(date), month = month(date)) %>%
           mutate(value = ifelse(is.na(value), mean(value, na.rm=TRUE), value))
jeremycg
  • 24,657
  • 5
  • 63
  • 74
  • 1
    Thanks for the answer, but I just checked this and it didn't fix the -9999 or fill in the missing dates. Notice that day 20 and 27 are missing and day 15 has -9999. – Vedda Nov 11 '15 at 20:02
  • This fixed the -9999, but the inserted days are now NA. Any idea? – Vedda Nov 11 '15 at 20:10
  • This is getting close. `Error: could not find function "year"` – Vedda Nov 11 '15 at 20:16
  • How would you fill in the id, element, and day? I love `dplyr` and this solution is perfect – Vedda Nov 11 '15 at 20:18
  • id use `tidyr::fill(id, element)` and `mutate(day = day(date))`. If you have missing first of the month data, `tidyr::fill` will break though -you could look into `tidyr::complete` – jeremycg Nov 11 '15 at 20:20
  • Yeah that will be a problem. Maybe i'll just use base R to fill it in by finding the NA's and just filling it in. – Vedda Nov 11 '15 at 20:21