1

I'm a complete beginner to R and I just need to do some quick cleaning of my data. But I ran into a problem I can't wrap my head around.

So I have a Postgres db with timeseries, Columns are ID, DATE and VALUE (temperature). Each ID is a new measuring station, so I have a time serie for each id (around 2000 unique ids, 4m rows). The dates span from 1915-2016, some series are overlapping some are not. If there is missing measurement from a week I want to fill those weeks with an NA value (which i interpolate after).

The problem i run into is that complete(Date.seq) creates NA values for all weeks between 1915 and 2016, I clearly understand why it happens. How can I make so it only fills values between the actual start and end date of the specific timeserie? I want a moving min and max which is dependent on the start date and end date of each specific ID and than fill missing dates between the start and end date of each ID.

library("RpostgreSQL")
library("tidyverse")
library("lubridate")

con <- dbConnect(PostgreSQL(), user = "postgres", 
    dbname="", password = "", host = "localhost", port= "5432")

    out <- dbGetQuery(con, "SELECT * FROM *******.Weekly_series")

    out %>% 
        group_by(ID)%>% 
        mutate(DATE = as.Date(DATE)) %>% 
        complete(DATE = seq(ymd("1915-04-14"), ymd("2016-03-30"), by= "week"))

Ignore errors in the connect line.

Thanks in advance.

Edit1 Sample data

ID  DATE     VALUE
1   2015-10-01  1
1   2015-10-08  1
1   2015-10-15  1
1   2015-10-29  1
2   1956-01-01  1
2   1956-01-15  1
2   1956-01-22  1
3   1982-01-01  1
3   1982-01-15  1
3   1982-01-22  1
3   1982-01-29  1

Excpected output

ID  DATE     VALUE
1   2015-10-01  1
1   2015-10-08  1
1   2015-10-15  1
1   2015-10-22  NA
1   2015-10-29  1
2   1956-01-01  1
2   1956-01-08  NA
2   1956-01-15  1
2   1956-01-22  1
3   1982-01-01  1
3   1982-01-08  NA
3   1982-01-15  1
3   1982-01-22  1
3   1982-01-29  1
Jbt
  • 11
  • 3
  • Trying to duplicate what you have above, but I'm getting error: `Error in PostgreSQL() : could not find function "PostgreSQL"` I have `library(DBI) ` and `library(lubridate)` – markhogue Apr 07 '20 at 14:10
  • 1
    Welcome to StackOverflow! Could you make your example [reproducible](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) to make it easier to help you? (providing sample input with `dput()`, expected output, and code that can be copy-pasted and run as is) – Aurèle Apr 07 '20 at 14:10
  • I guess the question is how to make `tidyr::complete` function, so as @Aurèle suggests, some sample data would suffice – markhogue Apr 07 '20 at 14:13
  • Hi and thanks for quick response! I added some sample data (not the original, just randomized). @Aurèle hope the format of the sample data is ok. The only thing missing from the code is the librarys, i add those. – Jbt Apr 07 '20 at 14:57
  • @markhogue I think you are missing the library RPostgreSQL. – Jbt Apr 07 '20 at 14:58
  • @Jbt Right. But can't be connect anyway as you anticipated. – markhogue Apr 07 '20 at 15:06
  • @markhogue I have removed my information from the con <- line (e.g password). Maybe you could try with the sample data i provided instead? – Jbt Apr 07 '20 at 15:13
  • I inputted the sample data okay, but don't understand what you really want. The `complete` line (something I've never used, btw) doesn't seem to agree with your desired output. This line `length(seq(ymd("1915-04-14"), ymd("2016-03-30"), by= "week"))` gives 5269. Maybe that matches your full data set? – markhogue Apr 07 '20 at 15:25
  • @markhogue I want to fill the missing dates in the time series with NA values, but i only want to fill the series which is between the dates of each specific time serie (based on id). So there is 2000 series and all with different lenght. My code right now provides me NA values across the entire period 1915-2016, but all series doesn't span that long, some are only between 2000-2016 etc, so for these I only wanna fill Na values where there are missing values between 2000-2016. – Jbt Apr 07 '20 at 16:27

1 Answers1

0

Using the data you provided, this works. I don't know why this works and your whole code does not, but possibly in your code, the data structure is not what is needed. If so, something like out <- tibble::as_tibble(out) might work. My other guess is that complete isn't drawing from the package you need. Using tidyr::complete works on the sample.


library(lubridate)
library(dplyr)
library(tidyr)

a <- "ID  DATE     VALUE
1   2015-10-01  1
1   2015-10-08  1
1   2015-10-15  1
1   2015-10-29  1
2   1956-01-01  1
2   1956-01-15  1
2   1956-01-22  1
3   1982-01-01  1
3   1982-01-15  1
3   1982-01-22  1
3   1982-01-29  1"

df <- read.table(text = a, header = TRUE)

big_df1 <- df %>% 
  filter(ID == 1)%>% 
  mutate(DATE = as.Date(DATE)) %>% 
  tidyr::complete(DATE = seq(ymd(min(DATE)), ymd(max(DATE)), by= "week"))

big_df2 <- df %>% 
  filter(ID == 2)%>% 
  mutate(DATE = as.Date(DATE)) %>% 
  tidyr::complete(DATE = seq(ymd(min(DATE)), ymd(max(DATE)), by= "week"))

big_df3 <- df %>% 
  filter(ID == 3)%>% 
  mutate(DATE = as.Date(DATE)) %>% 
  tidyr::complete(DATE = seq(ymd(min(DATE)), ymd(max(DATE)), by= "week"))

big_df <- rbind(big_df1, big_df2, big_df3)
big_df

   DATE          ID VALUE
   <date>     <int> <int>
 1 2015-10-01     1     1
 2 2015-10-08     1     1
 3 2015-10-15     1     1
 4 2015-10-22    NA    NA
 5 2015-10-29     1     1
 6 1956-01-01     2     1
 7 1956-01-08    NA    NA
 8 1956-01-15     2     1
 9 1956-01-22     2     1
10 1982-01-01     3     1
11 1982-01-08    NA    NA
12 1982-01-15     3     1
13 1982-01-22     3     1
14 1982-01-29     3     1
markhogue
  • 1,056
  • 1
  • 6
  • 16
  • Thanks for your time and patience with me. The problem is still there by running this. I don't want to receive dates with NA values outside of the end and start date of each individual ID. For example in my sample data here, For ID 1: I only want the missing date to be filled between 2015-10-01 and 2015-10-29, I don't want additional NA values outside of the start and end date. I need some kind of dynamic min and max value for each id. – Jbt Apr 07 '20 at 17:50
  • @Jbt New edited answer hopefully addresses your comment. If not, we can keep at it :-) – markhogue Apr 07 '20 at 18:58