2

I have a timeseries dataset with 'n' number of columns. In the dataset, I would like to filter and remove the hours for which the value in a column changed within the hour. In other words, I want to keep the hours that has unchanged value.

Some info about the data:

  • 1-minute data with date in POSCIXT format (%Y-%m-%d %H:%M:%S)

  • The data is recorded for every 5 minutes, so for the time, for which the data is not recorded is filled with NA

    data1<- structure(list(PDT = structure(c(1668927600, 1668927660, 
       1668927720, 
      1668927780, 1668927840, 1668927900, 1668927960, 1668928020, 1668928080, 
      1668928140, 1668928200, 1668928260, 1668928320, 1668928380, 1668928440, 
      1668928500, 1668928560, 1668928620, 1668928680, 1668928740, 1668928800, 
      1668928860, 1668928920, 1668928980, 1668929040, 1668929100, 1668929160, 
      1668929220, 1668929280, 1668929340, 1668929400, 1668929460, 1668929520, 
      1668929580, 1668929640, 1668929700, 1668929760, 1668929820, 1668929880, 
      1668929940, 1668930000, 1668930060, 1668930120, 1668930180, 1668930240, 
      1668930300, 1668932640, 1668932700, 1668932760, 1668932820, 1668932880, 
      1668932940, 1668933000, 1668933060, 1668933120, 1668933180, 1668933240, 
      1668933300, 1668933360, 1668933720, 1668933780, 1668933840, 1668933900, 
      1668933960, 1668934020, 1668934080, 1668934140, 1668934200, 1668934260, 
      1668934320, 1668934380, 1668934440, 1668934500, 1668934560, 1668934620, 
      1668934680, 1668934740, 1668934800), class = c("POSIXct", "POSIXt"
      ), tzone = "UTC"), Date = structure(c(1668902400, 1668902400, 
      1668902400, 1668902400, 1668902400, 1668902400, 1668902400, 1668902400, 
      1668902400, 1668902400, 1668902400, 1668902400, 1668902400, 1668902400, 
      1668902400, 1668902400, 1668902400, 1668902400, 1668902400, 1668902400, 
      1668902400, 1668902400, 1668902400, 1668902400, 1668902400, 1668902400, 
      1668902400, 1668902400, 1668902400, 1668902400, 1668902400, 1668902400, 
      1668902400, 1668902400, 1668902400, 1668902400, 1668902400, 1668902400, 
      1668902400, 1668902400, 1668902400, 1668902400, 1668902400, 1668902400, 
      1668902400, 1668902400, 1668902400, 1668902400, 1668902400, 1668902400, 
      1668902400, 1668902400, 1668902400, 1668902400, 1668902400, 1668902400, 
      1668902400, 1668902400, 1668902400, 1668902400, 1668902400, 1668902400, 
      1668902400, 1668902400, 1668902400, 1668902400, 1668902400, 1668902400, 
      1668902400, 1668902400, 1668902400, 1668902400, 1668902400, 1668902400, 
      1668902400, 1668902400, 1668902400, 1668902400), class = c("POSIXct", 
      "POSIXt"), tzone = "UTC"), DayOfWeek = c("Sunday", "Sunday", 
      "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", 
      "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", 
      "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", 
      "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", 
      "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", 
      "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", 
      "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", 
      "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", 
      "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", 
      "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", 
      "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday"), 
          Month = c(11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 
          11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 
          11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 
          11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 
          11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 
          11, 11, 11, 11, 11, 11), Day = c(20, 20, 20, 20, 20, 20, 
          20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 
          20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 
          20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 
          20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 
          20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20), Hour_hr = c(7, 
          7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 
          7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 
          7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 
          8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 
          9), Minute = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 
          13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 
          28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 
          43, 44, 45, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 
          36, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 
          56, 57, 58, 59, 0), ColA = c(NA, NA, NA, 65.5, NA, NA, NA, 
          NA, 65.5, NA, NA, NA, NA, 65.5, NA, NA, NA, NA, 65.5, NA, 
          NA, NA, NA, 65.5, NA, NA, NA, NA, 65.5, NA, NA, NA, NA, 65.5, 
          NA, NA, NA, NA, 65.5, NA, NA, NA, NA, 65.5, NA, NA, NA, NA, 
          NA, NA, 65.5, NA, NA, NA, NA, 65.5, NA, NA, NA, NA, 65.5, 
          NA, NA, NA, NA, 65.5, NA, NA, NA, NA, 65.5, NA, NA, NA, NA, 
          65.7, NA, NA)), row.names = 3041:3118, class = "data.frame")
    

Expected output:

In the above example, I want to exclude hour 8 from my dataset, as the value in ColA is not constant.

I have a feeling that group_by() and filter() from dplyr might do the job, but I am not sure about the function to find the unchanged values within an hour.

Any help regarding this is much appreciated. Thanks.

Juan C
  • 5,846
  • 2
  • 17
  • 51
Karthik
  • 117
  • 7
  • Can you add your expected output? It's not entirely clear to me – Juan C May 09 '23 at 16:18
  • Hi @JuanC, In the above example, I don't want to exclude hour 0 from the time series data, as the value in ColA changed from 66 to 67. If the value in ColA remains the same for hour 0, I would like to keep the hour in my time series data. – Karthik May 09 '23 at 16:24
  • Could you please use `dput()` to include some sample data and make this question [reproducible](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) for others? – margusl May 09 '23 at 16:30
  • hey @Karthik, you told me you don't want to exclude (you want to keep) data when the value changes and you want to keep it when it doesn't change. Which of both are you looking for? Do you want to keep the whole series when one of the options happen? That's why we use reproducible examples, because you can leave a lot to the imagination – Juan C May 09 '23 at 16:43
  • Hi @JuanC, thanks. I have added a reproducible example. Yes, I want to keep the whole time series data and just want to exclude the hours for which the value is not constant throughout the hour. Thanks – Karthik May 09 '23 at 16:46

1 Answers1

1

This does it:

data1 %>% group_by(Hour_hr)  %>% filter(n_distinct(ColA) < 3)

Checking results:

count(data1, Hour_hr)

  Hour_hr     n
    <dbl> <int>
1       7    46
2       9     1

This will keep colA if there's only one numerical value or no numerical values (NA), keeping hour 7 and 9.

Equivalently you could do:

data1 %>% group_by(Hour_hr)  %>% filter(n_distinct(ColA, na.rm = T) < 2)
Juan C
  • 5,846
  • 2
  • 17
  • 51
  • 1
    Thanks :) Really appreciate your prompt response. It worked – Karthik May 09 '23 at 16:54
  • Nice, I'm glad it did! – Juan C May 09 '23 at 16:56
  • do the numbers 3, and 2 in your answer correspond to the no.of hours? – Karthik May 09 '23 at 17:13
  • 1
    Nope, it's the number of possible different values of ColA within an hour. So in the first answer it accepts NA and one extra value, in the second it ignores NAs, so it accepts only one unique value for ColA. I'm not using `n_distinct(ColA, na.rm = T) == 1` because that leaves behind the hour 9, because there's no values – Juan C May 09 '23 at 18:27