0

I have the following data

structure(list(station = c("61WOL2", "61WOL2", "61WOL2", "61WOL2", 
"61WOL2", "61WOL2", "61WOL2", "61WOL2", "61WOL2", "61WOL2", "61WOL2", 
"61WOL2", "61WOL2", "61WOL2", "61WOL2", "61WOL2", "61WOL2", "61WOL2", 
"61WOL2", "61WOL2"), pollutant = c(17201L, 17201L, 17201L, 17201L, 
17201L, 17201L, 17201L, 17201L, 17201L, 17201L, 17201L, 17201L, 
17201L, 17201L, 17201L, 17201L, 17201L, 17201L, 17201L, 17201L
), tag = c("002", "002", "002", "002", "002", "002", "002", "002", 
"002", "002", "002", "002", "002", "002", "002", "002", "002", 
"002", "002", "002"), concentration = c(NA, 0.42, 0.42, 0.42, 
0.42, 0.42, 0.42, 0.72, 0.72, 0.72, 0.72, 0.72, 0.72, 0.72, 0.31, 
0.31, 0.31, 0.31, 0.31, 0.31), date = structure(c(1514764800, 
1514851200, 1514937600, 1515024000, 1515110400, 1515196800, 1515283200, 
1515369600, 1515456000, 1515542400, 1515628800, 1515715200, 1515801600, 
1515888000, 1515974400, 1516060800, 1516147200, 1516233600, 1516320000, 
1516406400), tzone = "UTC", class = c("POSIXct", "POSIXt"))), row.names = c(NA, 
-20L), class = c("tbl_df", "tbl", "data.frame"))

which displays as

# A tibble: 20 x 5
   station pollutant tag   concentration date               
   <chr>       <int> <chr>         <dbl> <dttm>             
 1 61WOL2      17201 002           NA    2018-01-01 00:00:00
 2 61WOL2      17201 002            0.42 2018-01-02 00:00:00
 3 61WOL2      17201 002            0.42 2018-01-03 00:00:00
 4 61WOL2      17201 002            0.42 2018-01-04 00:00:00
 5 61WOL2      17201 002            0.42 2018-01-05 00:00:00
 6 61WOL2      17201 002            0.42 2018-01-06 00:00:00
 7 61WOL2      17201 002            0.42 2018-01-07 00:00:00
 8 61WOL2      17201 002            0.72 2018-01-08 00:00:00
 9 61WOL2      17201 002            0.72 2018-01-09 00:00:00
10 61WOL2      17201 002            0.72 2018-01-10 00:00:00
11 61WOL2      17201 002            0.72 2018-01-11 00:00:00
12 61WOL2      17201 002            0.72 2018-01-12 00:00:00
13 61WOL2      17201 002            0.72 2018-01-13 00:00:00
14 61WOL2      17201 002            0.72 2018-01-14 00:00:00
15 61WOL2      17201 002            0.31 2018-01-15 00:00:00
16 61WOL2      17201 002            0.31 2018-01-16 00:00:00
17 61WOL2      17201 002            0.31 2018-01-17 00:00:00
18 61WOL2      17201 002            0.31 2018-01-18 00:00:00
19 61WOL2      17201 002            0.31 2018-01-19 00:00:00
20 61WOL2      17201 002            0.31 2018-01-20 00:00:00

I would like to transform it as

station pollutant tag   concentration start_date          end_date     
<chr>       <int> <chr>         <dbl> <dttm>              <dttm>
61WOL2      17201 002            0.42 2018-01-02 00:00:00 2018-01-07 00:00:00
61WOL2      17201 002            0.72 2018-01-08 00:00:00 2018-01-14 00:00:00
61WOL2      17201 002            0.31 2018-01-15 00:00:00 2018-01-20 00:00:00

Note that the variables 'station' & 'pollutant' can take several values.

How can I achieve this (with dplyr & pipe operator, for example) ?

Many thanks.

Alessandro
  • 129
  • 1
  • 8

1 Answers1

0

Drop the NA values from concentration and for each unique value of concentration get the minimum and maximum date value.

library(dplyr)
library(data.table)

df %>%
  filter(!is.na(concentration)) %>%
  group_by(station, pollutant, tag, grp = rleid(concentration)) %>%
  summarise(concentration = first(concentration), 
            start_date = min(date), 
            end_date = max(date), .groups = 'drop') %>%
  select(-grp)

# station pollutant tag   concentration start_date          end_date           
#  <chr>       <int> <chr>         <dbl> <dttm>              <dttm>             
#1 61WOL2      17201 002            0.42 2018-01-02 00:00:00 2018-01-07 00:00:00
#2 61WOL2      17201 002            0.72 2018-01-08 00:00:00 2018-01-14 00:00:00
#3 61WOL2      17201 002            0.31 2018-01-15 00:00:00 2018-01-20 00:00:00
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • Wonderful ! Just to be consistent with my question, I added `concentration,` before `grp` to also get the concentration values. If you edit your answer, I will accept it on the site. Thanks. – Alessandro Jul 23 '21 at 09:53
  • Great ! However, when running it on my full dataset, I noticed a case which is not included in the previous reduced dataset. I have updated my question. – Alessandro Jul 23 '21 at 13:01
  • I don't really understand why the 5th row ends on `2018-01-30` when there are rows/dates with the same `concentration` value till `2018-02-04`. Also usually it is better to ask a new question when you have additional constraints or an extension of the original question posted. – Ronak Shah Jul 24 '21 at 01:11
  • My bad ! I have corrected my desired output. Sorry about that. – Alessandro Jul 24 '21 at 08:50
  • Reverted to my initial question and accepted your answer. I will ask a new question with my new requirements. Thanks. – Alessandro Jul 27 '21 at 14:33
  • Just in case you have any clue for this [one](https://stackoverflow.com/questions/68547124/how-to-compact-block-duplicates-of-variables-from-a-date-ordered-dataset-into-a) – Alessandro Jul 28 '21 at 10:14