I have a dataframe like the one below. I want to combine the duplicates in the 'activity' column, except for the rows called 'selection', and sum their values in the 'duration' column. I am working in R. I tried aggregate(), but I couldn't find a way to leave the 'selection' rows unaggregated.

 # df - I used dput so you can have my df
test <- structure(list(activity = c("selection", "selection", "selection", 
"other", "inspection", "assignment", "inspection", "inspection", 
"inspection", "inspection"), workers = c("worker 1", "worker 1", 
"worker 1", "worker 34", "worker 6", "worker 5", "worker 2", 
"worker 2", "worker 2", "worker 2"), start_time = structure(c(1645396200, 
1645396200, 1645396200, 1645394352, 1645394155, 1645394100, 1645390080, 
1645476480, 1645562880, 1645649280), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), status = c("passed", "passed", "passed", "passed", 
"passed", "passed", "passed", "passed", "passed", "passed"), 
    duration = c(8.98333333333333, 9.69027777777778, 9.20555555555556, 
    0.557222222222222, 2.24527777777778, 1.61666666666667, 2.12166666666667, 
    1.32638888888889, 2.59861111111111, 0.765555555555556)), row.names = c(NA, 
-10L), class = c("tbl_df", "tbl", "data.frame"))
  
test 
# A tibble: 10 x 5
       activity   workers   start_time          status duration
       <chr>      <chr>     <dttm>              <chr>     <dbl>
     1 selection  worker 1  2022-02-20 22:30:00 passed    8.98 
     2 selection  worker 1  2022-02-20 22:30:00 passed    9.69 
     3 selection  worker 1  2022-02-20 22:30:00 passed    9.21 
     4 other      worker 34 2022-02-20 21:59:12 passed    0.557
     5 inspection worker 6  2022-02-20 21:55:55 passed    2.25 
     6 assignment worker 5  2022-02-20 21:55:00 passed    1.62 
     7 inspection worker 2  2022-02-20 20:48:00 passed    2.12 
     8 inspection worker 2  2022-02-21 20:48:00 passed    1.33 
     9 inspection worker 2  2022-02-22 20:48:00 passed    2.60 
    10 inspection worker 2  2022-02-23 20:48:00 passed    0.766
  • It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. Please don't post data as images. We don't want to have to retype everything just to try out possible solutions. Please share your code attempt as well. Exactly what duplicates would be eliminated here? It seems all the duration values are unique. Please give the expected output for this sample input. – MrFlick Feb 25 '22 at 19:48
  • @MrFlick Thanks for the quick reply. The df is just a small part of my real df. So, I have to combine all the duplicates (and sum the duration) except for the ones called 'selection'. The selection duplicates should remain the same. Thanks for the 'reproducible example' link. It helped me a lot – MVAC Feb 25 '22 at 20:34

1 Answer

I'm not sure I fully understand what you are looking for, but I'll give it a try!

So, using the dplyr library, you could do:

Reprex

  • Code
library(dplyr)

test %>% 
  filter(activity != "selection") %>% 
  group_by(activity) %>% 
  summarise(workers = workers[1],
            start_time = start_time[1],
            status = status[1],
            duration = sum(duration)) %>% 
  bind_rows(test %>% filter(activity == "selection"))
  • Output
#> # A tibble: 6 x 5
#>   activity   workers   start_time          status duration
#>   <chr>      <chr>     <dttm>              <chr>     <dbl>
#> 1 assignment worker 5  2022-02-20 21:55:00 passed    1.62 
#> 2 inspection worker 6  2022-02-20 21:55:55 passed    9.06 
#> 3 other      worker 34 2022-02-20 21:59:12 passed    0.557
#> 4 selection  worker 1  2022-02-20 22:30:00 passed    8.98 
#> 5 selection  worker 1  2022-02-20 22:30:00 passed    9.69 
#> 6 selection  worker 1  2022-02-20 22:30:00 passed    9.21

Created on 2022-02-25 by the reprex package (v2.0.1)
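
Since the question mentions aggregate(), here is a minimal base R sketch of the same idea: sum the non-'selection' rows per activity, then append the 'selection' rows unchanged. The small data frame is an illustrative stand-in for `test` (its values are made up), and `is_sel`, `summed` and `result` are hypothetical names. Unlike the dplyr version above, this sketch keeps only the 'activity' and 'duration' columns.

```r
# Illustrative stand-in for the question's `test` data (values are made up)
test <- data.frame(
  activity = c("selection", "selection", "other", "inspection", "inspection"),
  workers  = c("worker 1", "worker 1", "worker 34", "worker 6", "worker 2"),
  duration = c(9, 10, 0.5, 2, 3),
  stringsAsFactors = FALSE
)

# Flag the rows that must be left untouched
is_sel <- test$activity == "selection"

# Sum duration per activity, using only the non-selection rows
summed <- aggregate(duration ~ activity, data = test[!is_sel, ], FUN = sum)

# Append the selection rows unchanged (only the shared columns are kept)
result <- rbind(summed, test[is_sel, c("activity", "duration")])
result
```

If the other columns (workers, start_time, status) are needed, they would have to be carried along separately, for example by taking the first value per group as the dplyr code does.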