1

I have the dataset as follows, which group by ID:

ID, Activity, Duration
1, Reading,     20
1, Work,        40
1, Reading,     30
2, Home,        50
2, Writing,     30
2, Reading,     20
2, Writing,     30

And I want to get another column that tells us the activity with the highest duration, so person 1 should be reading because it takes 50 minutes, and person 2 should be writing because it takes 60 minutes. Below is an example of the desired output.

ID, Activity, Duration, Max_Actitvity
    1, Reading, 20, Reading
    1, Work,    40, Reading
    1, Reading, 30, Reading
    2, Home,    50, Writing
    2, Writing, 30, Writing
    2, Reading, 20, Writing
    2, Writing, 30, Writing
James Z
  • 12,209
  • 10
  • 24
  • 44
ail
  • 149
  • 7

1 Answers1

3

You can use the following code:

df <- read.table(text = "ID, Activity, Duration
1, Reading, 20
1, Work, 40
1, Reading, 30
2, Home, 50
2, Writing, 30
2, Reading, 20
2, Writing, 30", header = TRUE, sep = ",")

library(dplyr)
df %>%
  group_by(ID, Activity) %>%
  mutate(sum_Activity = sum(Duration)) %>%
  group_by(ID) %>%
  mutate(Max_Activity = Activity[which.max(sum_Activity)]) %>%
  select(-sum_Activity) %>%
  ungroup() 
#> # A tibble: 7 × 4
#>      ID Activity   Duration Max_Activity
#>   <int> <chr>         <int> <chr>       
#> 1     1 " Reading"       20 " Reading"  
#> 2     1 " Work"          40 " Reading"  
#> 3     1 " Reading"       30 " Reading"  
#> 4     2 " Home"          50 " Writing"  
#> 5     2 " Writing"       30 " Writing"  
#> 6     2 " Reading"       20 " Writing"  
#> 7     2 " Writing"       30 " Writing"

Created on 2022-07-21 by the reprex package (v2.0.1)

Quinten
  • 35,235
  • 5
  • 20
  • 53
  • Thanks a lot. I am not sure why. When I tried the code sum_Activity add all duration, so it equals to 220, and max activity is reading. Any idea? – ail Jul 21 '22 at 08:53
  • 1
    Hi @ail, No problem! I am not completely sure what you mean. But I think you mean that you sum all the Duration values without grouping and then the max value is 220 and it selects Reading because it is the first value in the column I think. So the grouping part is important. Right? – Quinten Jul 21 '22 at 08:59
  • Thanks. I mean while I am using the code you gave to me, it did not calculate the sum _Activity for selected group, it gives a sum for all which is equal to 220. – ail Jul 21 '22 at 09:02
  • 2
    That happens when you don't use the first group_by. If you remove that, it will return 220 because then it just sums up the Duration column. So group_by is necessary. – Quinten Jul 21 '22 at 09:05
  • Thanks. I have already used group by and it did not work as well. – ail Jul 21 '22 at 09:06
  • 1
    Maybe you have installed the package `plyr` which will give you problems when using `dplyr`. Check this post: [dplyr issues when using group_by](https://stackoverflow.com/questions/21653295/dplyr-issues-when-using-group-bymultiple-variables). – Quinten Jul 21 '22 at 09:09
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/246640/discussion-between-ail-and-quinten). – ail Jul 21 '22 at 09:20