R: Group_by function is not aggregating data correctly with sum & max/min date

Question

I have data with 6 Variables : EmployeeID, JobID, Name, JobLocation, Date, and HoursWorked. I want to group my data by EmployeeID and JobID (i.e. find all records with the same EmployeeID and JobID in one row), and then find the minimum and maximum date by group, and the sum of all HoursWorked between those dates. I want my data to end up with columns: EmployeeID, JobID, Name JobLocation, MinDate, MaxDate, TotalHoursWorked.

I have tried this so far but the MinDate,MaxDate, and TotalHoursWorked are showing the same date for every record.

Data$EmployeeID<- as.factor(Data$EmployeeID) 
Data$JobID<- as.factor(Data$JobID) 
Data$Date<- as.factor(Data$Date)
Data$Date<- as.Date(Data$Date,format="%m/%d/%Y")
Data$HoursWorked<-as.numeric(Data$HoursWorked)

Data<-Data[c("EmployeeID", "Name","JobID", "JobLocation", "Date", "HoursWorked")]
Data<- Data%>% 
  group_by(Data$EmployeeID,Data$JobID, Data$Name,Data$JobLocation) %>%
  summarize(TotalHoursWorked = sum(Data$HoursWorked)) %>%
  mutate(MaxDate=max(Data$Date), MinDate=min(Data$Date))

Output of sample(Data) without the "Name" column:

> sample(Data)
# A tibble: 1,000 x 5
   EmployeeID HoursWorked JobID           Date       JobLocation
   <fct>            <dbl> <fct>           <date>     <chr>         
 1 32589              4   B3031-002513-00 2016-03-14 #             
 2 32590              8   B3031-002562-00 2016-04-08 #             
 3 32591              9   B3031-002564-00 2016-04-05 #             
 4 32591              2.5 B3031-002564-00 2016-04-06 #             
 5 32591              3   B3031-002562-00 2016-04-07 #             
 6 32591              7.5 B3031-002562-00 2016-04-08 #             
 7 32605              0   B3031-002348-00 2016-01-04 #             
 8 32605              3   B3031-002419-00 2016-01-04 #             
 9 32605              0   B3031-002348-00 2016-01-05 #             
10 32605              3   B3031-002419-00 2016-01-05 #             
# ... with 990 more rows

And output after I run the group_by code:

> sample(Data)
# A tibble: 80 x 6
   MaxDate    `Data$JobID`    MinDate    `Data$\`Job Location\`` TotalHoursWorked `Data$EmployeeID`
   <date>     <fct>           <date>     <chr>                              <dbl> <fct>            
 1 2016-07-29 B3031-002513-00 2016-01-04 #                                  3288. 32589            
 2 2016-07-29 B3031-002562-00 2016-01-04 #                                  3288. 32590            
 3 2016-07-29 B3031-002562-00 2016-01-04 #                                  3288. 32591            
 4 2016-07-29 B3031-002564-00 2016-01-04 #                                  3288. 32591            
 5 2016-07-29 B3031-002348-00 2016-01-04 #                                  3288. 32605            
 6 2016-07-29 B3031-002419-00 2016-01-04 #                                  3288. 32605            
 7 2016-07-29 B3031-002445-00 2016-01-04 #                                  3288. 32605            
 8 2016-07-29 B3031-002502-00 2016-01-04 #                                  3288. 32605            
 9 2016-07-29 B3031-002504-00 2016-01-04 #                                  3288. 32605            
10 2016-07-29 B3031-002505-00 2016-01-04 #                                  3288. 32605            
# ... with 70 more rows

After the pipe `%>%` you don't need `Data$`, remove it and the problem many times goes away. — Rui Barradas, Mar 28 '19 at 19:08
That worked! However, EmployeeID and JobID are still not being grouped, and the sum is not working. Basically TotalHoursWorked is still showing values for HoursWorked instead of aggregating all HoursWorked between the min and max dates. — Veronica, Mar 28 '19 at 19:50
I think it is because I am grouping by multiple columns when I really just want to group by EmployeeID and JobID. Is there a way to project all the columns but only group/aggregate by those two? — Veronica, Mar 28 '19 at 20:26
Can you post sample data? Please edit **the question** with the output of `dput(Data)`. Or, if it is too big with the output of `dput(head(Data, 20))`. — Rui Barradas, Mar 28 '19 at 20:30
I can't do that as it is private data. Is there something else I can do? — Veronica, Mar 28 '19 at 20:34
Ok, but maybe you can post data creation code, that more or less gives us an idea of the data. R's RNG functions such as `rnorm` or `sample` are great for that. — Rui Barradas, Mar 28 '19 at 20:38
@RuiBarradas I updated my question with the output for `sample` — Veronica, Mar 28 '19 at 20:51

score 0 · Accepted Answer · answered Mar 28 '19 at 21:03

It is in fact very simple, you were using summarise and mutate when you should have only used summarise.

This first intruction is probably not needed, I run it to coerce the Date column after reading as I read in the data below.

Data$Date <- as.Date(Data$Date)

Now the solution.

library(tidyverse)

Data %>%
  group_by(EmployeeID, JobID) %>%
  summarise(TotalHoursWorked = sum(HoursWorked),
            MaxDate = max(Date), MinDate = min(Date))

Data.

Data <- read.table(text = "
EmployeeID HoursWorked JobID           Date       JobLocation
  1 32589              4   B3031-002513-00 2016-03-14 #             
2 32590              8   B3031-002562-00 2016-04-08 #             
3 32591              9   B3031-002564-00 2016-04-05 #             
4 32591              2.5 B3031-002564-00 2016-04-06 #             
5 32591              3   B3031-002562-00 2016-04-07 #             
6 32591              7.5 B3031-002562-00 2016-04-08 #             
7 32605              0   B3031-002348-00 2016-01-04 #             
8 32605              3   B3031-002419-00 2016-01-04 #             
9 32605              0   B3031-002348-00 2016-01-05 #             
10 32605              3   B3031-002419-00 2016-01-05 #   
", header = TRUE, comment.char = "")

Thank you! That worked, but how can I project the rest of the columns with the data? JobLocation is not showing up, only EmployeeID,JobID,TotalHoursWorked,MaxDate, and MinDate. — Veronica, Mar 28 '19 at 21:10
@Veronica That is due to `summarise`, it keeps only the columns the data is grouped by and the columns it computes. See if [this](https://stackoverflow.com/questions/46553514/applying-group-by-and-summarisesum-but-keep-columns-with-non-relevant-conflict) helps. — Rui Barradas, Mar 28 '19 at 22:28

R: Group_by function is not aggregating data correctly with sum & max/min date

1 Answers1