0

I have data with 6 Variables : EmployeeID, JobID, Name, JobLocation, Date, and HoursWorked. I want to group my data by EmployeeID and JobID (i.e. find all records with the same EmployeeID and JobID in one row), and then find the minimum and maximum date by group, and the sum of all HoursWorked between those dates. I want my data to end up with columns: EmployeeID, JobID, Name JobLocation, MinDate, MaxDate, TotalHoursWorked.

I have tried this so far but the MinDate,MaxDate, and TotalHoursWorked are showing the same date for every record.

Data$EmployeeID<- as.factor(Data$EmployeeID) 
Data$JobID<- as.factor(Data$JobID) 
Data$Date<- as.factor(Data$Date)
Data$Date<- as.Date(Data$Date,format="%m/%d/%Y")
Data$HoursWorked<-as.numeric(Data$HoursWorked)

Data<-Data[c("EmployeeID", "Name","JobID", "JobLocation", "Date", "HoursWorked")]
Data<- Data%>% 
  group_by(Data$EmployeeID,Data$JobID, Data$Name,Data$JobLocation) %>%
  summarize(TotalHoursWorked = sum(Data$HoursWorked)) %>%
  mutate(MaxDate=max(Data$Date), MinDate=min(Data$Date))

Output of sample(Data) without the "Name" column:

> sample(Data)
# A tibble: 1,000 x 5
   EmployeeID HoursWorked JobID           Date       JobLocation
   <fct>            <dbl> <fct>           <date>     <chr>         
 1 32589              4   B3031-002513-00 2016-03-14 #             
 2 32590              8   B3031-002562-00 2016-04-08 #             
 3 32591              9   B3031-002564-00 2016-04-05 #             
 4 32591              2.5 B3031-002564-00 2016-04-06 #             
 5 32591              3   B3031-002562-00 2016-04-07 #             
 6 32591              7.5 B3031-002562-00 2016-04-08 #             
 7 32605              0   B3031-002348-00 2016-01-04 #             
 8 32605              3   B3031-002419-00 2016-01-04 #             
 9 32605              0   B3031-002348-00 2016-01-05 #             
10 32605              3   B3031-002419-00 2016-01-05 #             
# ... with 990 more rows

And output after I run the group_by code:

> sample(Data)
# A tibble: 80 x 6
   MaxDate    `Data$JobID`    MinDate    `Data$\`Job Location\`` TotalHoursWorked `Data$EmployeeID`
   <date>     <fct>           <date>     <chr>                              <dbl> <fct>            
 1 2016-07-29 B3031-002513-00 2016-01-04 #                                  3288. 32589            
 2 2016-07-29 B3031-002562-00 2016-01-04 #                                  3288. 32590            
 3 2016-07-29 B3031-002562-00 2016-01-04 #                                  3288. 32591            
 4 2016-07-29 B3031-002564-00 2016-01-04 #                                  3288. 32591            
 5 2016-07-29 B3031-002348-00 2016-01-04 #                                  3288. 32605            
 6 2016-07-29 B3031-002419-00 2016-01-04 #                                  3288. 32605            
 7 2016-07-29 B3031-002445-00 2016-01-04 #                                  3288. 32605            
 8 2016-07-29 B3031-002502-00 2016-01-04 #                                  3288. 32605            
 9 2016-07-29 B3031-002504-00 2016-01-04 #                                  3288. 32605            
10 2016-07-29 B3031-002505-00 2016-01-04 #                                  3288. 32605            
# ... with 70 more rows
Veronica
  • 145
  • 12
  • 2
    After the pipe `%>%` you don't need `Data$`, remove it and the problem many times goes away. – Rui Barradas Mar 28 '19 at 19:08
  • That worked! However, EmployeeID and JobID are still not being grouped, and the sum is not working. Basically TotalHoursWorked is still showing values for HoursWorked instead of aggregating all HoursWorked between the min and max dates. – Veronica Mar 28 '19 at 19:50
  • I think it is because I am grouping by multiple columns when I really just want to group by EmployeeID and JobID. Is there a way to project all the columns but only group/aggregate by those two? – Veronica Mar 28 '19 at 20:26
  • Can you post sample data? Please edit **the question** with the output of `dput(Data)`. Or, if it is too big with the output of `dput(head(Data, 20))`. – Rui Barradas Mar 28 '19 at 20:30
  • I can't do that as it is private data. Is there something else I can do? – Veronica Mar 28 '19 at 20:34
  • Ok, but maybe you can post data creation code, that more or less gives us an idea of the data. R's RNG functions such as `rnorm` or `sample` are great for that. – Rui Barradas Mar 28 '19 at 20:38
  • @RuiBarradas I updated my question with the output for `sample` – Veronica Mar 28 '19 at 20:51
  • Done, see if the answer does what you want. – Rui Barradas Mar 28 '19 at 21:03

1 Answers1

0

It is in fact very simple, you were using summarise and mutate when you should have only used summarise.

This first intruction is probably not needed, I run it to coerce the Date column after reading as I read in the data below.

Data$Date <- as.Date(Data$Date)

Now the solution.

library(tidyverse)

Data %>%
  group_by(EmployeeID, JobID) %>%
  summarise(TotalHoursWorked = sum(HoursWorked),
            MaxDate = max(Date), MinDate = min(Date))

Data.

Data <- read.table(text = "
EmployeeID HoursWorked JobID           Date       JobLocation
  1 32589              4   B3031-002513-00 2016-03-14 #             
2 32590              8   B3031-002562-00 2016-04-08 #             
3 32591              9   B3031-002564-00 2016-04-05 #             
4 32591              2.5 B3031-002564-00 2016-04-06 #             
5 32591              3   B3031-002562-00 2016-04-07 #             
6 32591              7.5 B3031-002562-00 2016-04-08 #             
7 32605              0   B3031-002348-00 2016-01-04 #             
8 32605              3   B3031-002419-00 2016-01-04 #             
9 32605              0   B3031-002348-00 2016-01-05 #             
10 32605              3   B3031-002419-00 2016-01-05 #   
", header = TRUE, comment.char = "")
Rui Barradas
  • 70,273
  • 8
  • 34
  • 66
  • Thank you! That worked, but how can I project the rest of the columns with the data? JobLocation is not showing up, only EmployeeID,JobID,TotalHoursWorked,MaxDate, and MinDate. – Veronica Mar 28 '19 at 21:10
  • @Veronica That is due to `summarise`, it keeps only the columns the data is grouped by and the columns it computes. See if [this](https://stackoverflow.com/questions/46553514/applying-group-by-and-summarisesum-but-keep-columns-with-non-relevant-conflict) helps. – Rui Barradas Mar 28 '19 at 22:28