1

I am working on an application that monitors worker productivity in R. For each worker I have the jobs they completed and I want to see how many jobs they completed each week of the year ( so weeks 1 - 52). The issue is that some workers don't have any work logged for certain weeks (e.g. Worker 1 might have worked weeks 1 - 48, but not 49 - 52).

The source data is tables for each job top where the observation is the worker, formatted like so:

UniqueID  Date     mapper
10000001 3/3/2015  person1

Starting with this I then break the date up into year, month, day fields and then use lubridate to get the weeks. This is applied to each source data file (of which there are 10).

I then group each table by the user and week and count the number of jobs each worker completed, see code here:


library(tidyverse)
library(readxl)
library(rlang)
library(pryr)
library(lubridate)
library(extrafont)
loadfonts(device = "win")


current_week <- week(Sys.Date())
feature_sheets <- excel_sheets("./typesMerged.xlsx")
list_sheets <- lapply(feature_sheets, function(x) read_excel(path = "./typesMerged.xlsx",col_types = c("numeric", "date", "text"), sheet = x))

users = list('person1', 'person2', 'person3', 'person4', 'person5', 'person6')


createTibble <- function(i, yr, wk){
  is_missing <- missing(wk)
  list_sheets[[i]]%>%
    filter(LASTUSER %in% users ) %>% 
    mutate(year = as.numeric(format(DATEMODIFI, format = "%Y")),
           month = as.numeric(format(DATEMODIFI, format = "%m")),
           week = week(DATEMODIFI),
           day = as.numeric(format(DATEMODIFI, format = "%d")))%>% 
    select(-DATEMODIFI) %>%
    filter(year == yr) %>%
    filter(if(is_missing) TRUE else week == wk) 
}




tableNames <- c('dpd', 'fuse', 'ohprimary', 'ohsecondary', 'poles', 'pv', 'switch','transformers', 'ugprimary', 'ugsecondary' )
features <- vector('list', length(tableNames))


for (feature in seq_along(features)){

  features[[feature]] <- createTibble(i = feature, yr = 2019)


}

# use for iteratively creating tibbles/df if needed
# for (feature in seq_along(features)){
#   assign(features[feature], value = createTibble(feature))
# }


countFeatures <- function(x){
  x %>%
    group_by(LASTUSER, week) %>%
    summarize(n = n())
}


featureCountTibbles <-map(.x = features,.f = countFeatures)

featureCountsByUser <- featureCountTibbles %>% reduce(full_join, by = c("LASTUSER", "week")) %>% 
  rename(mapper = LASTUSER) %>% 
  rename(dpd = n.x) %>%
  rename(fuse = n.y) %>% 
  rename(ohprimary = n.x.x) %>% 
  rename(ohsecondary = n.y.y) %>% 
  rename(poles = n.x.x.x) %>% 
  rename(pv = n.y.y.y) %>% 
  rename(switch = n.x.x.x.x) %>% 
  rename(transformers = n.y.y.y.y) %>% 
  rename(ugprimary = n.x.x.x.x.x) %>% 
  rename(ugsecondary = n.y.y.y.y.y) %>% 
  replace_na(replace = list(dpd = 0, fuse = 0, ohprimary = 0, ohsecondary = 0, poles = 0, pv = 0, switch = 0, transformers = 0, ugprimary = 0, ugsecondary = 0))

this results in a table that looks like this, the problem here is that person1 did no work in week 5 so he doesn't have a row for it:

mapper  week  dpd  fuse  etc.
person1  1    10    50   ...
person1  2     0    50   ...
person1  3    10     0   ...
person1  4    10    50   ...
person1  6    10    50   ...
person2  1    10    50   ...
person2  2    50    50   ...
person2  3    10     0   ...
person2  4    10    50   ...
person2  5    10    50   ***
person2  6    10    50   ...

This is how I want it to look where weeks of no work are seen:

mapper  week  dpd  fuse  etc.
person1  1    10    50   ...
person1  2     0    50   ...
person1  3    10     0   ...
person1  4    10    50   ...
person1  5     0     0   ***
person1  6    10    50   ...
person2  1    10    50   ...
person2  2    50    50   ...
person2  3    10     0   ...
person2  4    10    50   ...
person2  5    10    50   ...
person2  6    10    50   ...

How can I make it so these empty rows will be added? It sounds simple, but I can't really think of a way to do it in R (still kind of new to it). If this was SQL I would add a new field week at the beginning, populate it with 1 - 52 (which would add in all the rows I need) and then update the table with the actual data I have for the weeks.

Solution Implemented:

featureCountsByUser <- featureCountTibbles %>% reduce(full_join, by = c("LASTUSER", "week")) %>%
  rename(mapper = LASTUSER) %>%
  rename(dpd = n.x) %>%
  rename(fuse = n.y) %>% 
  rename(ohprimary = n.x.x) %>% 
  rename(ohsecondary = n.y.y) %>% 
  rename(poles = n.x.x.x) %>% 
  rename(pv = n.y.y.y) %>% 
  rename(switch = n.x.x.x.x) %>% 
  rename(transformers = n.y.y.y.y) %>% 
  rename(ugprimary = n.x.x.x.x.x) %>% 
  rename(ugsecondary = n.y.y.y.y.y) %>% 
  ungroup() %>% 
  complete(mapper,week) %>% 
  group_by(mapper,week) %>% 
  replace_na(replace = list(dpd = 0, fuse = 0, ohprimary = 0, ohsecondary = 0, poles = 0, pv = 0, switch = 0, transformers = 0, ugprimary = 0, ugsecondary = 0)) 

complete() was the answer; but contrary to the docs it did not respect my grouping and due to the nature of how I am counting jobs done I could not complete() on the raw data so this is my work around. Thanks all!

Karl Johnson
  • 111
  • 1
  • 7
  • 1
    Create a tibble with all the columns I need and rows blank, then feed in the results of this script into it? – Karl Johnson Dec 31 '19 at 16:28
  • 4
    It sound like you may want `tidyr::complete()`. You can see one example of its use [here](https://stackoverflow.com/a/37728065/2461552). If grouping at some point, the new `.drop = FALSE` feature may also help, shown in another answer at the same link. – aosmith Dec 31 '19 at 16:32

1 Answers1

1

Using the results data you posted that isn't what you want:

library(tidyverse)
x <- c("Worker  Week  dpd  fuse ", 
  "person1  1    10    5   ",
  "person1  2     0    5   ",
  "person1  3    10        ",
  "person1  4    10    5   ",
  "person1  6    10    5   ",
  "person2  1    10    5   ",
  "person2  2    50    5   ",
  "person2  3    10        ",
  "person2  4    10    5   ",
  "person2  5    10    5   ",
  "person2  6    10    5   ") %>%
  read_table()


x %>% complete(Worker, Week)

Should give:

# A tibble: 12 x 4
   Worker   Week   dpd  fuse
   <chr>   <dbl> <dbl> <dbl>
 1 person1     1    10     5
 2 person1     2     0     5
 3 person1     3    10    NA
 4 person1     4    10     5
 5 person1     5    NA    NA
 6 person1     6    10     5
 7 person2     1    10     5
 8 person2     2    50     5
 9 person2     3    10    NA
10 person2     4    10     5
11 person2     5    10     5
12 person2     6    10     5

complete() has a options for filling in missing data, link to reference above by @aosmith. Filling NA with 0 shouldn't be a problem.

mrhellmann
  • 5,069
  • 11
  • 38
  • 1
    Thanks, I couldn't get it to work exactly how you outlined, but it lead me to the solution (see update in original post). For whatever reason complete was not respecting my group_by() and I could not apply ```complete()``` on the raw data as the counts for completed jobs depends on the original data being as is (i.e. if I add weeks that aren't there it'll make it look like they completed jobs when they did not). – Karl Johnson Dec 31 '19 at 17:42