Filling in non-existing rows in R + dplyr

Question

Apologies if this is a duplicate question, I saw some questions which were similar to mine, but none exactly addressing my problem.

My data look basically like this:

FiscalWeek <- as.factor(c(45, 46, 48, 48, 48))
Group <- c("A", "A", "A", "B", "C")
Amount <- c(1, 1, 1, 5, 6)
df <- tibble(FiscalWeek, Group, Amount)
df
# A tibble: 5 x 3
  FiscalWeek Group Amount
  <fct>      <chr>  <dbl>
1 45         A          1
2 46         A          1
3 48         A          1
4 48         B          5
5 48         C          6

Note that FiscalWeek is a factor. So, when I take a weekly average by Group, I get this:

library(dplyr)
averages <- df %>%
  group_by(Group) %>%
  summarize(Avgs = mean(Amount))
averages
# A tibble: 3 x 2
  Group  Avgs
  <chr> <dbl>
1 A         1
2 B         5
3 C         6

But, this is actually a four-week period. Nothing at all happened in Week 47, and groups B and C didn't show data in weeks 45 and 46, but I still want averages that reflect the existence of those weeks. So I need to fill out my original data with zeroes such that this is my desired result:

DesiredGroup <- c("A", "B", "C")
DesiredAvgs <- c(0.75, 1.25, 1.5)
Desired <- tibble(DesiredGroup, DesiredAvgs)
Desired
# A tibble: 3 x 2
  DesiredGroup DesiredAvgs
  <chr>              <dbl>
1 A                   0.75
2 B                   1.25
3 C                   1.5

What is the best way to do this using dplyr?

[Proper idiom for adding zero count rows in tidyr/dplyr](https://stackoverflow.com/questions/25956178/proper-idiom-for-adding-zero-count-rows-in-tidyr-dplyr) — Henrik, Jul 08 '20 at 16:44
... the link posted by @Henrik contains the solution to your problem via tidyr::complete. However, to make complete work you have to set the levels of your factor week to include all desired weeks, e.g. `FiscalWeek <- factor(c(45, 46, 48, 48, 48), levels = 45:48)` — stefan, Jul 08 '20 at 16:54

score 1 · Answer 1 · answered Jul 08 '20 at 16:53

You can try this. I hope this helps.

library(dplyr)
#Define range
df %>% mutate(FiscalWeek=as.numeric(as.character(FiscalWeek))) -> df
range <- length(seq(min(df$FiscalWeek),max(df$FiscalWeek),by=1))
#Aggregation
averages <- df %>%
  group_by(Group) %>%
  summarize(Avgs = sum(Amount)/range)

# A tibble: 3 x 2
  Group  Avgs
  <chr> <dbl>
1 A      0.75
2 B      1.25
3 C      1.5

r2evans · Accepted Answer · 2020-07-08T18:49:50.003

Up front: missing data to me is very different from 0. I'm assuming that you "know" with certainty that missing data should bring all other values down.

The name FiscalWeek suggests that it is an integer-like data, but your use of factor suggests ordinal or categorical. Because of that, you need to define authoritatively what the complete set of factors can be. And because your current factor does not contain all possible levels, I'll infer them (you need to adjust your all_groups_weeks accordingly:

all_groups_weeks <- tidyr::expand_grid(FiscalWeek = as.factor(45:48), Group = c("A", "B", "C"))
all_groups_weeks
# # A tibble: 12 x 2
#    FiscalWeek Group
#    <fct>      <chr>
#  1 45         A    
#  2 45         B    
#  3 45         C    
#  4 46         A    
#  5 46         B    
#  6 46         C    
#  7 47         A    
#  8 47         B    
#  9 47         C    
# 10 48         A    
# 11 48         B    
# 12 48         C

From here, join in the full data in order to "complete" it. Using tidyr::complete won't work because you don't have all possible values in the data (47 missing).

full_join(df, all_groups_weeks, by = c("FiscalWeek", "Group")) %>%
  mutate(Amount = coalesce(Amount, 0))
# # A tibble: 12 x 3
#    FiscalWeek Group Amount
#    <fct>      <chr>  <dbl>
#  1 45         A          1
#  2 46         A          1
#  3 48         A          1
#  4 48         B          5
#  5 48         C          6
#  6 45         B          0
#  7 45         C          0
#  8 46         B          0
#  9 46         C          0
# 10 47         A          0
# 11 47         B          0
# 12 47         C          0

full_join(df, all_groups_weeks, by = c("FiscalWeek", "Group")) %>%
  mutate(Amount = coalesce(Amount, 0)) %>%
  group_by(Group) %>%
  summarize(Avgs = mean(Amount, na.rm = TRUE))
# # A tibble: 3 x 2
#   Group  Avgs
#   <chr> <dbl>
# 1 A      0.75
# 2 B      1.25
# 3 C      1.5

I appreciate everyone else's input, but this is the answer which worked best with my IRL data. — mmyoung77, Jul 08 '20 at 17:56
I should add that I DO know the difference between missing data and zero data, but the client in this case wants the missing data to be treated as zero. — mmyoung77, Jul 09 '20 at 17:42
No worries. In my experience, many clients (and perhaps not-forward-thinking analysts or programmers) don't understand or downplay the premise of replacing empty data with 0. There are times when doing so is correct or at least acceptable (and I do have clients who tell me that the reason it is appropriate is because they say so). — r2evans, Jul 09 '20 at 21:26

score 0 · Answer 3 · answered Jul 08 '20 at 16:44

0

You can do it without filling if you know number of weeks:

df %>%
  group_by(Group) %>%
  summarise(Avgs = sum(Amount) / length(45:48))

answered Jul 08 '20 at 16:44

det

5,013
1
8
16

My actual dataset extends over multiple years (with some weeks missing in each year). So `length(45:48)` does not help. – mmyoung77 Jul 08 '20 at 16:46
Yeah point is if you know number of weeks (and that number is same for all groups) you don't need to fill the data for calculating mean, otherwise check Henrik comment – det Jul 08 '20 at 16:50

Filling in non-existing rows in R + dplyr

3 Answers3