1

I have a data frame like below. Note that on day b, there is no entry for product 1. This indicates product 1 wasn't sold that day, but I want a row which makes this clear. That is, I want to add a row where day=b, product=1, and sales=0. I want to do this for each day-product pair that doesn't exist in the data frame, e.g. day c and product 3. How can I accomplish this?

  df <- data.frame(day=c(rep('a',3), rep('b',2), rep('c',2)), 
                   product = c(1:3, 2:3,1:2), 
                   sales = runif(7))
Kyle Weise
  • 869
  • 1
  • 8
  • 29
IceCreamToucan
  • 28,083
  • 2
  • 22
  • 38

2 Answers2

2

Option 1

Thanks to @Frank for the better solution, using tidyr:

library(tidyr)
complete(df, day, product, fill = list(sales = 0))

Using this approach, you no longer need to worry about selecting product names, etc.

Which gives you:

  day product      sales
1   a       1 0.52042809
2   b       1 0.00000000
3   c       1 0.46373882
4   a       2 0.11155348
5   b       2 0.04937618
6   c       2 0.26433153
7   a       3 0.69100939
8   b       3 0.90596172
9   c       3 0.00000000

Option 2

You can do this using the tidyr package (and dplyr)

df %>% 
  spread(product, sales, fill = 0) %>% 
  gather(`1`:`3`, key = "product", value = "sales")

Which gives the same result

This works by using spread to create a wide data frame, with each product as its own column. The argument fill = 0 will cause all empty cells to be filled with a 0 (the default is NA).

Next, gather works to convert the 'wide' data frame back into the original 'long' data frame. The first argument is the columns of the products (in this case '1':'3'). We then set the key and value to the original column names.

I would suggestion option 1, but option 2 might still prove to have some use in certain circumstances.


Both options should work for all days you have at least one sale recorded. If there are missing days, I suggest you look into the package padr and then using the above tidyr to do the rest.

Community
  • 1
  • 1
Dave Gruenewald
  • 5,329
  • 1
  • 23
  • 35
  • Thanks. Here you specified the range of values for `product` manually. Is there a way to avoid this? In the real data there are hundreds of products, all defined by character values. – IceCreamToucan Sep 20 '17 at 20:21
  • Sure thing, you could do this by calling through column number in the `gather` function (in the above case, you would use `2:4`). You could also just list the first and last products as long as you use the semicolon (i.e. `product1:product100`). If you don't have other columns besides the three you have here, you could also use `2:ncol(.)` – Dave Gruenewald Sep 20 '17 at 20:27
  • 1
    I think it should instead be `tidyr::complete(df, day, product, fill = list(sales = 0))` . – Frank Sep 20 '17 at 20:45
1

If speed is a concern, a self join to fill in missing levels could be an option (see section 3.5.5 of Frank's Quick R Tutorial):

library(data.table)
setDT(df)[CJ(day = day, product = product, unique = TRUE), on = .(day, product)][
  is.na(sales), sales := 0.0][]
   day product      sales
1:   a       1 0.57406950
2:   a       2 0.04390324
3:   a       3 0.63809278
4:   b       1 0.00000000
5:   b       2 0.01203568
6:   b       3 0.61310815
7:   c       1 0.19049274
8:   c       2 0.61758172
9:   c       3 0.00000000

Benchmark

Create benchmark data of 1 million rows minus 10% missing = 0.9 M rows:

n_day <-  1e3L
n_prod <- 1e3L
n_rows <- n_day * n_prod
# how many rows to remove?
n_miss <- n_rows / 10L
set.seed(1L)
df <- expand.grid(day = 1:n_day, product = 1:n_prod)
df$sales <- runif(n_rows)
#remove rows
df <- df[-sample.int(n_rows, n_miss), ]
str(df)
'data.frame': 900000 obs. of  3 variables:
 $ day    : int  1 2 3 5 6 7 8 9 11 12 ...
 $ product: int  1 1 1 1 1 1 1 1 1 1 ...
 $ sales  : num  0.266 0.372 0.573 0.202 0.898 ...
 - attr(*, "out.attrs")=List of 2
  ..$ dim     : Named int  1000 1000
  .. ..- attr(*, "names")= chr  "day" "product"
  ..$ dimnames:List of 2
  .. ..$ day    : chr  "day=   1" "day=   2" "day=   3" "day=   4" ...
  .. ..$ product: chr  "product=   1" "product=   2" "product=   3" "product=   4" ...

Define check function:

my_check <- function(values) {
  all(sapply(values[-1], function(x) identical(as.data.frame(values[[1]]), as.data.frame(x))))
}

Run benchmarks:

library(data.table)
microbenchmark::microbenchmark(
  tidyr = tidyr::complete(df, day, product, fill = list(sales = 0)),
  dt = setDT(df)[CJ(day = day, product = product, unique = TRUE), on = .(day, product)][
    is.na(sales), sales := 0.0][],
  times = 3L,
  check = my_check
)
Unit: milliseconds
  expr       min        lq      mean    median        uq       max neval cld
 tidyr 1253.3395 1258.0595 1323.5438 1262.7794 1358.6459 1454.5124     3   b
    dt   94.4451  100.2952  155.4575  106.1452  185.9638  265.7823     3  a

For the given problem size of 1 M rows minus 10% missing, the tidyr solution is a magnitude slower than the data.table approach.

Community
  • 1
  • 1
Uwe
  • 41,420
  • 11
  • 90
  • 134
  • Thanks for the great solution! You may want to add this as an answer to https://stackoverflow.com/questions/35131191/in-r-how-to-add-rows-with-0-counts-to-summarised-output since this question has been marked as a duplicate. – IceCreamToucan Sep 21 '17 at 14:56
  • @RobJensen Will do if time permits. Thanks for hinting. – Uwe Sep 21 '17 at 14:59