2

I need to fill $Year with missing values of the sequence by the factor of $Country. The $Count column can just be padded out with 0's.

Country Year Count
A       1    1
A       2    1
A       4    2
B       1    1
B       3    1

So I end up with

Country Year Count
A       1    1
A       2    1
A       3    0
A       4    2
B       1    1
B       2    0
B       3    1

Hope that's clear guys, thanks in advance!

Sotos
  • 51,121
  • 6
  • 32
  • 66
C_psy
  • 647
  • 8
  • 22
  • What's your question? – pogibas Aug 18 '17 at 18:58
  • What have your tried so far? – lmo Aug 18 '17 at 19:01
  • Sorry for closing; reopened now. I guess this is different enough from the usual expand.grid solutions. Here for reference: https://stackoverflow.com/questions/9996452/r-find-and-add-missing-non-existing-rows-in-time-related-data-frame and https://stackoverflow.com/questions/10954602/handling-missing-combinations-of-factors-in-r – Frank Aug 18 '17 at 19:04

7 Answers7

6

This is a dplyr/tidyr solution using complete and full_seq:

library(dplyr)
library(tidyr)
df %>% group_by(Country) %>% complete(Year=full_seq(Year,1),fill=list(Count=0))
  Country  Year Count
    <chr> <dbl> <dbl>
1       A     1     1
2       A     2     1
3       A     3     0
4       A     4     2
5       B     1     1
6       B     2     0
7       B     3     1
Lamia
  • 3,845
  • 1
  • 12
  • 19
4
library(data.table)
# d is your original data.frame
setDT(d)
foo <- d[, .(Year = min(Year):max(Year)), Country]
res <- merge(d, foo, all.y = TRUE)[is.na(Count), Count := 0]

enter image description here

pogibas
  • 27,303
  • 19
  • 84
  • 117
4

Similar to @PoGibas' answer:

library(data.table)

# set default values
def = list(Count = 0L)

# create table with all levels    
fullDT = setkey(DT[, .(Year = seq(min(Year), max(Year))), by=Country])

# initialize to defaults
fullDT[, names(def) := def ]

# overwrite from data
fullDT[DT, names(def) := mget(sprintf("i.%s", names(def))) ]

which gives

   Country Year Count
1:       A    1     1
2:       A    2     1
3:       A    3     0
4:       A    4     2
5:       B    1     1
6:       B    2     0
7:       B    3     1

This generalizes to having more columns (besides Count). I guess similar functionality exists in the "tidyverse", with a name like "expand" or "complete".

Frank
  • 66,179
  • 8
  • 96
  • 180
4

Another base R idea can be to split on Country, use setdiff to find the missing values from the seq(max(Year)), and rbind them to original data frame. Use do.call to rbind the list back to a data frame, i.e.

d1 <- do.call(rbind, c(lapply(split(df, df$Country), function(i){
                       x <- rbind(i, data.frame(Country = i$Country[1], 
                                                 Year = setdiff(seq(max(i$Year)), i$Year), 
                                                 Count = 0)); 
                        x[with(x, order(Year)),]}), make.row.names = FALSE))

which gives,

     Country Year Count
  1       A    1     1
  2       A    2     1
  3       A    3     0
  4       A    4     2
  5       B    1     1
  6       B    2     0
  7       B    3     1
Sotos
  • 51,121
  • 6
  • 32
  • 66
  • instead of the `lapply(split(df,df$Country),function(i)..` you can use `by(df,df$Country,function(i)...` or `tapply` – Onyambu Aug 18 '17 at 20:41
  • @Onyambu true. I guess `lapply(split...))` comes to me more naturally. Thanks for the suggestions. – Sotos Aug 18 '17 at 20:59
2
> setkey(DT,Country,Year)
> DT[setkey(DT[, .(min(Year):max(Year)), by = Country], Country, V1)]
   Country Year Count
1:       A    1     1
2:       A    2     1
3:       A    3    NA
4:       A    4     2
5:       B    1     1
6:       B    2    NA
7:       B    3     1
Sagar
  • 2,778
  • 1
  • 8
  • 16
2

Another dplyr and tidyr solution.

library(dplyr)
library(tidyr)

dt2 <- dt %>%
  group_by(Country) %>%
  do(data_frame(Country = unique(.$Country),
                Year = full_seq(.$Year, 1))) %>%
  full_join(dt, by = c("Country", "Year")) %>%
  replace_na(list(Count = 0))
www
  • 38,575
  • 12
  • 48
  • 84
2

Here is an approach in base R that uses tapply, do.call, range, and seq, to calculate year sequences. Then constructs a data.frame from the named list that is returned, merges this onto the original which adds the desired rows, and finally fills in missing values.

# get named list with year sequences
temp <- tapply(dat$Year, dat$Country, function(x) do.call(seq, as.list(range(x))))

# construct data.frame
mydf <- data.frame(Year=unlist(temp), Country=rep(names(temp), lengths(temp)))

# merge onto original
mydf <- merge(dat, mydf, all=TRUE)

# fill in missing values
 mydf[is.na(mydf)] <- 0

This returns

mydf
  Country Year Count
1       A    1     1
2       A    2     1
3       A    3     0
4       A    4     2
5       B    1     1
6       B    2     0
7       B    3     1
lmo
  • 37,904
  • 9
  • 56
  • 69