Fill sequence by factor

Question

I need to fill $Year with missing values of the sequence by the factor of $Country. The $Count column can just be padded out with 0's.

Country Year Count
A       1    1
A       2    1
A       4    2
B       1    1
B       3    1

So I end up with

Country Year Count
A       1    1
A       2    1
A       3    0
A       4    2
B       1    1
B       2    0
B       3    1

Hope that's clear guys, thanks in advance!

Sorry for closing; reopened now. I guess this is different enough from the usual expand.grid solutions. Here for reference: https://stackoverflow.com/questions/9996452/r-find-and-add-missing-non-existing-rows-in-time-related-data-frame and https://stackoverflow.com/questions/10954602/handling-missing-combinations-of-factors-in-r — Frank, Aug 18 '17 at 19:04

Lamia · Answer 1 · 2017-08-18T19:49:11.253

6

This is a dplyr/tidyr solution using complete and full_seq:

library(dplyr)
library(tidyr)
df %>% group_by(Country) %>% complete(Year=full_seq(Year,1),fill=list(Count=0))
  Country  Year Count
    <chr> <dbl> <dbl>
1       A     1     1
2       A     2     1
3       A     3     0
4       A     4     2
5       B     1     1
6       B     2     0
7       B     3     1

edited Aug 18 '17 at 19:49

answered Aug 18 '17 at 19:19

Lamia

3,845
1
12
19

pogibas · Accepted Answer · 2017-08-18T19:19:46.697

4

library(data.table)
# d is your original data.frame
setDT(d)
foo <- d[, .(Year = min(Year):max(Year)), Country]
res <- merge(d, foo, all.y = TRUE)[is.na(Count), Count := 0]

edited Aug 18 '17 at 19:19

answered Aug 18 '17 at 19:05

pogibas

27,303
19
84
117

score 4 · Answer 3 · answered Aug 18 '17 at 19:12

Similar to @PoGibas' answer:

library(data.table)

# set default values
def = list(Count = 0L)

# create table with all levels    
fullDT = setkey(DT[, .(Year = seq(min(Year), max(Year))), by=Country])

# initialize to defaults
fullDT[, names(def) := def ]

# overwrite from data
fullDT[DT, names(def) := mget(sprintf("i.%s", names(def))) ]

which gives

   Country Year Count
1:       A    1     1
2:       A    2     1
3:       A    3     0
4:       A    4     2
5:       B    1     1
6:       B    2     0
7:       B    3     1

This generalizes to having more columns (besides Count). I guess similar functionality exists in the "tidyverse", with a name like "expand" or "complete".

Sotos · Answer 4 · 2017-08-19T11:39:19.830

4

Another base R idea can be to split on Country, use setdiff to find the missing values from the seq(max(Year)), and rbind them to original data frame. Use do.call to rbind the list back to a data frame, i.e.

d1 <- do.call(rbind, c(lapply(split(df, df$Country), function(i){
                       x <- rbind(i, data.frame(Country = i$Country[1], 
                                                 Year = setdiff(seq(max(i$Year)), i$Year), 
                                                 Count = 0)); 
                        x[with(x, order(Year)),]}), make.row.names = FALSE))

which gives,

     Country Year Count
  1       A    1     1
  2       A    2     1
  3       A    3     0
  4       A    4     2
  5       B    1     1
  6       B    2     0
  7       B    3     1

edited Aug 19 '17 at 11:39

answered Aug 18 '17 at 20:01

Sotos

51,121
6
32
66

instead of the `lapply(split(df,df$Country),function(i)..` you can use `by(df,df$Country,function(i)...` or `tapply` – Onyambu Aug 18 '17 at 20:41
@Onyambu true. I guess `lapply(split...))` comes to me more naturally. Thanks for the suggestions. – Sotos Aug 18 '17 at 20:59

score 2 · Answer 5 · answered Aug 18 '17 at 19:18

> setkey(DT,Country,Year)
> DT[setkey(DT[, .(min(Year):max(Year)), by = Country], Country, V1)]
   Country Year Count
1:       A    1     1
2:       A    2     1
3:       A    3    NA
4:       A    4     2
5:       B    1     1
6:       B    2    NA
7:       B    3     1

www · Answer 6 · 2017-08-18T19:31:38.603

2

Another dplyr and tidyr solution.

library(dplyr)
library(tidyr)

dt2 <- dt %>%
  group_by(Country) %>%
  do(data_frame(Country = unique(.$Country),
                Year = full_seq(.$Year, 1))) %>%
  full_join(dt, by = c("Country", "Year")) %>%
  replace_na(list(Count = 0))

edited Aug 18 '17 at 19:31

answered Aug 18 '17 at 19:20

www

38,575
12
48
84

lmo · Answer 7 · 2017-08-18T19:53:28.153

Here is an approach in base R that uses tapply, do.call, range, and seq, to calculate year sequences. Then constructs a data.frame from the named list that is returned, merges this onto the original which adds the desired rows, and finally fills in missing values.

# get named list with year sequences
temp <- tapply(dat$Year, dat$Country, function(x) do.call(seq, as.list(range(x))))

# construct data.frame
mydf <- data.frame(Year=unlist(temp), Country=rep(names(temp), lengths(temp)))

# merge onto original
mydf <- merge(dat, mydf, all=TRUE)

# fill in missing values
 mydf[is.na(mydf)] <- 0

This returns

mydf
  Country Year Count
1       A    1     1
2       A    2     1
3       A    3     0
4       A    4     2
5       B    1     1
6       B    2     0
7       B    3     1

Fill sequence by factor

7 Answers7