3

I would like to create new rows in a data.frame for all missing years for each group (firm and type). The dataframe looks as follows:

minimal <- data.frame(firm = c("A","A","A","B","B","B","A","A","A","B","B","B"),
                  type = c("X","X","X","X","X","X","Y","Y","Y","Y","Y","Y"),
                  year = c(2000,2004,2007,2010,2008,2001,2002,2003,2007,2000,2001,2008),
                  value = c(1,3,7,9,9,2,3,3,7,5,9,15)
                  )

Dataframe:

firm type year value
A    X    2000     1
A    X    2004     3
A    X    2007     7
B    X    2010     9
B    X    2008     9
B    X    2001     2
A    Y    2002     3
A    Y    2003     3
A    Y    2007     7
B    Y    2000     5
B    Y    2001     9
B    Y    2008    15

Now, what I want to get is the following: I can see in the data that the minimum year is 2000 and the maximum is 2010. I want to add a row for each missing year for each combination of firm-type. E.g. for firm A and type X, I would like to add rows such that it looks like this:

Final output:

firm type year value
A    X    2000     1
A    X    2004     3
A    X    2007     7
A    X    2001     1
A    X    2002     1
A    X    2003     1
A    X    2005     3
A    X    2006     3
A    X    2008     7
A    X    2009     7
A    X    2010     7

Additionally, I want to write the value from the previous year into the column 'value' for the missing row for all subsequent years until a new non-missing row appears (as seen in the final output example).

I have not yet come up with any useful code, but what I have found so far is the following which might be the right direction:

setDT(minimal)[, .SD[match(2000:2010, year)],
                           by = c("firm","type")]

I don't really understand the concept of setDT and .SD, but this creates at least one row for each firm type combination. However, there is not content for year.

Thanks a lot in advance!

Sotos
  • 51,121
  • 6
  • 32
  • 66
lorenzbr
  • 161
  • 11
  • I think there are dupes for this. Check for `?complete` from `tidyr` or `?expand.grid` from `base R` or `CJ` from `data.table` – akrun May 05 '17 at 07:35
  • Okay I came up with `min2 <- expand.grid(year = min(minimal$year):max(minimal$year), firm = unique(minimal$firm), type = unique(minimal$type))` and `merge(min2,minimal, by = c("firm","type","year"), all.x = T)`. Now I only need to add the correct values to each row which I still don'tk now how to do. – lorenzbr May 05 '17 at 08:21
  • try this: `library(dplyr); library(tidyr); minimal %>% group_by(firm, type) %>% complete(year = full_seq(year, 1)) %>% fill(value)` – Sotos May 05 '17 at 08:46
  • Cool this is really nice code. However, I still have the problem that it takes the minimum and maximum year for each group(firm,type). I actually need the total minimum and maximum which often differs from the group min and max. – lorenzbr May 05 '17 at 08:55
  • 1
    Okay, it is simply `year = full_seq(2000:2010,1)`. Thanks! – lorenzbr May 05 '17 at 09:01

3 Answers3

4

I could not find an exact dupe for this so here is a possible solution,

library(dplyr)
library(tidyr)

minimal %>% 
  group_by(firm, type) %>% 
  complete(year = full_seq(2000:2010, 1)) %>% 
  fill(value)
Sotos
  • 51,121
  • 6
  • 32
  • 66
0

I wrote this code that do what you wanted, maybe it's not so efficient or elegant but it works:

# Input dataframe
minimal <- data.frame(firm = c("A","A","A","B","B","B","A","A","A","B","B","B"),
                      type = c("X","X","X","X","X","X","Y","Y","Y","Y","Y","Y"),
                      year = c(2000,2004,2007,2010,2008,2001,2002,2003,2007,2000,2001,2008),
                      value = c(1,3,7,9,9,2,3,3,7,5,9,15)
)

# Sorting is needed
minimal = minimal[order(minimal$firm, minimal$type, minimal$year),]

# Variables used
table = table(minimal$firm=="A", minimal$type=="X")
minYear = min(minimal$year)
maxYear = max(minimal$year)
startPos = 0

# Iterates the dataframe
for(i in 1:2){
  for(j in 1:2){
    prevValue = 0
    currYear = minYear

    # Adds minimum year if needed
    if(minimal$year[1+startPos] != currYear){
      newRow = c(as.character(minimal$firm[1+startPos]), as.character(minimal$type[1+startPos]), currYear, prevValue)
      minimal = rbind(minimal, newRow)
    }

    # Adds years
    for(k in (1+startPos):(table[i,j]+startPos)){
      if(minimal$year[k]!=currYear){
        currYear = currYear + 1
        while(minimal$year[k]!=currYear){
          newRow = c(as.character(minimal$firm[k]), as.character(minimal$type[k]), currYear, prevValue)
          minimal = rbind(minimal, newRow)
          currYear = currYear + 1
        }
      }
      prevValue = minimal$value[k]
    }

    # Adds years from last to maximum
    if(currYear < maxYear){
      for(l in 1:(maxYear - currYear)){
        newRow = c(as.character(minimal$firm[k]), as.character(minimal$type[k]), currYear+l, prevValue)
        minimal = rbind(minimal, newRow)
      }
    }
    startPos = startPos + table[i,j]

  }
}

# Result
minimal = minimal[order(minimal$firm, minimal$type, minimal$year),]
minimal
kennyFF92
  • 85
  • 9
0

Here is a data.table solution.

library(data.table)

dt <- setDT(minimal)[CJ(firm=firm, type=type, year=seq(min(year), max(year)), unique=TRUE),
              on=.(firm, type, year), roll=TRUE]

This returns

head(dt, 15)
    firm type year value
 1:    A    X 2000     1
 2:    A    X 2001     1
 3:    A    X 2002     1
 4:    A    X 2003     1
 5:    A    X 2004     3
 6:    A    X 2005     3
 7:    A    X 2006     3
 8:    A    X 2007     7
 9:    A    X 2008     7
10:    A    X 2009     7
11:    A    X 2010     7
12:    A    Y 2000    NA
13:    A    Y 2001    NA
14:    A    Y 2002     3
15:    A    Y 2003     3

Notice that the initial rows of the second firm-type combo are NA. If you want to fill these in with the subsequent year, you can adjust the argument of fill to "nearest", though this could effect the values in the middle of the data.

lmo
  • 37,904
  • 9
  • 56
  • 69