
I have a data set including the following info:

id     class    year    n
25     A63      2006    3
25     F16      2006    1
39     0901     2001    1
39     0903     2001    3
39     0903     2003    2
39     1901     2003    1
...

There are about 100k different ids and more than 300 classes. The year varies from 1998 to 2007.

What I want to do is fill the time gap after each id-class combination first appears, through 2007, with n = 0, by id and class.

Then calculate, for each row, the cumulative sum of n within id and class (sum), the number of classes with n > 0 in that id-year (Qc), and the number of distinct classes seen so far for that id (Qs).

For example, the above 6 lines data should expand to the following table:

id     class     year     n  sum  Qc  Qs
25     A63      2006      3  3    2   2
25     F16      2006      1  1    2   2
25     A63      2007      0  3    0   2
25     F16      2007      0  1    0   2
39     0901      2001     1  1    2   2
39     0903      2001     3  3    2   2
39     0901      2002     0  1    0   2
39     0903      2002     0  3    0   2
39     0901      2003     0  1    2   3
39     0903      2003     2  5    2   3
39     1901      2003     1  1    2   3
39     0901      2004     0  1    0   3
39     0903      2004     0  5    0   3
39     1901      2004     0  1    0   3
...
39     0901      2007     0  1    0   3
39     0903      2007     0  5    0   3
39     1901      2007     0  1    0   3

I can solve it with an ugly for loop, but it takes about an hour to get the result. Is there a better way, such as vectorizing or using data.table?
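For reference, the whole computation can be sketched with data.table, the approach the question asks about. Column names follow the sample data, and the definitions of sum, Qc, and Qs are read off the expected table (cumulative n per id-class, classes active per id-year, and classes seen so far per id), so treat them as my interpretation rather than a confirmed spec:

```r
library(data.table)

dt <- data.table(id    = c(25, 25, 39, 39, 39, 39),
                 class = c("A63", "F16", "0901", "0903", "0903", "1901"),
                 year  = c(2006, 2006, 2001, 2001, 2003, 2003),
                 n     = c(3, 1, 1, 3, 2, 1))

# Expand each id-class pair from its first year through 2007 only
full <- dt[, .(year = min(year):2007), by = .(id, class)]
dt2  <- merge(full, dt, by = c("id", "class", "year"), all.x = TRUE)
dt2[is.na(n), n := 0]

setorder(dt2, id, class, year)
# sum: cumulative n within each id-class
dt2[, sum := cumsum(n), by = .(id, class)]
# Qc: classes with activity (n > 0) in that id-year
dt2[, Qc := sum(n > 0), by = .(id, year)]
# Qs: classes present so far; after the expansion above, each id-year
# holds exactly one row per class already seen, so row count works
dt2[, Qs := .N, by = .(id, year)]
```

Because the grid is built per id-class from the first observed year only, it stays far smaller than the full 100k x 300 x 10 Cartesian product.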

Bayes

2 Answers


Use expand.grid to get the Cartesian product of class and year.

Then merge your current data frame to this new one, and replace the resulting NA values of n with 0.

df <- data.frame(class = as.factor(c("A63", "F16", "0901", "0903", "0903", "1901")),
                 year  = c(2006, 2006, 2001, 2001, 2003, 2003),
                 n     = c(3, 1, 1, 3, 2, 1))

# All class-year combinations over the full range (1998-2007 in the question)
df2 <- expand.grid(class = levels(df$class),
                   year  = 1998:2007)

# Keep every combination; unmatched rows get NA for n, then fill with 0
df2 <- merge(df2, df, all.x = TRUE)
df2$n[is.na(df2$n)] <- 0
vpipkt
  • I don't think I can do it. expand.grid will eat all my memory since I have 100k(ids)*300(classes)*10(years98-07)=300 million rows. And I don't need all of them. You can see that I only fill the time gap after some id and classes happened, not before. – Bayes Feb 24 '15 at 20:26
  • I can see in your example output, but better to spell it out explicitly in the text of the question. – vpipkt Feb 24 '15 at 20:47
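To address the memory concern in the comment, the same merge idea works if the grid is built only from each id-class pair's first observed year onward, instead of the full Cartesian product. A base-R sketch (assuming df also carries the id column from the question, which the answer's sample data omits):

```r
# First year each id-class pair appears
first <- aggregate(year ~ id + class, data = df, FUN = min)

# One row per id-class-year, from the first year through 2007 only
grid <- do.call(rbind, lapply(seq_len(nrow(first)), function(i)
  data.frame(id    = first$id[i],
             class = first$class[i],
             year  = first$year[i]:2007)))

# Left-join the observed counts onto the grid, then fill gaps with 0
out <- merge(grid, df, by = c("id", "class", "year"), all.x = TRUE)
out$n[is.na(out$n)] <- 0
```

The grid then has at most (years remaining) rows per observed id-class pair, rather than 300 million.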

Using dplyr you could try:

library(dplyr)
df %>% group_by(class, id) %>% arrange(year) %>%
  do(merge(data.frame(year  = .$year[1]:2007,
                      id    = rep(.$id[1], 2007 - .$year[1] + 1),
                      class = rep(.$class[1], 2007 - .$year[1] + 1)),
           ., all.x = TRUE))

It groups the data by class and id, and merges each group with a data frame containing every year from that group's first year through 2007, carrying the group's id and class.

Edit: if you want to do this only after a certain id you could do:

as.data.frame(rbind(df[df$id <= 25, ],
                    df %>% filter(id > 25) %>% group_by(class, id) %>% arrange(year) %>%
                      do(merge(data.frame(year  = .$year[1]:2007,
                                          id    = rep(.$id[1], 2007 - .$year[1] + 1),
                                          class = rep(.$class[1], 2007 - .$year[1] + 1)),
                               ., all.x = TRUE))))
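With a newer tidyr (its complete() function postdates this answer), the same per-group fill can be sketched more compactly; this is an equivalent alternative, not the answer's original method:

```r
library(dplyr)
library(tidyr)

df %>%
  group_by(id, class) %>%
  # Within each id-class group, add the missing years up to 2007
  # and fill the new rows' n with 0
  complete(year = min(year):2007, fill = list(n = 0)) %>%
  ungroup()
```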
NicE
  • As I edited in the question, I don't need that many years here. I only fill the gap after an id and class have appeared. – Bayes Feb 24 '15 at 21:34
  • I edited so that the change is only for ids > 25 and the fill-in years start at 2001, but your question is a bit unclear... I don't know which classes should be expanded and, if they should, to which years. – NicE Feb 24 '15 at 21:51
  • Thank you. Maybe I didn't say it clearly. I want to expand each id-class to the years after it first appears, up to 2007. Different id-class combinations need different start years. That's why I don't think your edit solves my question, since I have so many id-class combinations. – Bayes Feb 24 '15 at 21:59
  • Ah, I think I get it now: for each id-class you want to expand to 2007, but keep the start date at the first date of that id-class combination. I updated again so that it calculates which year to start from for each id-class combination. – NicE Feb 24 '15 at 22:22