How to expand dataframe

Question

I have this kind of data:

df <- data.frame(year=c(1999,1999,1999,2000,2000,2001,2011,2011,2011,2011), class=c("A","B","C","A","C","A","B","C","D","E"), 
             n=c(10,20,30,12,15,40,50,55,60,5), occurs=c(0,1,3,4,2,0,0,11,12,2))



> df
   year class  n occurs
1  1999     A 10      0
2  1999     B 20      1
3  1999     C 30      3
4  2000     A 12      4
5  2000     C 15      2
6  2001     A 40      0
7  2011     B 50      0
8  2011     C 55     11
9  2011     D 60     12
10 2011     E  5      2

I would like to expand this data like this:

   year class  n occurs
1  1999     A  1      0
1  1999     A  2      0
1  1999     A  3      0
...
1  1999     A 10      0

2  1999     B  1      0
2  1999     B  2      0
2  1999     B  3      0
...
2  1999     B 20      1
3  1999     C  1      1
3  1999     C  2      1
3  1999     C  3      1
3  1999     C  4      0
... the remaining occurs values for class C are all zeros: `n-occurs` = 27 zeros after 3 ones.

I want to repeat each row n times, as given by column n, and expand the occurs column into a 0/1 flag across those n rows, with the number of 1s given by the original occurs value. For example, if occurs is the integer 5 and n = 10, there will be 10 rows (same year and class) in which occurs is 0 five times and 1 five times.

EDIT: Please note that the new occurs sequence contains only 0s and 1s: the number of 0s is n-occurs and the number of 1s is given by occurs.

much faster in my experience is `do.call(data.table::CJ, df)`; `CJ` also has a `unique` argument, which it looks like you may want — MichaelChirico, Sep 05 '17 at 17:17
Very close to this post: https://stackoverflow.com/questions/2894775/replicate-each-row-of-data-frame-and-specify-the-number-of-replications-for-each — lmo, Sep 05 '17 at 18:24
There seems to be a typo in the final element of n and also you don't explain where you want the 0s and 1s of occurs. For example, should all 1s be at the end? or a random placement of 1s? — lmo, Sep 05 '17 at 18:26
You should clarify by posting a complete example (with corresponding desired output). I think I get it, but it's still unnecessarily opaque. — Frank, Sep 05 '17 at 19:18
@Frank: Thanks for looking into this. I'm underestimating ....I keep thinking the example is sufficient.... I expanded now, I hope this time is clear. — Maximilian, Sep 05 '17 at 19:29

Parfait · Accepted Answer · 2017-09-05T19:51:48.447

2

Consider lapply and do.call, using the data.frame() constructor to build each expanded block together with its occurs vector:

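# Build one expanded data frame per row of df
df_List <- lapply(seq_len(nrow(df)), FUN=function(d){
    # occurs[d] ones followed by (n[d] - occurs[d]) zeros
    occ <- c(rep(1, df$occurs[[d]]), rep(0, df$n[[d]]-df$occurs[[d]]))

    data.frame(year=df$year[[d]], class=df$class[[d]], n=seq_len(df$n[[d]]), occurs=occ)
})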

finaldf <- do.call(rbind, df_List)
head(finaldf, 20)
#    year class  n occurs
# 1  1999     A  1      0
# 2  1999     A  2      0
# 3  1999     A  3      0
# 4  1999     A  4      0
# 5  1999     A  5      0
# 6  1999     A  6      0
# 7  1999     A  7      0
# 8  1999     A  8      0
# 9  1999     A  9      0
# 10 1999     A 10      0
# 11 1999     B  1      1
# 12 1999     B  2      0
# 13 1999     B  3      0
# 14 1999     B  4      0
# 15 1999     B  5      0
# 16 1999     B  6      0
# 17 1999     B  7      0
# 18 1999     B  8      0
# 19 1999     B  9      0
# 20 1999     B 10      0
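
As a sanity check on an expansion like the one above, the following is a minimal sketch that rebuilds the result and verifies it against the original counts. The helper name `expand_row` and the key variables are hypothetical, and the check assumes that each (year, class) pair appears only once in df, which holds for the example data.

```r
df <- data.frame(
  year   = c(1999, 1999, 1999, 2000, 2000, 2001, 2011, 2011, 2011, 2011),
  class  = c("A", "B", "C", "A", "C", "A", "B", "C", "D", "E"),
  n      = c(10, 20, 30, 12, 15, 40, 50, 55, 60, 5),
  occurs = c(0, 1, 3, 4, 2, 0, 0, 11, 12, 2)
)

# rep() fails on negative counts, so reject rows where occurs exceeds n
stopifnot(all(df$occurs <= df$n))

# expand_row is a hypothetical helper: one block of n rows per input row,
# with occurs ones first and n - occurs zeros after them
expand_row <- function(d) {
  data.frame(
    year   = df$year[[d]],
    class  = df$class[[d]],
    n      = seq_len(df$n[[d]]),
    occurs = rep(c(1, 0), c(df$occurs[[d]], df$n[[d]] - df$occurs[[d]]))
  )
}
finaldf <- do.call(rbind, lapply(seq_len(nrow(df)), expand_row))

# Assumption: (year, class) identifies a row of df uniquely
key_expanded <- paste(finaldf$year, finaldf$class)
key_original <- paste(df$year, df$class)

# Count ones and rows per group, reordered to match the rows of df
ones_per_group <- tapply(finaldf$occurs, key_expanded, sum)[key_original]
rows_per_group <- tapply(finaldf$occurs, key_expanded, length)[key_original]

stopifnot(
  all(finaldf$occurs %in% c(0, 1)),   # only 0/1 flags
  all(ones_per_group == df$occurs),   # number of 1s matches occurs
  all(rows_per_group == df$n)         # number of rows matches n
)
```

The guard on `occurs <= n` matters because an input row with more occurrences than rows makes `rep()` stop with an error.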

edited Sep 05 '17 at 19:51

answered Sep 05 '17 at 19:09


This solution would work, but the `occurs` column should contain only `0` or `1` values. You also have `occurs=3` there. The numbers of `0`s and `1`s are based on `n` and `occurs`: the number of `1`s is given by `occurs` and the number of `0`s by `n-occurs`. – Maximilian Sep 05 '17 at 19:19
@Maximilian - see my answer to address this. – www Sep 05 '17 at 19:24
I'm testing exactly that, but see the last row, E, with n = 5 and occurs = 22. That leaves a negative number of zeros. – Parfait Sep 05 '17 at 19:32
You are absolutely right! Sorry about that, I'm correcting that! – Maximilian Sep 05 '17 at 19:45

score 0 · Answer 2 · answered Sep 05 '17 at 19:05

Here is a base R method that is closely related to the post linked in my comment above. That answer provides the method for generating the first two columns of the data.frame.

dat <- data.frame(df[1:2][rep(1:nrow(df), df$n),],
                  n=sequence(df$n),
                  occurs=unlist(mapply(function(x, y) rep(0:1, c(x-y, y)), df$n, df$occurs)))

Here, the first 2 columns are generated using that answer. n is generated using sequence, and occurs uses mapply and rep, with unlist flattening the result into a vector. This puts the 1s at the end of each group. To put the 1s at the beginning, use rep(1:0, c(y, x-y)); note that swapping 0:1 for 1:0 without also swapping the counts would produce x-y ones and y zeros. To get a random ordering of 1s and 0s, feed each rep result to sample inside the mapply function.
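
The three orderings described above can be sketched as follows. This is an illustration only, using the n and occurs values of the first three rows of df; the helper `shuffle` and the variable names are hypothetical, and `SIMPLIFY = FALSE` is added so that mapply always returns a list.

```r
# n and occurs for the first three rows of df (classes A, B, C in 1999)
n      <- c(10, 20, 30)
occurs <- c(0, 1, 3)

# 1s at the end of each group: (x - y) zeros, then y ones
ones_last <- unlist(mapply(function(x, y) rep(0:1, c(x - y, y)),
                           n, occurs, SIMPLIFY = FALSE))

# 1s at the beginning: the counts are swapped along with the values
ones_first <- unlist(mapply(function(x, y) rep(1:0, c(y, x - y)),
                            n, occurs, SIMPLIFY = FALSE))

# Random placement within each group. shuffle() permutes by index, which
# avoids the special case where sample(v) treats a single number v as 1:v.
shuffle <- function(v) v[sample.int(length(v))]
set.seed(1)  # only to make the random ordering reproducible
ones_random <- unlist(mapply(function(x, y) shuffle(rep(0:1, c(x - y, y))),
                             n, occurs, SIMPLIFY = FALSE))

# Every variant keeps the same number of 1s per group
group <- rep(seq_along(n), n)
stopifnot(
  all(tapply(ones_last,   group, sum) == occurs),
  all(tapply(ones_first,  group, sum) == occurs),
  all(tapply(ones_random, group, sum) == occurs)
)
```

Whichever ordering is used, only the position of the 1s within each group changes; the group sizes and the number of 1s stay tied to n and occurs.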

We can check that the data.frame has the proper number of rows:

nrow(dat) == sum(df$n)
[1] TRUE

The first 15 observations of dat:

head(dat, 15)
    year class  n occurs
1   1999     A  1      0
1.1 1999     A  2      0
1.2 1999     A  3      0
1.3 1999     A  4      0
1.4 1999     A  5      0
1.5 1999     A  6      0
1.6 1999     A  7      0
1.7 1999     A  8      0
1.8 1999     A  9      0
1.9 1999     A 10      0
2   1999     B  1      0
2.1 1999     B  2      0
2.2 1999     B  3      0
2.3 1999     B  4      0
2.4 1999     B  5      0
