
I have a data frame with a sequence in 'col1' and values in 'col2':

col1 col2
2     0.02
5     0.12
9     0.91
13    1.13

I want to expand the irregular sequence in 'col1' with a regular sequence from 1 to 13. For the values in 'col1' which are missing in the original data, I want 'col2' to have the value 0 in the final output:

col1  col2
1     0
2     0.02
3     0
4     0
5     0.12
6     0
7     0
8     0
9     0.91
10    0
11    0
12    0
13    1.13

How can I do this in R?

Henrik
user2808642
    May I just say that I really like this thread for all the different ways people showed to tackle the problem? – LAP Jan 12 '17 at 13:25
  • I think it could be worth linking this question (with heaps of very nice answers!) to a similar question based on grouped data: [Fastest way to add rows for missing values in a data.frame?](http://stackoverflow.com/questions/10438969/fastest-way-to-add-rows-for-missing-values-in-a-data-frame) – Henrik Jan 12 '17 at 15:01

9 Answers

library(tidyr)

complete(d, col1 = 1:13, fill = list(col2 = 0))

or

complete(d, col1 = seq(max(col1)), fill = list(col2 = 0))
# A tibble: 13 × 2
    col1  col2
   <int> <dbl>
1      1  0.00
2      2  0.02
3      3  0.00
4      4  0.00
5      5  0.12
6      6  0.00
7      7  0.00
8      8  0.00
9      9  0.91
10    10  0.00
11    11  0.00
12    12  0.00
13    13  1.13

or

library(dplyr)

left_join(data.frame(col1 = seq(max(d$col1))), d)

But this will leave NAs instead of zeros.
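If zeros are needed after the join, the NAs can be filled in afterwards. A minimal sketch using dplyr::coalesce (not part of the original answer; the name res is just illustrative):

```r
library(dplyr)

# example data from the question
d <- data.frame(col1 = c(2L, 5L, 9L, 13L), col2 = c(0.02, 0.12, 0.91, 1.13))

res <- left_join(data.frame(col1 = seq(max(d$col1))), d, by = "col1")
res$col2 <- coalesce(res$col2, 0)  # replace the NAs from the join with 0
```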

Axeman

Just for completeness, a self binary join using data.table (you will get NAs instead of zeroes, but that can easily be changed if needed):

library(data.table)
setDT(df)[.(seq(max(col1))), on = .(col1)]
#     col1 col2
#  1:    1   NA
#  2:    2 0.02
#  3:    3   NA
#  4:    4   NA
#  5:    5 0.12
#  6:    6   NA
#  7:    7   NA
#  8:    8   NA
#  9:    9 0.91
# 10:   10   NA
# 11:   11   NA
# 12:   12   NA
# 13:   13 1.13
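For example, the NAs could be turned into zeroes by updating the joined result by reference. A minimal sketch (not part of the original answer; the name res is just illustrative):

```r
library(data.table)

# example data from the question
df <- data.table(col1 = c(2L, 5L, 9L, 13L), col2 = c(0.02, 0.12, 0.91, 1.13))

res <- df[.(seq(max(df$col1))), on = .(col1)]  # binary join against 1:13
res[is.na(col2), col2 := 0]                    # update by reference: NA -> 0
```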
David Arenburg

Another way would be the following. Your data is called mydf here. First, create a data frame foo whose col1 runs from 1 to the maximum value of col1 in mydf. Then assign the values of col2 in mydf to a new column col2 in foo, using the numbers in mydf's col1 as the index. At this point col2 in foo contains NA at the missing positions, so the final step is to locate those positions with is.na() and assign zeros to them.

foo <- data.frame(col1 = 1:max(mydf$col1))
foo$col2[mydf$col1] <- mydf$col2
foo$col2[is.na(foo$col2)] <- 0

Taking lmo's idea into an account, you can create a data frame with 0 first and avoid the 3rd step.

foo <- data.frame(col1 = 1:max(mydf$col1), col2 = 0)
foo$col2[mydf$col1] <- mydf$col2


#   col1 col2
#1     1 0.00
#2     2 0.02
#3     3 0.00
#4     4 0.00
#5     5 0.12
#6     6 0.00
#7     7 0.00
#8     8 0.00
#9     9 0.91
#10   10 0.00
#11   11 0.00
#12   12 0.00
#13   13 1.13

DATA

mydf <- structure(list(col1 = c(2L, 5L, 9L, 13L), col2 = c(0.02, 0.12, 
0.91, 1.13)), .Names = c("col1", "col2"), class = "data.frame", row.names = c(NA, 
-4L))
jazzurro
    or `foo <- data.frame(col1 = 1:max(mydf$col1), col2=0)` in first line. So you don't have to do the third line. – lmo Jan 12 '17 at 13:06
  • @lmo Yeah that is true. Let me revise my post. Thanks. – jazzurro Jan 12 '17 at 13:10
    @jazzurro I think your original solution generalizes better for more columns. For instance, if there are 10 columns in the original data set, you could do `foo[names(mydf)[-1]][mydf$col1] <- mydf[-1]` or something similar (untested). Then changing all zeroes at ones is pretty straightforward too. – David Arenburg Jan 12 '17 at 13:23
  • @DavidArenburg Thanks for that feedback. :) – jazzurro Jan 12 '17 at 13:29

Just to add a different point of view, consider that what you have can be seen as a sparse vector, i.e. a vector in which only the non-zero values are defined. Sparse vectors are implemented by the Matrix package in R. If df is your initial data.frame, try:

require(Matrix)
data.frame(col1=seq_len(max(df$col1)),
      col2=as.vector(sparseVector(df$col2,df$col1,max(df$col1))))
#   col1 col2
#1     1 0.00
#2     2 0.02
#3     3 0.00
#4     4 0.00
#5     5 0.12
#6     6 0.00
#7     7 0.00
#8     8 0.00
#9     9 0.91
#10   10 0.00
#11   11 0.00
#12   12 0.00
#13   13 1.13

The same result in a one-liner base R:

data.frame(col1=seq_len(max(df$col1)),
   col2=`[<-`(numeric(max(df$col1)),df$col1,df$col2))
nicola

Here is a function that uses expandRows from the splitstackshape package:

expand_seq <- function(x){
  x$new <- c(x$col1[1], diff(x$col1))
  new_df <- splitstackshape::expandRows(x, 'new')
  new_df$col1 <- seq(max(new_df$col1))
  new_df$col2[!new_df$col1 %in% x$col1] <- 0
  rownames(new_df) <- NULL
  return(new_df)
}

expand_seq(df)
#   col1 col2
#1     1 0.00
#2     2 0.02
#3     3 0.00
#4     4 0.00
#5     5 0.12
#6     6 0.00
#7     7 0.00
#8     8 0.00
#9     9 0.91
#10   10 0.00
#11   11 0.00
#12   12 0.00
#13   13 1.13
Sotos

There are already some interesting answers here.

Just to hop in: we can create a sequence of numbers from 1 to max(col1) and then get the respective value of col2 using match.

col1 = seq(1, max(df$col1))
data.frame(col1, col2 = df$col2[match(col1, df$col1)])

#   col1 col2
#1     1   NA
#2     2 0.02
#3     3   NA
#4     4   NA
#5     5 0.12
#6     6   NA
#7     7   NA
#8     8   NA
#9     9 0.91
#10   10   NA
#11   11   NA
#12   12   NA 
#13   13 1.13

This will give NAs instead of 0. If we need 0's,

data.frame(col1,col2 = ifelse(is.na(match(col1,df$col1)), 0,
                                             df$col2[match(col1,df$col1)]))

#   col1 col2
#1     1 0.00
#2     2 0.02
#3     3 0.00
#4     4 0.00
#5     5 0.12
#6     6 0.00
#7     7 0.00
#8     8 0.00
#9     9 0.91
#10   10 0.00
#11   11 0.00
#12   12 0.00
#13   13 1.13
Ronak Shah

We can use base R with merge and replace:

transform(merge(data.frame(col1= 1:13), df, all.x=TRUE),
                      col2 = replace(col2, is.na(col2), 0))
#    col1 col2
#1     1 0.00
#2     2 0.02
#3     3 0.00  
#4     4 0.00
#5     5 0.12
#6     6 0.00
#7     7 0.00
#8     8 0.00
#9     9 0.91
#10   10 0.00
#11   11 0.00
#12   12 0.00
#13   13 1.13
akrun

I didn't see a simple merge solution, so here is one:

res <- merge(data.frame(col1=1:max(df$col1)),df,by="col1",all.x=TRUE)
res$col2 <- ifelse(is.na(res$col2),0,res$col2)

The second line is replacing the NA's from the merge (left outer join) with zeros. As @Axeman points out, this can also be accomplished by:

res$col2[is.na(res$col2)] <- 0

The result is:

res
##   col1 col2
##1     1 0.00
##2     2 0.02
##3     3 0.00
##4     4 0.00
##5     5 0.12
##6     6 0.00
##7     7 0.00
##8     8 0.00
##9     9 0.91
##10   10 0.00
##11   11 0.00
##12   12 0.00
##13   13 1.13
aichao

Another way would be:

for (i in 1:max(test$col1)) {
  if(!(i %in% test$col1)) (test <- rbind(test, c(i, 0)))
}
test <- test[order(test$col1),]

Axeman's answer is really sweet, though.

Edit: Data used --

test <- structure(list(col1 = c(2, 5, 9, 13), col2 = c(0.02, 0.12, 0.91, 
1.13)), .Names = c("col1", "col2"), row.names = c(NA, -4L), class = "data.frame")

DISCLAIMER: This should really not be used for big datasets. I tried it with 1k rows and it was done in a heartbeat, but my second test with 100k rows is running for minutes now, which really emphasizes Axeman's concerns in his comment.

LAP
    In case performance matters, a loop in which you grow an object is pretty much the worst case scenario. A vectorized solution is much better. – Axeman Jan 12 '17 at 13:14
    Yep, I agree with that. The OP stated `my data is`, therefore I did not really think about computing time. // Edit: For funs and giggles, I just ran the code with 100k rows... let's just say it is still running... – LAP Jan 12 '17 at 13:18