10

I have a data frame that has 2 columns.

column1 has random numbers; column2 is a placeholder column that determines what I want column3 to look like.

  random    temp
0.502423373 1
0.687594055 0
0.741883739 0
0.445364032 0
0.50626137  0.5
0.516364981 0
...

I want to fill column3 so that each row takes the last non-zero number seen in column2 (1 or 0.5 in this example). That value should keep filling the following rows until a row with a different non-zero number is reached, and then the process repeats for the entire column.

random     temp state
0.502423373 1   1
0.687594055 0   1
0.741883739 0   1
0.445364032 0   1
0.50626137  0.5 0.5
0.516364981 0   0.5
0.807804708 0   0.5
0.247948445 0   0.5
0.46573337  0   0.5
0.103705154 0   0.5
0.079625868 1   1
0.938928944 0   1
0.677713019 0   1
0.112231619 0   1
0.165907178 0   1
0.836195267 0   1
0.387712998 1   1
0.147737077 0   1
0.439281543 0.5 0.5
0.089013503 0   0.5
0.84174743  0   0.5
0.931738707 0   0.5
0.807955172 1   1

Thanks for any and all help.

A5C1D2H2I1M1N2O1R2T1
user2813055

7 Answers

12

Perhaps you can make use of na.locf from the "zoo" package after setting values of "0" to NA. Assuming your data.frame is called "mydf":

mydf$state <- mydf$temp
mydf$state[mydf$state == 0] <- NA

library(zoo)
mydf$state <- na.locf(mydf$state)
#      random temp state
# 1 0.5024234  1.0   1.0
# 2 0.6875941  0.0   1.0
# 3 0.7418837  0.0   1.0
# 4 0.4453640  0.0   1.0
# 5 0.5062614  0.5   0.5
# 6 0.5163650  0.0   0.5

If there were NA values in your original data.frame in the "temp" column, and you wanted to keep them as NA in the newly generated "state" column too, that's easy to take care of. Just add one more line to reintroduce the NA values:

mydf$state[is.na(mydf$temp)] <- NA
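
One more edge case worth sketching: if "temp" starts with a 0, there is no earlier value to carry forward. By default na.locf drops leading NAs, so the result would be shorter than the column. Passing na.rm = FALSE keeps the length and leaves those leading entries as NA. The vector below is a made-up example, not the data from the question, and the code is skipped if zoo is not installed:

```r
# Hypothetical input that starts with zeros (not from the question)
temp <- c(0, 0, 1, 0, 0.5, 0)

if (requireNamespace("zoo", quietly = TRUE)) {
  state <- replace(temp, temp == 0, NA)        # zeros become NA
  state <- zoo::na.locf(state, na.rm = FALSE)  # keep leading NAs, same length as temp
  print(state)
}
```
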
A5C1D2H2I1M1N2O1R2T1
  • I think this would be bad if there are already NAs in the data. But if it works that's good too. – Neal Fultz Dec 06 '13 at 07:19
  • @NealFultz, and that comment warrants a down-vote? It's pretty easy to address your concern about the comment. (I'm presuming that you would want the value in the generated "state" variable to be `NA` if it was `NA` in the "temp" variable. Notice that I don't touch the "temp" variable, so I still have easy access to that information.) – A5C1D2H2I1M1N2O1R2T1 Dec 06 '13 at 07:32
  • And if you have NAs next to 0s? – Neal Fultz Dec 06 '13 at 07:58
  • 3
    @NealFultz, ??? How should I know. It's not my data and these conditions are not specified in the question. I would still guess that a `NA` next to a zero should be replaced with the last known value, and with the current data set, I don't see that this would be a problem. Or do you want to continue filling the data with `NA` when an `NA` is encountered? Please feel free to share the condition you perceive and how you propose dealing with it. I don't see that your present solution handles `NA` values, so I am eager to learn. – A5C1D2H2I1M1N2O1R2T1 Dec 06 '13 at 08:17
  • 1
    Just to clarify, there are no NAs, so this solution did the trick! – user2813055 Dec 06 '13 at 16:01
5

Inspired by the solution of @Ananda Mahto, this is an adaptation of the internal code of na.locf that works directly with 0s instead of NAs. That way you don't need the zoo package, and you don't need the preprocessing step of changing the values to NA. Benchmark tests show that this is about 10 times faster than the original version.

locf.0 <- function(x) {
  L <- x!=0
  idx <- c(0, which(L))[cumsum(L) + 1]
  return(x[idx])
} 
mydf$state <- locf.0(mydf$temp)
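
A caveat with the function above: if x starts with zeros, idx contains 0 for those positions, and x[0] is dropped, so the result is shorter than the input. Here is a hedged variant, under the assumption that leading zeros should come back as NA. The name locf0_keep_length is made up for this sketch:

```r
# Variant of locf.0 that keeps the input length.
# Assumption: leading zeros (nothing to carry forward yet) should become NA.
locf0_keep_length <- function(x) {
  L <- x != 0
  idx <- c(0, which(L))[cumsum(L) + 1]  # index of the last non-zero seen so far
  idx[idx == 0] <- NA                   # no non-zero seen yet
  x[idx]
}

temp <- c(1,0,0,0,.5,0,0,0,0,0,1,0,0,0,0,0,1,0,0.5,0,0,0,1)
state <- locf0_keep_length(temp)

leading <- locf0_keep_length(c(0, 0, 1, 0, 0.5, 0))
# leading is NA NA 1 1 0.5 0.5
```
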
shadow
3

Here is an interesting way with the Reduce function.

temp = c(1,0,0,0,.5,0,0,0,0,0,1,0,0,0,0,0,1,0,0.5,0,0,0,1)
fill_zero = function(x,y) if(y==0) x else y
state = Reduce(fill_zero, temp, accumulate=TRUE)
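
To make the accumulate = TRUE part less magic, here is a small trace on a shorter, made-up vector. Reduce calls fill_zero(previous_result, next_element) from left to right and keeps every intermediate result:

```r
fill_zero <- function(x, y) if (y == 0) x else y

small <- c(1, 0, 0, 0.5, 0)   # hypothetical short input
traced <- Reduce(fill_zero, small, accumulate = TRUE)
# step 1: 1                      (first element is the starting value)
# step 2: fill_zero(1, 0)   -> 1
# step 3: fill_zero(1, 0)   -> 1
# step 4: fill_zero(1, 0.5) -> 0.5
# step 5: fill_zero(0.5, 0) -> 0.5

# A leading zero has nothing before it, so it stays 0:
with_leading_zero <- Reduce(fill_zero, c(0, 0, 2, 0), accumulate = TRUE)
# 0 0 2 2
```
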

If you're worried about speed, you can try Rcpp.

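library(Rcpp)
cppFunction('
  NumericVector fill_zeros( NumericVector x ) {
    NumericVector out = clone(x);  // copy, so the R vector passed in is not modified in place
    for( int i=1; i<out.size(); i++ )
     if( out[i]==0 ) out[i] = out[i-1];
    return out;
  }
')
state = fill_zeros(temp)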
kdauria
3

Also, unless I'm overlooking something, this seems to work:

DF$state2 <- ave(DF$temp, cumsum(DF$temp), FUN = function(x) x[x != 0])
DF
#       random temp state state2
#1  0.50242337  1.0   1.0    1.0
#2  0.68759406  0.0   1.0    1.0
#3  0.74188374  0.0   1.0    1.0
#4  0.44536403  0.0   1.0    1.0
#5  0.50626137  0.5   0.5    0.5
#6  0.51636498  0.0   0.5    0.5
#7  0.80780471  0.0   0.5    0.5
#8  0.24794844  0.0   0.5    0.5
#9  0.46573337  0.0   0.5    0.5
#10 0.10370515  0.0   0.5    0.5
#11 0.07962587  1.0   1.0    1.0
#12 0.93892894  0.0   1.0    1.0
#13 0.67771302  0.0   1.0    1.0
#14 0.11223162  0.0   1.0    1.0
#15 0.16590718  0.0   1.0    1.0
#16 0.83619527  0.0   1.0    1.0
#17 0.38771300  1.0   1.0    1.0
#18 0.14773708  0.0   1.0    1.0
#19 0.43928154  0.5   0.5    0.5
#20 0.08901350  0.0   0.5    0.5
#21 0.84174743  0.0   0.5    0.5
#22 0.93173871  0.0   0.5    0.5
#23 0.80795517  1.0   1.0    1.0
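
Why this works: cumsum(DF$temp) only changes on rows where temp is non-zero, so each group is one non-zero value followed by its zeros. That relies on the running sums being distinct, which holds for the positive values here. As a hedged alternative for data where values could be negative or cancel out, grouping on cumsum(temp != 0) counts the non-zero rows instead. The vector below is made up for illustration:

```r
# Hypothetical input where the running sum revisits an earlier value (2, 1, 2)
temp2 <- c(2, 0, -1, 0, 1, 0)

# Group id increases by one at every non-zero row
grp <- cumsum(temp2 != 0)          # 1 1 2 2 3 3

# First element of each group is its non-zero value
# (a leading all-zero group would simply stay 0)
state2 <- ave(temp2, grp, FUN = function(x) x[1])
# state2 is 2 2 -1 -1 1 1
```
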
alexis_laz
  • I think `ave(DF$temp, cumsum(DF$temp), FUN = sum)` should work as well. – kdauria Dec 08 '13 at 17:16
  • @Kevin: Yeah, you're right! In this case, `sum`ming the values works, too. And, perhaps, it is faster too, because it avoids turning to logical before indexing? Although I still might prefer `x[x != 0]`, because it declares exactly what the `ave`raging function is. – alexis_laz Dec 08 '13 at 17:28
0

A loop along the following lines should do the trick for you. It assumes the first value is non-zero, because row 1 has no previous row to copy from -

for(i in seq(nrow(df)))
{
  if (df[i,"v1"] == 0) df[i,"v1"] <- df[i-1,"v1"]
}

Output -

> df
   v1 somedata
1   1       33
2   2       24
3   1       36
4   0       49
5   2       89
6   2       48
7   0        4
8   1       98
9   1       60
10  2       76
> 
> for(i in seq(nrow(df)))
+ {
+   if (df[i,"v1"] == 0) df[i,"v1"] <- df[i-1,"v1"]
+ }
> df
   v1 somedata
1   1       33
2   2       24
3   1       36
4   1       49
5   2       89
6   2       48
7   2        4
8   1       98
9   1       60
10  2       76
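
A slightly more defensive sketch of the same loop, starting at row 2 so it never looks at a row 0. The data frame is rebuilt from the output shown above; note that `df` shadows the base function of the same name, which is harmless here but worth knowing:

```r
# Same example data as in the output above
df <- data.frame(
  v1       = c(1, 2, 1, 0, 2, 2, 0, 1, 1, 2),
  somedata = c(33, 24, 36, 49, 89, 48, 4, 98, 60, 76)
)

# seq_len(n)[-1] is 2..n, and is empty when the data frame has 0 or 1 rows
for (i in seq_len(nrow(df))[-1]) {
  if (df$v1[i] == 0) df$v1[i] <- df$v1[i - 1]
}

df$v1
```
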
TheComeOnMan
0

I suggest using the run length encoding functions; they're a natural way of dealing with streaks in a data set. Using @Kevin's example vector:

temp = c(1,0,0,0,.5,0,0,0,0,0,1,0,0,0,0,0,1,0,0.5,0,0,0,1)
y <- rle(temp)
#str(y)
#List of 2
# $ lengths: int [1:11] 1 3 1 5 1 5 1 1 1 3 ...
# $ values : num [1:11] 1 0 0.5 0 1 0 1 0 0.5 0 ...
# - attr(*, "class")= chr "rle"


for( i in seq(y$values)[-1] ) {
   if(y$values[i] == 0) {
      y$lengths[i-1] = y$lengths[i] + y$lengths[i-1]
      y$lengths[i] = 0
   }
}

#str(y)
#List of 2
# $ lengths: num [1:11] 4 0 6 0 6 0 2 0 4 0 ...
# $ values : num [1:11] 1 0 0.5 0 1 0 1 0 0.5 0 ...
# - attr(*, "class")= chr "rle"

inverse.rle(y)
#  [1] 1.0 1.0 1.0 1.0 0.5 0.5 0.5 0.5 0.5 0.5 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.5
# [20] 0.5 0.5 0.5 1.0
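
For what it's worth, the same idea can be sketched without the loop. Instead of moving the lengths around, overwrite the value of each zero run with the value of the run just before it. This assumes a zero run that opens the vector should be left as 0, since it has no predecessor:

```r
temp <- c(1,0,0,0,.5,0,0,0,0,0,1,0,0,0,0,0,1,0,0.5,0,0,0,1)
y <- rle(temp)

zero_run <- y$values == 0
zero_run[1] <- FALSE                 # a leading zero run has nothing before it

# Adjacent runs always differ, so the run before a zero run is non-zero
y$values[zero_run] <- y$values[which(zero_run) - 1]

state <- inverse.rle(y)
```
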
Neal Fultz
-1

Simply use a loop with a global variable.

The global variable used here is m; r is a data frame with two columns, A and B.

r$B = c(1,NA, NA, NA, 3, NA,6)


m=1

for( i in 1:nrow(r) ){

  if(is.na(r$B[i])==FALSE ){

    m <<- i # please note the assign sign ,  " <<- "
    next()

  } else {

    r$B[i] = r$B[m]

  }

}

After execution: r$B = 1 1 1 1 3 3 6
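
A self-contained sketch of the same loop with ordinary assignment, in case it helps. It remembers the last non-NA value rather than its index. The contents of column A are invented here, since only B is given above:

```r
# Column A is a placeholder (hypothetical); B is the example from this answer
r <- data.frame(A = 1:7, B = c(1, NA, NA, NA, 3, NA, 6))

last <- NA_real_                     # last non-NA value seen so far
for (i in seq_len(nrow(r))) {
  if (is.na(r$B[i])) {
    r$B[i] <- last                   # fill the gap
  } else {
    last <- r$B[i]                   # remember the new value
  }
}

r$B
```
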

wibeasley
  • First off, this is a really bad and un-R-like way to achieve what OP is after. There are much *much* better (and vectorised) alternatives, see the other answers to this post. Secondly, the code you give is actually not reproducible. `r` is not defined anywhere, you mention `R` as a `data.frame` but R is case-sensitive. Using `<<-` in this context is precisely one of the examples for how *not* to use `<<-`: [The Evil and Wrong use is to modify variables in the global environment](https://stackoverflow.com/a/5785757/6530970). – Maurits Evers Feb 28 '19 at 04:03
  • [continued] Lastly, `next` is a [control flow statement](https://stat.ethz.ch/R-manual/R-patched/library/base/html/Control.html); `next` doesn't return a value, and it should be `next` instead of `next()`. I think this answer contributes little (if anything) to this post and therefore should be deleted as it promotes bad R coding practice. – Maurits Evers Feb 28 '19 at 07:28