2

I’m trying to replace Us (or NAs, easy to lead in Us as NAs) in my dataframe with the values that are in the column above or below them. I.e.

0 1 0 1
U U U U
0 1 1 0

Would become

0 1 0 1    
0 1 U U
0 1 1 0

I have a for loop to do this, which works on subsets of the data

for(i in 2:((NROW(Sample_table))-1)) {
  for(j in 3:NCOL(Sample_table)) {
if((Sample_table[i,j]=="U")&(Sample_table[(i-1),j]==Sample_table[(i+1),j])){
  Sample_table[i,j] <- Sample_table[(i+1),j]
}
  }
}

(doesn't start at 1:1 because first couple of rows/columns contain positions/names). However, my final dataset is 152 columns and ~6 million rows, so the for loop isn’t a good solution (attempted to do this, ran for a week without finishing). I’ve tried using apply, but can’t work out how to get it to refer to other rows, I’ve tried using ifelse, but can only get it to work within a for loop. Any help or suggestions?

EDIT ###

I thought Maurits had solved it below, but when I apply it to a larger dataframe, it doesn't give the expected output:

df <- read.table(text =
               "0 1 0 1 0 1 1 0
U U U U 1 0 1 1
0 1 1 0 0 1 0 1
0 1 0 1 0 1 1 0
U U U U 1 0 1 1
0 1 1 0 0 1 0 1
             ", header = F)
 > df
  V1 V2 V3 V4 V5 V6 V7 V8
1  0  1  0  1  0  1  1  0
2  U  U  U  U  1  0  1  1
3  0  1  1  0  0  1  0  1
4  0  1  0  1  0  1  1  0
5  U  U  U  U  1  0  1  1
6  0  1  1  0  0  1  0  1

> df2 <- as.data.frame(sapply(df, function(x) replace(x, x[1] == x[3] & x[2] 
== "U", x[1])))
> df2
  V1 V2 V3 V4 V5 V6 V7 V8
1  1  1  1  2  0  1  1  0
2  1  1  3  3  1  0  1  1
3  1  1  2  1  0  1  0  1
4  1  1  1  2  0  1  1  0
5  1  1  3  3  1  0  1  1
6  1  1  2  1  0  1  0  1
EDIT 2

Compared the methods: apply is quickest (that gets the right answer):

devtools::install_github("olafmersmann/microbenchmarkCore")
devtools::install_github("olafmersmann/microbenchmark")
library(microbenchmark)
mbm <- microbenchmark("apply_wrong_version" = {df <- read.table(text =
                                                  "0 1 0 1 0 1 1 0
U U U U 1 0 1 1
0 1 1 0 0 1 0 1
0 1 0 1 0 1 1 0
U U U U 1 0 1 1
0 1 1 0 0 1 0 1
                 ", header = F)
df2 <- as.data.frame(sapply(df, function(x) replace(x, x[1] == x[3] & x[2] 
== "U", x[1])))
df2},"forloop" = {df <- read.table(text =
                                     "0 1 0 1 0 1 1 0
U U U U 1 0 1 1
0 1 1 0 0 1 0 1
0 1 0 1 0 1 1 0
U U U U 1 0 1 1
0 1 1 0 0 1 0 1
                 ", header = F)
  for(i in 2:((NROW(df))-1)) {
    for(j in 1:NCOL(df)) {
      if((df[i,j]=="U")&(df[(i-1),j]==df[(i+1),j])){
        df[i,j] <- df[(i+1),j]
      }
    }
  }
},"na.locf_version" = {mat=read.table(text =
                                           "0 1 0 1 0 1 1 0
U U U U 1 0 1 1
0 1 1 0 0 1 0 1
0 1 0 1 0 1 1 0
U U U U 1 0 1 1
0 1 1 0 0 1 0 1
                 ", header = F)
mat1=mat   
mat1[mat1=='U']=NA  
mask=zoo::na.locf(mat1)==zoo::na.locf(mat1,fromLast=T)
mat[mask]=zoo::na.locf(mat1,fromLast=T)[mask]
mat},"apply_version"= {df <- read.table(text =
                                          "0 1 0 1 0 1 1 0
U U U U 1 0 1 1
0 1 1 0 0 1 0 1
0 1 0 1 0 1 1 0
U U U U 1 0 1 1
0 1 1 0 0 1 0 1
                 ", header = F)
  df[]<-apply(df, 2, function(x){
    #find rows with U
    us<-which(x=="U" )
    #replace U with value above (if above=below)
    x[us]<-ifelse(x[us-1]==x[us+1], x[us-1], "U")
    return(x)
  })
})

mbm

                expr       min        lq       mean    median        uq       max neval  cld
 apply_wrong_version   671.605   821.334   979.1732   910.816  1020.840  4364.250   100 a   
             forloop 11809.985 13516.258 14523.5789 14059.863 15238.531 22556.858   100    d
     na.locf_version  3754.275  4380.448  5042.3309  4631.510  5314.573  9295.415   100   c 
       apply_version   986.470  1209.878  1476.4378  1321.878  1492.742  8167.513   100  b  
  • Interesting. As the number of rows grows in your sample table. I would expect the performance of the vectorized `ifelse` statement in the `apply` method to be 100-1000 times faster than the original for-loop. Glad you have a working solution. – Dave2e Sep 28 '18 at 20:01

5 Answers5

1

I assume that you only want to replace entries in the second row if entries in the first and third row are matching.

Perhaps something like this using replace?

# Sample data (as matrix)
mat <- as.matrix(read.table(text =
    "0 1 0 1
U U U U
0 1 1 0", header = F))    

apply(mat, 2, function(x) replace(x, x[1] == x[3] & x[2] == "U", x[1]))
#     V1  V2  V3  V4
#[1,] "0" "1" "0" "1"
#[2,] "0" "1" "U" "U"
#[3,] "0" "1" "1" "0"

Or if you're having a data.frame (instead of a matrix):

# Sample data (as data.frame)
df <- read.table(text =
    "0 1 0 1
U U U U
0 1 1 0", header = F)

as.data.frame(sapply(df, function(x) replace(x, x[1] == x[3] & x[2] == "U", x[1])))
#  V1 V2 V3 V4
#1  0  1  0  1
#2  0  1  U  U
#3  0  1  1  0
Maurits Evers
  • 49,617
  • 4
  • 47
  • 68
  • Yes, this looks like what I was trying to do! It was the x[n] part that I couldn't work out! I will check and see if this gives the expected result in a subset of my real data. Thank you! – David Ashbrook Sep 28 '18 at 14:55
  • Ok: so this isn't doing quite what I want: – David Ashbrook Sep 28 '18 at 15:04
  • @DavidAshbrook Only saw your comment now (it was night time in Australia;-) yes, my solution doesn't immediately scale to larger `data.frame`s, and I thought the original 3-row `data.frame` was representative of your general issue. Anyway, glad you've got a working solution from Dave. Good on you for adding a benchmark analysis +1 – Maurits Evers Sep 29 '18 at 01:10
1

Here is a simple solution using just base R and the apply function. This solution also assumes the "U" is not in the first or last row. Also this assumes the data is stored in a dataframe.

df <- read.table(text =
           "0 1 0 1 0 1 1 0
            U U U U 1 0 1 1
            0 1 1 0 0 1 0 1
            0 1 0 1 0 1 1 0
            U U U U 1 0 1 1
            0 1 1 0 0 1 0 1", header = F)


df[]<-apply(df, 2, function(x){
  #find rows with U
  us<-which(x=="U" )
  #replace U with value above (if above=below)
  x[us]<-ifelse(x[us-1]==x[us+1], x[us-1], "U")
  return(x)
  })
Dave2e
  • 22,192
  • 18
  • 42
  • 50
  • This is the method I used, fastest and worked best. The only comment I will add was that it was necessary to add a row of 0s at the start and end: columns that begin with a U break the script otherwise. Thank you again for all your help. – David Ashbrook Oct 01 '18 at 01:45
0

In the zoo package there is a method called na.approx which will interpolate between the two values. There is also na.locf which takes the previous value. The two together can help you.

  • replace U with NAs
  • store the positions of all NAs
  • apply na.approx
  • apply na.locf
  • for those positions where the two are the same you keep the value
  • all others probably need to go back to U (or whatever you want to do in that case)

A related question can be found here: Interpolation of NAs

Ralfino
  • 39
  • 5
0

As Ral mentioned you can using zoo with na.locf

mat1=mat   
mat1[mat1=='U']=NA  
mask=zoo::na.locf(mat1)==zoo::na.locf(mat1,fromLast=T)
mat[mask]=zoo::na.locf(mat1,fromLast=T)[mask]

mat
     V1  V2  V3  V4 
[1,] "0" "1" "0" "1"
[2,] "0" "1" "U" "U"
[3,] "0" "1" "1" "0"
BENY
  • 317,841
  • 20
  • 164
  • 234
0

Using dplyr lead() and lag()

myfunc <- function(my_list) {
  mlead <- lead(my_list, default = 'U')
  mlag <- lag(my_list, default = 'U')
  valuetocopy <- (my_list == 'U') & ((mlead == mlag))
  my_list[valuetocopy] <- mlead[valuetocopy]
  return(my_list)
}
M.Viking
  • 5,067
  • 4
  • 17
  • 33