
My question is very simple. I have a data frame with more than 100 columns and various numbers in each row. The first column is always a nonzero number. I want to replace every nonzero number in each row (excluding the first column) with the first number in that row, i.e. the value in the first column.

I was thinking along the lines of an ifelse inside a for loop that iterates through the rows, but there must be a simpler, vectorised way to do it...

3 Answers


Since your data is not that big, I suggest you use a simple loop:

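for (i in seq_len(nrow(mydata)))
{
  for (j in 2:ncol(mydata))   # start at column 2 so the first column stays intact
  {
    # keep zeros; replace any nonzero cell with the first value of its row
    mydata[i, j] <- ifelse(mydata[i, j] == 0, 0, mydata[i, 1])
  }
}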
MFR
  • Thank you for the answer. But the dataset is actually very big and I am looking for a more vectorised / R way of doing this. Also, in your solution wouldn't the first column data be replaced as well? I need the first column to stay intact. – Ioannis Baltzakis Aug 03 '16 at 23:58
  • And it should be mydata[i,1] instead of mydata[1,j] in the end of the ifelse if I am not mistaken – Ioannis Baltzakis Aug 04 '16 at 00:06
  • Sorry for the mistake. It's mainly because of multitasking at this moment :) Hope the new changes solve your second problem. I agree that this is not the most efficient way to solve this problem. I'm interested in seeing others' answers to see how they approach it. – MFR Aug 04 '16 at 00:12

Another approach is to use sapply over the columns, which avoids looping over every individual cell and should be faster than the nested loop. Assuming your data is in a data frame df:

df[,-1] <- sapply(df[,-1], function(x) {ind <- which(x!=0); x[ind] = df[ind,1]; return(x)})

Here, we are applying the function over each and all columns of df except for the first column. In the function, x is each of these columns in turn:

  1. First find the row indices of the column that are nonzero using which.
  2. Set these rows in x to the corresponding values in the rows of the first column of df.
  3. Return the column.

Note that the operations in the function are all "vectorized" over the column. That is, there is no explicit loop over the rows of the column. The result from sapply is a matrix of the processed columns, which replaces all columns of df that are not the first column.
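To make the steps above concrete, here is a minimal sketch on a toy data frame. The data and the column names (val, a, b) are made up for illustration; only the sapply call itself comes from the answer.

```r
# Hypothetical toy data: the first column `val` is nonzero,
# the other columns mix zeros and nonzeros.
df <- data.frame(val = c(10, 20, 30),
                 a   = c(1, 0, 5),
                 b   = c(0, 2, 0))

# For each non-first column, overwrite nonzero entries with the same row's `val`.
df[, -1] <- sapply(df[, -1], function(x) {
  ind <- which(x != 0)   # rows where this column is nonzero
  x[ind] <- df[ind, 1]   # first-column values for exactly those rows
  x                      # return the processed column
})

print(df)
```

Zeros stay zero, and every nonzero cell in a and b takes the val of its own row. The first column is never passed to the function, so it is left untouched.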

See this for an excellent review of the *apply family of functions.

Hope this helps.

aichao
  • Excellent. Thank you. Just out of curiosity, couldn't we use apply to do the same over each row instead of each column? – Ioannis Baltzakis Aug 04 '16 at 00:36
  • `apply` is for applying a function across some dimension of an array. See [this SO answer](http://stackoverflow.com/questions/3505701/r-grouping-functions-sapply-vs-lapply-vs-apply-vs-tapply-vs-by-vs-aggrega) for a good review of the `*apply` family of functions. – aichao Aug 04 '16 at 00:40
  • Seems like this doesn't do what I wanted but it's just a case of changing which==0 to which!=0. Remember I want to change all **nonzeros** to the first number of each row. Posting from my iPad so didn't try it yet – Ioannis Baltzakis Aug 04 '16 at 01:39
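As for the row-wise question raised in the comments, a sketch with apply over rows could look like the following. This is not from any of the answers; it assumes every column is numeric (apply coerces the data frame to a matrix first), and the toy data and the names res and out are hypothetical.

```r
# Hypothetical toy data, all numeric.
df <- data.frame(val = c(10, 20, 30),
                 a   = c(1, 0, 5),
                 b   = c(0, 2, 0))

# apply(df, 1, ...) hands the function one row at a time as a numeric vector.
res <- t(apply(df, 1, function(r) {
  nz <- r != 0      # which entries of the row are nonzero
  nz[1] <- FALSE    # never touch the first column
  r[nz] <- r[1]     # replace the remaining nonzeros with the row's first value
  r
}))

# apply() returns one column per row, hence the t(); convert back to a data frame.
out <- as.data.frame(res)
names(out) <- names(df)
print(out)
```

This calls the function once per row, so for a data frame with many rows and comparatively few columns it is likely to be slower than the column-wise version.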

Suppose your data frame is dat. Here is a fully vectorized solution:

mat <- as.matrix(dat[, -1])
pos <- which(mat != 0)
mat[pos] <- rep(dat[[1]], times = ncol(mat))[pos]
new_dat <- "colnames<-"(cbind.data.frame(dat[1], mat), colnames(dat))

Example

set.seed(0)
dat <- "colnames<-"(cbind.data.frame(1:5, matrix(sample(0:1, 25, TRUE), 5)),
                    c("val", letters[1:5]))
#  val a b c d e
#1   1 1 0 0 1 1
#2   2 0 1 0 0 1
#3   3 0 1 0 1 0
#4   4 1 1 1 1 1
#5   5 1 1 0 0 0

My code above gives:

#  val a b c d e
#1   1 1 0 0 1 1
#2   2 0 2 0 0 2
#3   3 0 3 0 3 0
#4   4 4 4 4 4 4
#5   5 5 5 0 0 0
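The reason rep(dat[[1]], times = ncol(mat)) lines up with the positions returned by which is that R stores matrices column by column. A small sketch of that alignment, using a hypothetical 3 x 2 matrix and a vector first standing in for dat[[1]]:

```r
# Hypothetical example; `first` plays the role of dat[[1]].
first <- c(10, 20, 30)
mat   <- matrix(c(1, 0, 5,    # column 1
                  0, 2, 0),   # column 2
                nrow = 3)

# Linear indices run down column 1 (positions 1..3), then column 2 (positions 4..6).
pos <- which(mat != 0)

# Repeating `first` once per column puts each row's first value
# at every linear position belonging to that row.
recycled <- rep(first, times = ncol(mat))

mat[pos] <- recycled[pos]
print(mat)
```

Here pos is 1, 3, 5 and recycled is 10, 20, 30, 10, 20, 30, so the nonzero cells receive 10, 30 and 20 respectively, which are the first values of rows 1, 3 and 2.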

You want a benchmark?

set.seed(0)
n <- 2000  ## use a 2000 * 2000 matrix
dat <- "colnames<-"(cbind.data.frame(1:n, matrix(sample(0:1, n * n, TRUE), n)),
                    c("val", paste0("x",1:n)))

## have to test my solution first, as aichao's solution overwrites `dat`

## my solution
system.time({mat <- as.matrix(dat[, -1])
            pos <- which(mat != 0)
            mat[pos] <- rep(dat[[1]], times = ncol(mat))[pos]
            "colnames<-"(cbind.data.frame(dat[1], mat), colnames(dat))})
#   user  system elapsed 
#  0.352   0.056   0.410 

## solution by aichao
system.time(dat[,-1] <- sapply(dat[,-1], function(x) {ind <- which(x!=0); x[ind] = dat[ind,1]; x}))
#   user  system elapsed 
#  7.804   0.108   7.919 

My solution is roughly 20 times faster (0.41 s versus 7.92 s elapsed)!
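Before trusting a timing comparison, it is worth checking that both approaches produce the same values. A minimal sketch on a smaller, hypothetical data set (the seed, the size n = 50 and the names res_matrix and res_sapply are assumptions for illustration):

```r
set.seed(1)
n <- 50
dat <- data.frame(val = seq_len(n),
                  matrix(sample(0:1, n * n, TRUE), n))

# Matrix-based approach
mat <- as.matrix(dat[, -1])
pos <- which(mat != 0)
mat[pos] <- rep(dat[[1]], times = ncol(mat))[pos]
res_matrix <- mat

# Column-wise sapply approach, stored separately so `dat` is not overwritten
res_sapply <- sapply(dat[, -1], function(x) {
  ind <- which(x != 0)
  x[ind] <- dat[ind, 1]
  x
})

# TRUE if every cell agrees
all(res_matrix == res_sapply)
```

Storing the sapply result in its own object also removes the ordering constraint in the benchmark, since dat is left unchanged.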

Zheyuan Li
  • Didn't try to reproduce and understand your code yet but results are not what I want. I want non-zeros to be given the value of the first number in each row, your solution changes zeros to the first number – Ioannis Baltzakis Aug 04 '16 at 01:22
  • I accept a solution that is easy for me to understand, and @aichao was kind enough to provide a thorough explanation of the workings of his code. To me as a beginner that is more important than having the absolute best performance. This is not a race to the end, but an exercise in learning more about R. – Ioannis Baltzakis Aug 04 '16 at 08:46