How can I achieve the same as below without using a for-loop?

df1 = data.frame( val = c("a", "c", "c", "b", "e") )  

m1 = matrix(0, nrow=nrow(df1), ncol=length( c("a", "b", "c", "d", "e") ) )
colnames(m1) = c("a", "b", "c", "d", "e")

for(i in 1:nrow(df1)){
  m1[i, df1[i, 1] ] = 1  #For each entry in dataframe, mark the respective column as 1
}
giliev
Pubs

2 Answers

4

This

f <- function(m1, df) {
  # Loop version: set one cell per row of df (uses the df argument, not the global df1)
  for (i in seq_len(nrow(df)))
    m1[i, df[i, 1]] <- 1
  return(m1)
}

is equivalent to

g <- function(m1, df) {
  # Matrix indexing: each (row, column) pair of the two-column index selects one cell
  m1[cbind(seq_len(nrow(df)), df[, 1])] <- 1
  return(m1)
}

The latter is faster for this particular example

> microbenchmark(f(m1,df1),g(m1,df1))
Unit: microseconds
       expr     min      lq      mean  median      uq     max neval cld
 f(m1, df1) 167.085 174.885 194.58999 185.969 200.132 342.379   100   b
 g(m1, df1)  20.116  22.990  27.12403  24.222  27.300 158.053   100  a 

Note, however,

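  • both index by the factor's integer level codes rather than by the character column names (this assumes df1$val is a factor, which was the data.frame() default before R 4.0.0)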
  • you should code what is clearest rather than what is fastest unless and until you identify a true bottleneck
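
If the goal is to mark the column whose name equals each value, one way to avoid relying on factor level codes is to look up the column positions by name. This is only a minimal sketch: the names vals, cols, m and idx are made up for illustration, and it assumes every value actually appears among the column labels.

```r
# Assumption: values are plain character strings and each one appears in `cols`.
vals <- c("a", "c", "c", "b", "e")
cols <- c("a", "b", "c", "d", "e")

m <- matrix(0, nrow = length(vals), ncol = length(cols),
            dimnames = list(NULL, cols))

# match() returns the position of each value within the column labels
idx <- cbind(seq_along(vals), match(vals, cols))

# Two-column matrix index: one (row, column) pair per cell to set
m[idx] <- 1
```

With this lookup the last row marks column "e" rather than column "d", because the position comes from the label itself and not from the order of the factor levels.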
A. Webb
  • Note: `identical(f(m1,df1), g(m1,df1))` returns `[1] TRUE` – Matthew Lundberg Sep 11 '15 at 18:22
  • +1 for "you should code what is clearest rather than what is fastest unless and until you identify a true bottleneck" – blep Sep 11 '15 at 18:24
  • Knuth - "Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%." – Shawn Mehan Sep 11 '15 at 18:27
  • To clarify that first bullet point, the behavior of both alternatives is the same. Neither the original `for` loop nor the alternative sets the column named "e", because the levels of `df[,1]` here skip "d". – A. Webb Sep 11 '15 at 18:38
0

Several things in your code are unusual. First, df1 is not needed at all: a data.frame is overkill for a single one-dimensional vector, and val = c("a", "c", "c", "b", "e") is enough. Also, as others have suggested, there are more compact (and sometimes more efficient) ways to achieve the same thing. However, if your actual problem involves much more data and you find it easier to write for loops, consider writing them in C++ (via Rcpp), where a for loop is much faster.
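
As a rough illustration of the plain-vector, loop-free route mentioned above, here is one compact possibility. It is only a sketch: the name cols is an assumption standing in for the full set of column labels, and outer() is just one of several ways to do this.

```r
val  <- c("a", "c", "c", "b", "e")
cols <- c("a", "b", "c", "d", "e")   # assumed full set of column labels

# outer() compares every value with every label, giving a logical matrix;
# multiplying by 1 converts TRUE/FALSE to 1/0
m <- outer(val, cols, "==") * 1
colnames(m) <- cols
```

Each row ends up with a single 1 in the column whose label equals that row's value.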

Here is a benchmark I ran to compare R and C++ for loops, using a function that sums the first n integers (I tested with n = 100K).

Here is the code:

library(Rcpp)
library(rbenchmark)

cppFunction(
  'double cppSum(int n) {
    // Accumulate in double: the sum for n = 100000 (5000050000) overflows a 32-bit int
    double s = 0;
    for (int i = 1; i <= n; i++) {
      s += i;
    }
    return s;
  }'
)

rSum <- function(n) {
  s = 0
  # seq_len(n) is empty for n = 0, whereas 1:n would iterate over c(1, 0)
  for (i in seq_len(n)) {
    s = s + i
  }
  return(s)
}

n = 100000
benchmark(rSum(n), cppSum(n))

And here is the result:

       test replications elapsed relative user.self sys.self user.child sys.child
2 cppSum(n)          100   0.008     1.00      0.00        0          0         0
1   rSum(n)          100   2.790   348.75      2.79        0          0         0

The relative column shows that the R function is 348.75 times slower than the C++ function. In computationally intensive code, moving a loop to C++ can be a great optimization. I once had a for loop running inside another loop that seemed to take forever; after I replaced the R for with a C++ for, it finished in a couple of minutes.
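
For this particular toy sum, a vectorized built-in is another option worth trying before reaching for C++. The sketch below only checks that the two approaches agree and makes no timing claim; rSumLoop and rSumVec are hypothetical names introduced for illustration.

```r
n <- 100000

# Loop version, equivalent to the R loop benchmarked above
rSumLoop <- function(n) {
  s <- 0
  for (i in seq_len(n)) s <- s + i
  s
}

# Vectorized version: as.numeric() avoids integer overflow for large n
rSumVec <- function(n) sum(as.numeric(seq_len(n)))

# Both agree with the closed form n * (n + 1) / 2
stopifnot(rSumLoop(n) == rSumVec(n))
```

Whether this helps in a real problem depends on whether the loop body can be expressed with vectorized operations at all; when it cannot, a compiled loop remains a reasonable choice.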

[Edit] This example does not solve your actual problem. The original question asked for an alternative to the slow R for loop, so I suggested a faster one: the C++ for loop. The working example does not use your data because it is too small to benchmark. Instead, I use a loop with 100K iterations so that the difference between the two loops is visible.

giliev
  • Does this actually answer the question? – Matthew Lundberg Sep 11 '15 at 22:07
  • If I remember correctly, the original question was about an alternative to the slow R for loop. In the meantime there has been another edit, so here I posted an answer to the original question. – giliev Sep 12 '15 at 07:02
  • And I was giving a general example of how Rcpp is used, plus a comparison of R and C++ for performance on a simple example. I do not see why this is not a relevant answer. What should I do, copy and paste his example and work on it? In my opinion it is better to share some general experience and practices (giving some guidance) than to solve exactly the OP's current problem. That way he gets more general knowledge instead of copy-pasting the posted solution. – giliev Sep 12 '15 at 07:09
  • There are better questions for such general answers. For example http://stackoverflow.com/questions/7142767/why-are-loops-slow-in-r or http://stackoverflow.com/questions/2908822/speed-up-the-loop-operation-in-r. – Matthew Lundberg Sep 12 '15 at 13:34
  • Ok. You are right. However, that was the question when I answered it. In the meantime it was modified drastically, and IMO that is not a good edit on SO. I spent my time creating a working example, and now my answer does not answer the current question. – giliev Sep 12 '15 at 13:41