
I have huge amounts of data to analyze. As I write my code I tend to leave spaces around operators and variable names. So the question is: in cases where efficiency is the number one priority, does the whitespace have a cost?

Is `c<-a+b` more efficient than `c <- a + b`?

user2004820

5 Answers


To a first, second, third, ..., approximation, no, it won't cost you any time at all.

The extra time you spend pressing the space bar is orders of magnitude more costly than the cost at run time (and neither matter at all).

The much more significant cost will come from any decreased readability that results from leaving out spaces, which can make code harder (for humans) to parse.
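A quick check that the two spellings parse to exactly the same expression object:

identical( quote(c<-a+b), quote(c <- a + b) )
# [1] TRUE -- the parser discards the whitespace, so only human readability differs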

Josh O'Brien

In a word, no!

library(microbenchmark)

f1 <- function(x){
    j   <- rnorm( x , mean = 0 , sd = 1 )         ;
    k   <-      j    *      2         ;
    return(    k     )
}

f2 <- function(x){j<-rnorm(x,mean=0,sd=1);k<-j*2;return(k)}


microbenchmark( f1(1e3) , f2(1e3) , times= 1e3 )
    Unit: microseconds
     expr     min       lq  median      uq      max neval
 f1(1000) 110.763 112.8430 113.554 114.319  677.996  1000
 f2(1000) 110.386 112.6755 113.416 114.151 5717.811  1000

# Even more runs and longer sampling
microbenchmark( f1(1e4) , f2(1e4) , times= 1e4 )
  Unit: milliseconds
      expr      min       lq   median       uq       max neval
 f1(10000) 1.060010 1.074880 1.079174 1.083414 66.791782 10000
 f2(10000) 1.058773 1.074186 1.078485 1.082866  7.491616 10000

EDIT

It seems that using microbenchmark this way could be unfair, because the expressions are parsed before they are ever run in the loop. However, using source should mean that on each iteration the sourced code has to be parsed and the whitespace skipped over again. So I saved the functions to two separate files, with the last line of each file being a call to the function, e.g. my file f2.R looks like this:

f2 <- function(x){j<-rnorm(x,mean=0,sd=1);k<-j*2;return(k)};f2(1e3)

And I test them like so:

microbenchmark( eval(source("~/Desktop/f2.R")) ,  eval(source("~/Desktop/f1.R")) , times = 1e3)
  Unit: microseconds
                           expr     min       lq   median      uq       max neval
 eval(source("~/Desktop/f2.R")) 649.786 658.6225 663.6485 671.772  7025.662  1000
 eval(source("~/Desktop/f1.R")) 687.023 697.2890 702.2315 710.111 19014.116  1000

And a visual representation of the difference with 1e4 replications... [plot not reproduced here]

Maybe it does make a minuscule difference in the situation where functions are repeatedly parsed, but this wouldn't happen in normal use cases.
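If you want to isolate just the parse step (rather than evaluation), a minimal sketch along these lines times parse() on the two spellings directly (the example strings are illustrative, not from the benchmark above):

library(microbenchmark)

# The same expression as text, with and without the extra whitespace
spaced   <- "j   <-  rnorm( 10 , mean = 0 , sd = 1 )  ;  k  <-  j  *  2"
unspaced <- "j<-rnorm(10,mean=0,sd=1);k<-j*2"

# Time only the parsing, never the evaluation of the result
microbenchmark(parse(text = spaced), parse(text = unspaced), times = 1e4)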

Simon O'Hanlon
  • lol, I started to do the same thing with simpler functions (1+2), but I used 10000000 replications so you got results faster. :) (somewhat surprisingly, rbenchmark gave results of 53s and 44s... in favor of the whitespace) – Jouni Helske Mar 12 '13 at 17:47
  • Are you sure that's a fair test? My strong guess is that the function bodies are only being parsed once, initially, entirely outside of the microbenchmark loop! – Josh O'Brien Mar 12 '13 at 17:48
  • TBH - I don't know, but I get the feeling you are asking rhetorically? That is a really good point. – Simon O'Hanlon Mar 12 '13 at 17:55
  • Yeah, it is rhetorical, though I've got a strong suspicion as to the answer. I'd feel completely confident if `identical(body(f1), body(f2))` (or one of many variants of it that I've tried) were `TRUE`, but it's not and I can't figure out why. Will be happy to learn more. – Josh O'Brien Mar 12 '13 at 17:58
  • well, in the case of the ones I posted above, probably because they actually are different! I forgot to include the `;` separators in `f1`, but even after including these I still get `FALSE`. – Simon O'Hanlon Mar 12 '13 at 18:08
  • Looks like they **are** identical (up to the various `srcref` attributes that are attached to the parsed function body). There must be a more elegant way to check this, but the following at least works: `a <- body(f1); b <- body(f2); attributes(a) <- attributes(b) <- NULL; identical(a,b)` – Josh O'Brien Mar 12 '13 at 18:10
  • I really like questions like this, I find I learn a lot. Thanks. I am still unsure at which point the function bodies are being parsed (lots more to learn). – Simon O'Hanlon Mar 12 '13 at 18:15
  • Me too. One hint that parsing takes place before the function is ever called is to try something like this: `j <- function (x) a <- b c <- d`. The error message you get comes from the parser, which is telling you that this is not a syntactically valid expression. – Josh O'Brien Mar 12 '13 at 18:18
  • @JoshO'Brien I wonder if saving the functions to separate files and calling them from `microbenchmark()` using `source()` would effectively force R to parse the expressions on each loop? It doesn't make a difference to the relative timings but I'll update my code to show what I mean... – Simon O'Hanlon Mar 12 '13 at 18:39
  • Here's a better way to test whether the two functions are identical: `oopts <- options(keep.source=FALSE)`, then source the two functions in, then `identical(f1, f2, ignore.environment=TRUE); options(oopts)`. (I believe `ignore.environment` is new in 2.15.3, or possibly just in R-devel.) – Josh O'Brien Mar 12 '13 at 20:33
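Putting that last check together, a minimal sketch (this assumes an R version where identical() accepts ignore.environment, i.e. 2.15.3 or later):

oopts <- options(keep.source = FALSE)   # don't attach srcref attributes when parsing

f1 <- function(x){
    j   <- rnorm( x , mean = 0 , sd = 1 )
    k   <- j * 2
    return( k )
}
f2 <- function(x){j<-rnorm(x,mean=0,sd=1);k<-j*2;return(k)}

identical(f1, f2, ignore.environment = TRUE)   # should be TRUE: same parse tree
options(oopts)                                  # restore the previous setting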

YES

But, No, not really:

TL;DR It would probably take longer just to run a script to remove the whitespace than the time you would save by removing it.

@Josh O'Brien really hit the nail on the head. But I just couldn't resist running a benchmark.

As you can see, if you are dealing with on the order of 100 MILLION lines, you will see a minuscule hindrance. HOWEVER, with that many lines there would be a high likelihood of there being at least one hotspot (if not hundreds), and simply improving the code in one of those would give you a far greater speedup than grepping out all the whitespace.

  library(microbenchmark)

  microbenchmark(LottaSpace = eval(LottaSpace), NoSpace = eval(NoSpace), NormalSpace = eval(NormalSpace), times = 100)   # and again with times = 10000

  @ 100 times;  Unit: microseconds
           expr   min     lq median     uq    max
  1  LottaSpace 7.526 7.9185 8.1065 8.4655 54.850
  2 NormalSpace 7.504 7.9115 8.1465 8.5540 28.409
  3     NoSpace 7.544 7.8645 8.0565 8.3270 12.241

  @ 10,000 times;  Unit: microseconds    
           expr   min    lq median    uq      max
  1  LottaSpace 7.284 7.943  8.094 8.294 47888.24
  2 NormalSpace 7.182 7.925  8.078 8.276 46318.20
  3     NoSpace 7.246 7.921  8.073 8.271 48687.72

WHERE:

  LottaSpace <- quote({
        a            <-            3
        b                  <-                  4   
        c         <-      5
        for   (i            in      1:7)
              i         +            i
  })


  NoSpace <- quote({
  a<-3
  b<-4
  c<-5
  for(i in 1:7)
  i+i
  })

  NormalSpace <- quote({
   a <- 3
   b <- 4 
   c <- 5
   for (i in 1:7)
   i + i
  })
Ricardo Saporta

The only part this can affect is the parsing of the source code into tokens. I can't imagine that the difference in parsing time would be significant. However, you can eliminate this aspect by compiling the functions using the `compile` or `cmpfun` functions of the `compiler` package. Then the parsing is only done once and any whitespace difference cannot affect execution time.
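A minimal sketch of that approach (the function names are just illustrative; compiler ships with R and microbenchmark is on CRAN):

library(compiler)
library(microbenchmark)

f_spaced  <- function(x) { j  <-  rnorm( x , mean = 0 , sd = 1 ) ;  k  <-  j  *  2 ;  return( k ) }
f_compact <- function(x){j<-rnorm(x,mean=0,sd=1);k<-j*2;return(k)}

# Byte-compile both; any whitespace was already discarded at parse time,
# and after compilation both run the same kind of byte code
cf_spaced  <- cmpfun(f_spaced)
cf_compact <- cmpfun(f_compact)

microbenchmark(cf_spaced(1e3), cf_compact(1e3), times = 1e3)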

Brian Diggs
  • I'm pretty sure a function definition is only *parsed* once, whether or not you `compile` the code. The `compiler` package speeds code up by avoiding repeated *compilation* of functions, right? – Josh O'Brien Mar 12 '13 at 18:27
  • @JoshO'Brien That might be right. I am not sure. If compiled, it won't be parsed again, but if not compiled, I am not sure if it will be parsed again or not. – Brian Diggs Mar 12 '13 at 18:31

There should be no difference in performance, although:

fn1<-function(a,b) c<-a+b
fn2<-function(a,b) c <- a + b

library(rbenchmark)

> benchmark(fn1(1,2),fn2(1,2),replications=10000000)
       test replications elapsed relative user.self sys.self user.child
1 fn1(1, 2)     10000000   53.87    1.212      53.4     0.37         NA
2 fn2(1, 2)     10000000   44.46    1.000      44.3     0.14         NA

The same with microbenchmark:

Unit: nanoseconds
      expr min  lq median  uq      max neval
 fn1(1, 2)   0 467    467 468 90397803 1e+07
 fn2(1, 2)   0 467    467 468 85995868 1e+07

So the first result was bogus..

Jouni Helske