
Let's say we have a data frame with two variables, a and b...

RawData <- data.frame( a = rnorm( 10 ), b = rnorm( 10 ) )

...and we want to define a new variable, c, that is the sum of a and b.

I can think of four ways to do this (at least in base R, without any libraries):

  1. RawData$c1 <- RawData$a + RawData$b or (see the comment of @alistaire and the answer of @42-) RawData[[ "c1" ]] <- RawData[[ "a" ]] + RawData[[ "b" ]]
  2. RawData <- transform( RawData, c2 = a + b )
  3. RawData <- within( RawData, { c3 = a + b } )
  4. RawData$c4 <- with( RawData, a + b )

Of course all four give an identical result (e.g. identical( RawData$c1, RawData$c2 ) is TRUE), so the question is whether there is any objective reason to prefer one over the others, or whether it is purely a matter of taste...?
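
For example, a quick way to check all of them at once (a minimal sketch; identical() compares exactly two objects, so the check runs pairwise against c1):

all( sapply( c( "c2", "c3", "c4" ),
             function( col ) identical( RawData[[ "c1" ]], RawData[[ col ]] ) ) )
# [1] TRUE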

Solution #1 is a bit superfluous as RawData is written three times, but it is still perhaps the easiest to type with auto-completion (e.g. in RStudio), especially if the variable names are long.

Tamas Ferenci
  • `RawData$c <- with(RawData,a + b)` is what I would do. Having said that, this strikes me as too opinion-based to be a very useful question. – John Coleman Jun 25 '17 at 18:07
  • @JohnColeman : Thanks, I added this option! "this strikes me as too opinion-based to be a very useful question". Actually, this _is_ the question: whether it is completely opinion-based, or there are objective, rational aspects to decide this...? – Tamas Ferenci Jun 25 '17 at 18:16
  • All are fine, though I have a mild preference for 1 or 4. Really, most people are probably doing `library(dplyr); RawData <- RawData %>% mutate(c5 = a + b)` or `library(data.table); setDT(RawData)[, c6 := a + b][]`, though. – alistaire Jun 25 '17 at 18:17
  • @alistaire: Thanks for your comment! Yes, I know they're fine (in a sense that all work), but the question - sorry for repeating myself - is whether this is completely opinion-based, or there are objective, rational aspects to decide this...? Also, a problem with your suggested solutions, just like with #2, #3 and #4 from base R, is that you can't use the power of auto-completion! (That's what I call objective criteria, but I don't know whether there is anything else similar.) – Tamas Ferenci Jun 25 '17 at 18:22
  • @TamasFerenci Fair enough. I don't know enough about R internals to say if there is any objective difference. – John Coleman Jun 25 '17 at 18:22
  • There's effectively no difference. There is a tiny overhead to various lookup methods, but it will never be the bottleneck in your code. There is some criticism of `$` in programmatic environments because it does partial matching, so `RawData[['a']] + RawData[['b']]` is safer for code where you don't control the variable names. And yes, autocompletion works for dplyr and data.table (in RStudio, at least). – alistaire Jun 25 '17 at 18:38
  • @alistaire: "There's effectively no difference." That's what I thought, thanks for confirming! "And yes, autocompletion works for dplyr and data.table (in RStudio, at least)." Wow, I didn't know that! If you call the second variable `bcdef` then hitting Tab at `RawData[ , c := a + bc` indeed completes it. Pretty interesting it doesn't work for the base R calls (hitting Tab at `transform( RawData, c = a + bc` does nothing). I don't know its reason though, is it RStudio-related...? Thanks again! – Tamas Ferenci Jun 25 '17 at 18:46
  • [Completion is an RStudio thing](https://support.rstudio.com/hc/en-us/articles/205273297-Code-Completion), though other editors and IDEs have other versions. – alistaire Jun 25 '17 at 18:52
  • @JohnColeman: I have attempted to construct an argument that shows there might be more than mere preference in answering a "coding style" question, on the basis of a) potential barriers to flexibility in one (or in this case _all_) of the suggested "styles", or b) potential errors when the language is used for programming. – IRTFM Jun 25 '17 at 19:35
  • I'm voting to reopen the question since I think that OP adequately distinguished the question from one which is primarily opinion based. Furthermore, the two answers that have been given are fairly useful. – John Coleman Jun 26 '17 at 19:21
  • @alistaire : You could rework this - effectively no difference (but see the answer of sconfluentus ), and how auto-completion works in RStudio - into a full-scale answer, these are very relevant, and worth more than a comment (I think). – Tamas Ferenci Jun 28 '17 at 10:30

2 Answers


I agree with @alistaire that there will be little difference when interacting with the console, but there is a difference when such code goes inside programs, and in that setting it is his suggestion to use "[[" that should be understood and, I would argue, preferred over any of the 4 methods cited. The reason: "[[" lets you substitute a name to be evaluated at run time, which does not succeed with "$" or with the other methods. Example code:

 my_name1 <- "a"
 my_name2 <- "b"

> RawData$c1 <- RawData$my_name1 + RawData$my_name2   # Fails
Error in `$<-.data.frame`(`*tmp*`, c1, value = integer(0)) : 
  replacement has 0 rows, data has 10
> RawData$c1 <- RawData[[my_name1]] + RawData[[my_name2]]   # Succeeds

You can also use "[[" to make the name of the new column a runtime specification, unlike the use of "$":

> my_new_name <-  "xyz"
> RawData[[my_new_name]] <- RawData[[my_name1]] + RawData[[my_name2]]

> names(RawData)
[1] "a"   "b"   "c1"  "xyz"

The other three have the same sort of deficiency:

> RawData$c1 <- with( RawData,  my_name1 + my_name2)
Error in my_name1 + my_name2 : non-numeric argument to binary operator
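
For instance, transform() trips over the same problem, because my_name1 and my_name2 evaluate to the strings "a" and "b" rather than to the columns (a sketch; the exact error text may differ slightly):

> RawData <- transform( RawData, c2 = my_name1 + my_name2 )
Error in my_name1 + my_name2 : non-numeric argument to binary operator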

The lesson to take away is that "$" is merely a crippled version of "[[". The other lesson (sketched below) is that all three of with, within and transform are only "certified-safe" for use at the console, and should not be used in programming either. That is a more subtle lesson, since the errors that may or may not result will not be immediately apparent. All three suffer from non-standard-evaluation concerns that arise when unquoted symbols get passed around, especially when they are not named distinctively, as can happen when the programmer uses single-letter tokens. See this highly appreciated SO answer about another commonly used function that uses non-standard evaluation: Why is [ better than subset?
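
A minimal sketch of how this bites inside a function (the helpers add_col and add_col2 are hypothetical, just for illustration):

# Column names arrive as character strings at run time.
add_col <- function(df, x, y) {
  # Inside transform(), x and y are NOT looked up as columns of df;
  # they remain the strings "a" and "b", so "a" + "b" fails.
  transform(df, new = x + y)
}
add_col(RawData, "a", "b")
# Error in x + y : non-numeric argument to binary operator

add_col2 <- function(df, x, y) {
  df[["new"]] <- df[[x]] + df[[y]]   # "[[" happily takes the run-time strings
  df
}
head(add_col2(RawData, "a", "b"))    # works as intended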

IRTFM
  • Thanks! I edited #1 to take this into account. (That's the only one that can be "saved" in this respect from the four examples, if I'm correct.) – Tamas Ferenci Jun 28 '17 at 10:28

From a pragmatic point of view, it really does not matter all that much: they all get the job done in the way you are using them. (There are ways in which some of these might fail when used inside a function or loop, but in a script, as written here, they seem equal.)

From a computational standpoint, they are slightly more or less efficient, and once the data become big that difference becomes meaningful.

You can test this.

Because 10 rows is computationally insignificant, I extended your data.frame out a bit, as follows:

df<- cbind(a=rnorm(1000000), b= rnorm(1000000))
RawData<-data.frame(df)

Running each with system.time, you get the following:

 system.time(RawData$c1 <- RawData$a + RawData$b , gcFirst = TRUE)
   user  system elapsed 
  0.008   0.001   0.009 
 system.time(RawData <- transform( RawData, c2 = a + b ),gcFirst = TRUE)
   user  system elapsed 
  0.008   0.001   0.009 
 system.time(RawData <- within( RawData, { c3 = a + b } ),gcFirst = TRUE)
   user  system elapsed 
  0.010   0.005   0.014 
 system.time(RawData$c4 <- with( RawData, a + b ), gcFirst = TRUE)
   user  system elapsed 
  0.006   0.004   0.010 

Then I added another TWO zeros.

df<- cbind(a=rnorm(100000000), b= rnorm(100000000))
RawData<-data.frame(df)

Then I reran the computations AND WAITED A VERY LONG TIME... a very, very long time. (I sent this series of tasks to a very fast machine before any answers were posted here this morning.) Look at the elapsed time, the system time and the user time.

Clearly different methods have computational consequences when data gets large, and we are looking at simple tasks.

#The fastest method
system.time(RawData$c1 <- RawData$a + RawData$b , gcFirst = TRUE)
    user   system  elapsed 
   5.542  244.188 3271.741 
# The slowest method
system.time(RawData <- within( RawData, { c3 = a + b } ),gcFirst = TRUE)
    user   system  elapsed 
   9.031  207.036 3794.536 

These times are with all other applications closed, a clear environment and garbage collection between events!

Clearly, how you do it matters. The question becomes: at what point do you start worrying about this sort of efficiency? Adding two zeros takes the computation from fractions of a second to 54 and 63 minutes of elapsed time for a single simple addition. Imagine if the math were more complex.

I would suspect that if you took 42's advice and used [[ you could improve performance even more....
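
One way to test that suspicion is the microbenchmark package (a sketch; the labels dollar and bracket are just for readability, and results will vary with machine and data size):

library(microbenchmark)

# Compare "$" and "[[" on the same simple addition (use the smaller data frame
# unless you have a lot of time on your hands).
microbenchmark(
  dollar  = { RawData$c1        <- RawData$a        + RawData$b },
  bracket = { RawData[[ "c5" ]] <- RawData[[ "a" ]] + RawData[[ "b" ]] },
  times = 100
)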

sconfluentus
  • +1 for giving data. You could also use the microbenchmark package: `microbenchmark(RawData$c1 <- RawData$a + RawData$b,RawData <- transform( RawData, c2 = a + b ),RawData <- within( RawData, { c3 = a + b } ),RawData$c4 <- with( RawData, a + b ))` for your first `df` confirms that there is very little practical difference, although for even smaller examples (e.g. 100 rows), the first and fourth emerge as clearly better than the others. Permuting the order of the expressions inside `microbenchmark` doesn't change the results, so it isn't a GC artifact. – John Coleman Jun 26 '17 at 15:08