I am working with a large data set, let's call it data, and want to create a new column, let's call it data$results, based on some column data$input. The results come from conditional if/then logic, so my original approach was something like:

# Loop over row indices and test the input column one row at a time
for (i in seq_len(nrow(data))) {
    data$results[i] <- if (data$input[i] == "1" || data$input[i] == "2") {
        trueAnswer
    } else {
        falseAnswer
    }
}

With large data frames, this process might take several hours to run. However, if I subset the data into one data frame containing only the entries where data$input is 1 or 2 and another containing the rest, I can apply trueAnswer to the first and falseAnswer to the second, then rbind the two back together. This approach only takes a couple of minutes.
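
The subsetting idea described above can be sketched as follows. This is a minimal illustration only: the toy data frame and the values of trueAnswer and falseAnswer are made-up assumptions, and it uses logical indexing on a single data frame, which avoids the split and rbind steps.

```r
# Toy data; the column values and the two answers are illustrative assumptions
data <- data.frame(input = c("1", "3", "2", "7"), stringsAsFactors = FALSE)
trueAnswer <- "hit"
falseAnswer <- "miss"

# One logical vector marks every row where input is "1" or "2"
is_match <- data$input %in% c("1", "2")

# Assign to all matching rows at once, then to all remaining rows
data$results <- NA_character_
data$results[is_match] <- trueAnswer
data$results[!is_match] <- falseAnswer

print(data)
```

Each assignment touches a whole set of rows in one step, so there is no per-row iteration in R code.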

Why is the latter approach using subsetting so much quicker? This process is applied over many different data sets, so the former method is too slow to be practical. I am just trying to understand what is causing the lack of efficiency in the first approach.

Maurits Evers
niccalis

2 Answers

It is always advisable to provide a fully reproducible & minimal example with sample data. That way we can provide specific help based on your sample data.

In a lot of cases, explicit for loops can be avoided in R in favour of optimised vectorised operations. For example, ifelse is one such vectorised function.

Generally the dplyr syntax would be something like this:

library(dplyr)  # dplyr also provides the %>% pipe
# Assign the result back, otherwise the new column is only printed
data <- data %>%
    mutate(results = ifelse(input == 1 | input == 2, "1 or 2", "Neither 1 nor 2"))

Update

To see how ifelse is vectorised, take a look at ?ifelse.

Value:

A vector of the same length and attributes (including dimensions and ‘"class"’) as ‘test’ and data values from the values of ‘yes’ or ‘no’. [...]

So, in other words, if ifelse evaluates 100 conditions, the returned object will have length 100.

This may lead to the following perhaps surprising/unexpected results:

ifelse(c(TRUE), c(100, 200), c(300, 400))
#[1] 100

The return object is element 1 of c(100, 200) because the logical condition has length 1.

ifelse(c(TRUE, TRUE, TRUE), c(100, 200), c(300, 400))
#[1] 100 200 100

The return object has length 3 because the logical condition has length 3; since c(100, 200) only has two elements, R needs to recycle entries.
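
The same recycling is what makes the common case work, where both answers are single values. A small sketch, assuming a made-up input vector and made-up labels:

```r
# Illustrative input vector (an assumption, not the asker's data)
input <- c(1, 5, 2, 9)

# The test has length 4; the two length-1 answers are recycled to match it
res <- ifelse(input %in% 1:2, "1 or 2", "Neither 1 nor 2")

print(res)
print(length(res))
```

The result has one entry per element of the test, with each length-1 answer repeated wherever it applies.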

Maurits Evers

R's efficiency is designed around vectors, not loops. It is very rare (although it does happen) that a for or while loop is the best way to tackle a problem. In your case, you would do better to use the vectorized version of if/else: ifelse. It takes a vector of tests (e.g. input %in% 1:2) and two vectors of possible responses, one for each outcome of the test. The result always has the same length as the test. If a response vector is shorter than the test, it is recycled to the proper length rather than raising an error, so a response of length 1 is simply repeated for every row. Here, it would look like this:

data$results <- ifelse(data$input %in% 1:2, trueAnswer, falseAnswer)
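
To get a rough feel for the difference, the two approaches can be timed side by side. This is only a sketch on simulated data: the data size, the seed, and the two answer values are assumptions, and actual timings will vary by machine.

```r
# Simulated data; size, seed and answers are illustrative assumptions
set.seed(1)
n <- 1e5
data <- data.frame(input = sample(1:5, n, replace = TRUE))
trueAnswer <- "1 or 2"
falseAnswer <- "neither"

# Explicit loop: one interpreted iteration per row
loop_results <- character(n)
loop_time <- system.time(
  for (i in seq_len(n)) {
    loop_results[i] <- if (data$input[i] %in% 1:2) trueAnswer else falseAnswer
  }
)

# Vectorised: a single call that operates on whole vectors at once
vec_time <- system.time(
  vec_results <- ifelse(data$input %in% 1:2, trueAnswer, falseAnswer)
)

# Both approaches give the same answer; only the running time differs
stopifnot(identical(loop_results, vec_results))
print(loop_time)
print(vec_time)
```

The results are identical, so switching to ifelse changes only the running time, not the output.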
Melissa Key
  • Great, thank you. You and Maurits essentially said the same thing. One piece I am not quite understanding is why ifelse is considered a vectorized operation. Presumably ifelse will have to go through row by row making logical comparisons, so how is it different from me explicitly looping through the data myself? – niccalis Apr 15 '18 at 23:24
  • Check out http://alyssafrazee.com/2014/01/29/vectorization.html - first google link that seems to address it. It's a good question, but (so far) I've just known *to* do it, not *why*. – Melissa Key Apr 15 '18 at 23:26
  • @niccalis Take a look at `?ifelse`; it explains that it returns a *vector* of the same length as the logical test condition. So in other words if you test 100 conditions, it will return a vector of length 100. Hence "vectorised". – Maurits Evers Apr 15 '18 at 23:28
  • Ok, the answer (from that site) turns out to be one I *should have* been able to answer on my own. R is an interpreted language, not compiled. Vectorized functions use loops implemented in a compiled language, ergo they are much faster. – Melissa Key Apr 15 '18 at 23:32
  • @niccalis I've added some further explanations on the vectorised nature of `ifelse` in my answer below. – Maurits Evers Apr 16 '18 at 01:05