I am working with a large data set, let's call it data, and I want to create a new column, data$results, based on an existing column data$input. The results are determined by some conditional if/then logic, so my original approach was something like:
for (i in seq_len(nrow(data))) {
  # check the input value for this row and assign the corresponding result
  if (data$input[i] == "1" | data$input[i] == "2") {
    data$results[i] <- trueAnswer
  } else {
    data$results[i] <- falseAnswer
  }
}
With large data frames, this process can take several hours to run. However, if I subset the data into one data frame containing only the entries where data$input is 1 or 2 and another containing the remaining entries, I can assign trueAnswer to the first data frame and falseAnswer to the second in a single step each, then rbind the two data frames back together. This approach only takes a couple of minutes.
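For reference, a minimal sketch of that subsetting approach (assuming trueAnswer and falseAnswer are single values, and that the condition is on data$input as described above) might look like:

matched   <- data[data$input == "1" | data$input == "2", ]
unmatched <- data[!(data$input == "1" | data$input == "2"), ]
matched$results   <- trueAnswer    # assigned in one vectorized step
unmatched$results <- falseAnswer
data <- rbind(matched, unmatched)  # note: the row order changes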
Why is the latter approach using subsetting so much quicker? This process is applied over many different data sets, so the former method is too slow to be practical. I am just trying to understand what causes the lack of efficiency in the first approach.