Welcome to SO.

I'd suggest looking at the `dplyr` and `data.table` packages, both of which focus on fast and memory-efficient implementations. In particular, I'd suggest the amazing answers to this question, which give a good understanding of what these two packages are capable of. `data.table` does tend to outperform `dplyr` as the number of groups and repeated subsets grows, as it utilizes indexed and keyed subsets, but for most uses it comes down to preference. Focusing on subsetting, I'll provide a reproducible example and some speed comparisons.
## Reproducible example

```r
set.seed(1)
df <- data.frame(group = sample(LETTERS, 1e7, TRUE),
                 random_numbers = rnorm(1e7),
                 random_binaries = rbinom(1e7, 1, 0.3))
# size = 152.6 MiB
format(object.size(df), units = "MiB")
```
## Methods

### Base-R methods

In base R, subsetting can be performed in a myriad of ways. One is the form you have shown yourself, `df[df$group == .., ]`. An alternative is the `subset` function; however, this is a utility function focused on readability rather than speed, and it will usually perform worse. One may also use the `which` function to convert a logical vector into indices, which can improve performance. Examples of all three are given below.
```r
df[df$group == "C", ]
# Equivalent
df[which(df$group == "C"), ]
# Equivalent
subset(df, group == "C")
```
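The logical and `which`-based forms are not always interchangeable, by the way. A small sketch (with hypothetical toy data, not the `df` above) of a subtle difference worth knowing: logical subsetting keeps rows where the condition evaluates to `NA` (as all-`NA` rows), while `which` drops them, since `NA` is never converted to an index.

```r
# Toy data with a missing group value
toy <- data.frame(group = c("A", NA, "C"), x = 1:3)

# Logical subsetting: the NA row is kept, filled with NAs
nrow(toy[toy$group == "C", ])        # 2

# which() drops the NA before indexing
nrow(toy[which(toy$group == "C"), ]) # 1
```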
### dplyr methods

An alternative is the `dplyr` package. Its syntax is largely sugar, giving piping options not unlike a few other packages (for example `magrittr`), but various benchmarks (shown in the first link) show that it can be used to improve performance in several areas. I am no expert on this package, however, as I tend to use `data.table`. The package provides the `%>%` piping operator and some utility functions such as `filter`, which can be used for subsetting data:

```r
library(dplyr)
df %>% filter(group == "C")
# Subsetting on two columns
df %>% filter(group == "C", random_binaries == TRUE) # Equivalent to group == "C" & random_binaries == TRUE
```
### data.table methods

Last, another popular package is `data.table`. Like `dplyr`, this package is designed for performance and memory efficiency. Its syntax is modelled on SQL statements (select, from, where, group by), but starting out it can be a bit confusing. The package provides a new `data.table` class, to be used rather than the `data.frame` class, which is notoriously slow for subsetting. One can, however, almost completely ignore the package's own syntax, as a `data.table` accepts `data.frame` syntax in most cases and can be used as a `data.frame` in every circumstance.

```r
library(data.table)
# Convert the data.frame to a data.table (by reference)
setDT(df)
```
`data.table` has two standard subsetting mechanisms: indices and keys. Indices are used when one subsets with `data.frame`-like logical expressions:

```r
df1 <- df[random_binaries == TRUE]
df2 <- df[group == "C"]
```

Indices have roughly the same speed as base R on the first use, but performance improves on every subsequent use, as the index built the first time is reused.
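One does not have to wait for the automatic index to be built, by the way. A minimal sketch (on a small hypothetical table, not the `df` above) of creating an index explicitly with `setindex`, so even the first subset benefits:

```r
library(data.table)

dt <- data.table(group = c("A", "B", "C", "C"), x = 1:4)
setindex(dt, group)  # build a secondary index on 'group' up front
indices(dt)          # lists the existing indices: "group"
dt[group == "C"]     # this subset can now use the index internally
```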
Keys are used to pre-sort the `data.table`, which allows for smart subsetting. Setting the key does take some time and has a slightly different syntax, but it outperforms the other methods (although indices are similar in speed):

```r
# Set the key using either setkey, or setkeyv (multiple columns)
setkeyv(df, c("group", "random_binaries"))
# Subset on group
df[.("C")]
# Subset on random_binaries
df[CJ(group, TRUE, unique = TRUE)]
df[.(unique(group), TRUE)]
# Subset on multiple conditions
df[.(c("C", "H"), c(TRUE, TRUE))]
```
The syntax may be confusing, but one can check out the package's useful wiki page, or the many Stack Overflow posts (8,968 as of today), which provide answers to most questions.
## Performance comparison

I've checked the performance of the subsetting methods presented above, which is visualized below. The visualization shows the various methods for a subset on `group == "C"` and on `group == "H" & random_binaries == TRUE`. The x-axis indicates the run time in milliseconds, and the y-axis shows the methods. The width of each blob indicates the range of run times, while its height illustrates the density of times within that range.

From the visualization one can see that, subsetting on both one and two columns, the `data.table` method using keys is much, much faster (marked as `data.table_.._keyed`), while using indices slightly outperforms the remaining methods. Using `subset` is slower than the standard methods, and surprisingly for this illustration `dplyr` is slower than base R; however, this might be due to my inexperience with the package.
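For anyone who wants to reproduce a comparison like this, here is a minimal sketch of how the timings can be gathered, assuming the `microbenchmark` package is installed (any timing tool works). It uses a smaller 1e5-row version of the data above so it runs quickly; the relative ordering is what matters.

```r
library(data.table)
library(dplyr)
library(microbenchmark)  # assumed installed

# Smaller sample of the data from the reproducible example above
set.seed(1)
df <- data.frame(group = sample(LETTERS, 1e5, TRUE),
                 random_numbers = rnorm(1e5),
                 random_binaries = rbinom(1e5, 1, 0.3))

dt_idx <- as.data.table(df)  # unkeyed copy; relies on (auto-)indexing
dt_key <- as.data.table(df)
setkey(dt_key, group)        # pre-sorted, keyed copy

microbenchmark(
  base     = df[df$group == "C", ],
  which    = df[which(df$group == "C"), ],
  subset   = subset(df, group == "C"),
  dplyr    = filter(df, group == "C"),
  dt_index = dt_idx[group == "C"],
  dt_keyed = dt_key[.("C")],
  times = 10
)
```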
