If your data file is very large and you only want the rows that match a certain criterion, the sqldf
package can filter the rows as it reads the file, so the full data set never has to be loaded into R.
Here is an example use case. I will add a binary column to the built-in data set iris
and write the new table to disk.
library(sqldf)
set.seed(1234)
iris1 <- iris
# add a random 0/1 column to filter on later
iris1$V4 <- rbinom(nrow(iris1), 1, 0.5)
write.table(iris1, "iris3.dat", sep = ",", quote = FALSE, row.names = FALSE)
Now read the data in, keeping only the rows where V4 == 0.
# set up file connection
iris3 <- file("iris3.dat")
df1 <- sqldf('select * from iris3 where "V4" = 0')
close(iris3)
Compare with the result of subset().
df2 <- subset(iris1, V4 == 0)
row.names(df2) <- NULL
all.equal(df1, df2)
#[1] "Component “Species”: Modes: character, numeric"
#[2] "Component “Species”: Attributes: < target is NULL, current is list >"
#[3] "Component “Species”: target is character, current is factor"
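All three differences reported above concern the Species column: sqldf returns it as a character vector, while subset() keeps the factor it inherited from iris1. A minimal sketch of the same mismatch, and one way to reconcile it, using two small stand-in data frames:

a <- data.frame(Species = c("setosa", "virginica"),
                stringsAsFactors = FALSE)          # character, as sqldf returns
b <- data.frame(Species = factor(c("setosa", "virginica")))  # factor, as subset() keeps
all.equal(a, b)                # reports the same Modes/Attributes differences
b$Species <- as.character(b$Species)
isTRUE(all.equal(a, b))        # TRUE once both columns are character

In the example above, converting with df2$Species <- as.character(df2$Species) should likewise make all.equal(df1, df2) return TRUE.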
Final clean up.
unlink("iris3.dat")
rm(iris1, df1, df2)