There are two parts to your question: efficient calculation and processing large data.
Efficient calculation
Suppose you had a more manageable data set m, with 5% of 30 million rows and 50 columns (this takes about 30% of my 8 Gb; running out of memory would make everything run slowly, so you'll need to let us know about this kind of information).
nrow <- .05 * 30000000        # 5% sample: 1.5 million rows
ncol <- 50
m <- matrix(rnorm(nrow * ncol), nrow)
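As a quick sanity check, object.size() (in the utils package, loaded by default) will report what m itself occupies:

## the matrix itself is about 0.6 Gb (8 bytes per double); peak usage during
## creation is higher because rnorm() and matrix() make transient copies
print(object.size(m), units="Gb")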
Maybe you'd write a function clean that efficiently removes the outliers on a per-row basis; it likely uses another function that efficiently calculates row-wise standard deviations
rowSD <- function(m) {
    ## efficiently calculate row-wise SD
    ## naive: apply(m, 1, sd, na.rm=TRUE)
    ## update via @BenBolker / http://stackoverflow.com/questions/16046820/change-row-values-to-zero-if-less-than-row-standard-deviation
    ## (note: the denominator assumes complete rows; sd() would divide by the
    ## per-row count of non-NA values instead)
    sqrt(rowSums((m - rowMeans(m, na.rm=TRUE))^2, na.rm=TRUE) / (ncol(m)-1))
}
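Before trusting the vectorised version at scale, it's worth checking it against the naive one on a small, NA-free matrix (test is just a throwaway name here):

set.seed(123)
test <- matrix(rnorm(10 * 50), 10)
## both should give identical row-wise SDs on complete data
all.equal(rowSD(test), apply(test, 1, sd))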
clean <- function(m) {
    ## efficiently implement your strategy for identifying outliers;
    ## the length-nrow vectors from rowMeans() and rowSD() recycle down
    ## the columns, so each element is compared with its own row's stats
    m[abs(m - rowMeans(m)) > 3 * rowSD(m)] <- NA  # fast enough
    m
}
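On the same small test matrix you can see how much clean() flags; for normally distributed data roughly 0.3% of values fall more than 3 SD from their row mean:

cleaned <- clean(test)
mean(is.na(cleaned))  # fraction of values flagged as outliers, ~0.003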
For the matrix m, the naive implementation of rowSD(m) took about 56 seconds, whereas the update from @BenBolker takes about 1.4 seconds; clean(m) takes about 5 seconds. Both make multiple copies of, and multiple passes through, the data, so they're far from ideal.
Large data
Think about processing your data in chunks of size nrow. If you'd cleaned two chunks m1, m2, you could combine them and keep the nrow rows with the largest standard deviations with
sd <- c(rowSD(m1), rowSD(m2))
## if sorted, sd[idx] would be the value that separates low from high
idx <- nrow(m1) + nrow(m2) - nrow
keep <- sd > sort.int(sd, partial=idx)[idx]  # keeps the top nrow rows, barring ties at the cutoff
## replace the smallest rows of m1 with the largest rows of m2; the counts
## match because each row dropped from m1 frees a slot for a row kept from m2
m1[!head(keep, nrow(m1)),] <- m2[tail(keep, nrow(m2)),]
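A toy example (with nrow reset to 3 just for this illustration) makes the bookkeeping easy to check:

nrow <- 3
## 3-row chunks whose row SDs are easy to eyeball
m1 <- rbind(c(0, 0, 0, 0), c(0, 10, 0, 10), c(0, 1, 0, 1))
m2 <- rbind(c(0, 5, 0, 5), c(1, 1, 1, 1), c(0, 20, 0, 20))
sd <- c(rowSD(m1), rowSD(m2))
idx <- nrow(m1) + nrow(m2) - nrow
keep <- sd > sort.int(sd, partial=idx)[idx]
m1[!head(keep, nrow(m1)),] <- m2[tail(keep, nrow(m2)),]
rowSD(m1)  # the three largest SDs survive: ~2.89, ~5.77, ~11.55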
Since you're doing matrix operations, it sounds like your data are all numeric, and scan, reading the file in chunks, is the appropriate input function.
conn <- file("myfile", "r")
result <- matrix(0, nrow, ncol)
while (length(x <- scan(conn, nmax = nrow * ncol, quiet = TRUE))) {
    ## specifying only ncol= lets a short final chunk keep complete rows
    m <- clean(matrix(x, ncol = ncol, byrow = TRUE))
    sd <- c(rowSD(result), rowSD(m))
    idx <- nrow(result) + nrow(m) - nrow
    keep <- sd > sort.int(sd, partial=idx)[idx]
    result[!head(keep, nrow(result)),] <- m[tail(keep, nrow(m)),]
}
close(conn)
result is then the desired collection of cleaned rows with the highest standard deviations.
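If you want to try the whole pipeline at a small scale before committing to 30 million rows, you could generate a toy "myfile" first (the sizes here are just for illustration); write()'s ncolumns argument emits one matrix row per line, matching byrow=TRUE on input:

## hypothetical dry run: 1000 rows on disk, keep the top 100
nrow <- 100; ncol <- 50
big <- matrix(rnorm(1000 * ncol), 1000)
write(t(big), "myfile", ncolumns=ncol)
## ... then run the loop above and confirm nrow(result) == nrow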