There are two parts to your question: efficient calculation and processing large data.
Efficient calculation
Suppose you had a more manageable data set m, with 5% of 30 million rows and 50 columns (this takes about 30% of my 8 Gb; running out of memory would make everything run slowly, so you'll need to let us know about this kind of information).
nrow <- .05 * 30000000        # 5% sample: 1.5 million rows
ncol <- 50
m <- matrix(rnorm(nrow * ncol), nrow)
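As a quick sanity check, object.size() (in the utils package, loaded by default) will report what m itself occupies:

## the matrix itself is about 0.6 Gb (8 bytes per double); peak usage during
## creation is higher because rnorm() and matrix() make transient copies
print(object.size(m), units="Gb")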
Maybe you'd write a function clean that efficiently removes the outliers on a per-row basis; it likely uses another function that efficiently calculates row-wise standard deviations
rowSD <- function(m) {
    ## efficiently calculate row-wise SD
    ## naive: apply(m, 1, sd, na.rm=TRUE)
    ## update via @BenBolker / http://stackoverflow.com/questions/16046820/change-row-values-to-zero-if-less-than-row-standard-deviation
    ## (note: the denominator assumes complete rows; sd() would divide by the
    ## per-row count of non-NA values instead)
    sqrt(rowSums((m - rowMeans(m, na.rm=TRUE))^2, na.rm=TRUE) / (ncol(m)-1))
}
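Before trusting the vectorised version at scale, it's worth checking it against the naive one on a small, NA-free matrix (test is just a throwaway name here):

set.seed(123)
test <- matrix(rnorm(10 * 50), 10)
## both should give identical row-wise SDs on complete data
all.equal(rowSD(test), apply(test, 1, sd))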
clean <- function(m) {
    ## efficiently implement your strategy for identifying outliers;
    ## the length-nrow vectors from rowMeans() and rowSD() recycle down
    ## the columns, so each element is compared with its own row's stats
    m[abs(m - rowMeans(m)) > 3 * rowSD(m)] <- NA  # fast enough
    m
}
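On the same small test matrix you can see how much clean() flags; for normally distributed data roughly 0.3% of values fall more than 3 SD from their row mean:

cleaned <- clean(test)
mean(is.na(cleaned))  # fraction of values flagged as outliers, ~0.003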
For the matrix m, the naive implementation of rowSD(m) took about 56 seconds, whereas the update from @BenBolker takes about 1.4 seconds; clean(m) takes about 5 seconds. Both make multiple copies of, and multiple passes through, the data, so they're far from ideal.
Large data
Think about processing your data in chunks of size nrow. If you'd cleaned two chunks m1, m2, you could combine them and keep the nrow rows with the largest standard deviations with
sd <- c(rowSD(m1), rowSD(m2))
## if sorted, sd[idx] would be the value that separates low from high
idx <- nrow(m1) + nrow(m2) - nrow
keep <- sd > sort.int(sd, partial=idx)[idx]  # keeps the top nrow rows, barring ties at the cutoff
## replace the smallest rows of m1 with the largest rows of m2; the counts
## match because each row dropped from m1 frees a slot for a row kept from m2
m1[!head(keep, nrow(m1)),] <- m2[tail(keep, nrow(m2)),]
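A toy example (with nrow reset to 3 just for this illustration) makes the bookkeeping easy to check:

nrow <- 3
## 3-row chunks whose row SDs are easy to eyeball
m1 <- rbind(c(0, 0, 0, 0), c(0, 10, 0, 10), c(0, 1, 0, 1))
m2 <- rbind(c(0, 5, 0, 5), c(1, 1, 1, 1), c(0, 20, 0, 20))
sd <- c(rowSD(m1), rowSD(m2))
idx <- nrow(m1) + nrow(m2) - nrow
keep <- sd > sort.int(sd, partial=idx)[idx]
m1[!head(keep, nrow(m1)),] <- m2[tail(keep, nrow(m2)),]
rowSD(m1)  # the three largest SDs survive: ~2.89, ~5.77, ~11.55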
Since you're doing matrix operations, it sounds like your data are all numeric, and scan, reading the file in chunks, is the appropriate input function.
conn <- file("myfile", "r")
result <- matrix(0, nrow, ncol)
while (length(x <- scan(conn, nmax = nrow * ncol, quiet = TRUE))) {
    ## specifying only ncol= lets a short final chunk keep complete rows
    m <- clean(matrix(x, ncol = ncol, byrow = TRUE))
    sd <- c(rowSD(result), rowSD(m))
    idx <- nrow(result) + nrow(m) - nrow
    keep <- sd > sort.int(sd, partial=idx)[idx]
    result[!head(keep, nrow(result)),] <- m[tail(keep, nrow(m)),]
}
close(conn)
result is then the desired collection of cleaned rows with the highest standard deviations.
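If you want to try the whole pipeline at a small scale before committing to 30 million rows, you could generate a toy "myfile" first (the sizes here are just for illustration); write()'s ncolumns argument emits one matrix row per line, matching byrow=TRUE on input:

## hypothetical dry run: 1000 rows on disk, keep the top 100
nrow <- 100; ncol <- 50
big <- matrix(rnorm(1000 * ncol), 1000)
write(t(big), "myfile", ncolumns=ncol)
## ... then run the loop above and confirm nrow(result) == nrow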