
I am selecting a subset of a data.frame g.raw, like this:

g.raw <- read.table(gfile, sep = ",", header = FALSE, row.names = 1)  # ~18M rows
snps <- intersect(row.names(na.omit(csnp.raw)), row.names(na.omit(esnp.raw)))  # ~1M row names
g <- g.raw[snps, ]

It works. However, that last line is EXTREMELY slow.

g.raw has about 18M rows and snps has about 1M entries. I realize these are large numbers, but this seems like a simple operation, and reading g.raw from the file into an in-memory data.frame wasn't a problem (it took a few minutes), whereas the subsetting above has been running for hours.

How do I speed this up? All I want is to shrink g.raw down to the rows named in snps.

Thanks!

user1988705
  • I do not think that there is a faster way than your solution. –  Jan 17 '13 at 22:24
  • 2
    I could see how indexing via characters could potentially be slow, but I'm having a hard time wrapping my head around how this could be taking hours. Can you provide a more complete code example? – joran Jan 17 '13 at 22:29
  • 3
    Create a logical vector `ind <- rownames(g.raw) %in% snps` and subset using `ind`. Is that any faster? – joran Jan 17 '13 at 22:47
  • YES. it is about a thousand times faster! Makes sense, it doesn't have to do the character index lookup, as you said. – user1988705 Jan 17 '13 at 22:52
  • Huh, I still wouldn't have thought the character indexing would have been _that_ slow. Weird. – joran Jan 17 '13 at 22:53
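
For reference, a minimal sketch of the logical-vector approach from joran's comment above (assuming `g.raw` and `snps` exist as in the question):

# one vectorized membership test instead of a per-name character lookup
ind <- rownames(g.raw) %in% snps   # logical vector, TRUE where the row name is in snps
g <- g.raw[ind, ]                  # note: rows come back in g.raw's order, not in snps order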

2 Answers


This seems to be a case where data.table can shine.

Reproducing the data.frame:

set.seed(1)
N <- 1e6    # total number of rows
M <- 1e5    # number of rows to subset

g.raw <- data.frame(sample(1:N, N), sample(1:N, N), sample(1:N, N))
rownames(g.raw) <- sapply(1:N, function(x) paste(sample(letters, 50, replace = TRUE), collapse = ""))  # random 50-letter row names
snps <- sample(rownames(g.raw), M)  # row names to subset by

head(g.raw) # looking into newly created data.frame
head(snps)  # and rows for subsetting

data.frame approach:

system.time(g <- g.raw[snps,])
# >    user  system elapsed 
# > 881.039   0.388 884.821 

data.table approach:

require(data.table)
dt.raw <- as.data.table(g.raw, keep.rownames = TRUE)
# rn is a column with rownames(g.raw)
system.time(setkey(dt.raw, rn))
# >  user  system elapsed 
# > 8.029   0.004   8.046 

system.time(dt <- dt.raw[snps,])
# >  user  system elapsed 
# > 0.428   0.000   0.429 

Well, roughly 100x faster with these N and M, even counting the one-off setkey step (and the speed-up gets better with larger N).

You can compare results:

head(g)
head(dt)
redmode

Pre-allocate, and use a matrix for building if the data is of uniform type. See iteratively constructed dataframe in R for a far more beautiful answer.
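
To illustrate what that first suggestion means in practice, here is a minimal sketch with made-up sizes (not the poster's data): growing a data.frame copies the whole object on every rbind(), while a pre-allocated matrix is filled in place.

n <- 5000

# growing: every rbind() copies the accumulated object, so the cost grows quadratically
grown <- data.frame()
system.time(for (i in 1:n) grown <- rbind(grown, data.frame(x = i, y = i)))

# pre-allocated matrix (possible when all columns share one type): allocate once, fill in place
pre <- matrix(NA_real_, nrow = n, ncol = 2)
system.time(for (i in 1:n) pre[i, ] <- c(i, i))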

UPDATE

You were right: the bottleneck is in the selection, not the allocation. The solution is to look up the numeric row indexes for snps once, and then select those rows by index, like so:

g <- g.raw[match(snps, rownames(g.raw)),]

I'm an R newbie - thanks, this was an informative exercise. FWIW, I've seen comments by others that they never use rownames - probably because of things like this.

UPDATE 2

See also fast subsetting in R, which is more or less a duplicate. Most significantly, note the first answer and its reference to Extract.data.frame, where we learn that rowname matching is partial, that there is a hash table on rownames, and that the solution I suggested here turns out to be the canonical one. However, given all that, plus some experiments, I still don't see why it is so slow: the partial-match algorithm should first look in the hash table for an exact match, which in our case should always succeed.
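
A rough way to see the difference for yourself (a hypothetical micro-benchmark; the object names are made up and absolute timings depend heavily on the R version and machine):

set.seed(2)
n <- 2e5
m <- 2e4
df <- data.frame(x = rnorm(n), y = rnorm(n))
rownames(df) <- paste0("r", sample.int(n))   # unique character row names
keys <- sample(rownames(df), m)

system.time(a <- df[keys, ])                        # character rowname lookup
system.time(b <- df[match(keys, rownames(df)), ])   # match() once, then integer subset
system.time(d <- df[rownames(df) %in% keys, ])      # logical vector (keeps df's row order)

identical(a, b)   # TRUE: same rows, same order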

Ed Staub
  • I don't really understand how pre-allocating can help. It only takes a few minutes to load all the data (into memory from a file). My code is a = g[subset,] where g already exists. Seems like the slow part isn't allocating the memory, but selecting the subset, for some reason. – user1988705 Jan 17 '13 at 22:35