
I've got a dataframe `dat` of size 30000 x 50. I also have a separate list that contains pointers to groupings of rows from this dataframe, e.g.,

rows <- list(c("34", "36", "39"), c("45", "46"))

This says that dataframe rows with rownames (not numeric row indices, but character `rownames(dat)`) "34", "36", "39" constitute one grouping, and "45", "46" constitute another grouping.

Now I want to pull out the groupings from the dataframe into a parallel list, but my code (below) is really, really slow. How can I speed it up?

> system.time(lapply(rows, function(r) {dat[r, ]}))
   user  system elapsed 
 246.09    0.01  247.23 

That's on a very fast computer, R 2.14.1 x64.

Jack Tanner
  • Could you give roughly how many elements in `rows` and roughly how many elements in `rows[[i]]`? Also, your rownames are all unique, right? (I've made a random `dat`, 30000x50, but I seem to be getting fast times for the `rows` I make up - they're probably not big enough?) – mathematical.coffee Jan 20 '12 at 04:20
  • `rows` has about 15000 elements; `length(rows[[i]])` ranges from 1 to 50 – Jack Tanner Jan 20 '12 at 04:24

5 Answers


One of the main issues is the matching of row names -- the default in `[.data.frame` is partial matching of row names, which you probably don't want, so you're better off with `match`. To speed it up even further you can use `fmatch` from the `fastmatch` package. This is a minor modification with some speedup:

# naive
> system.time(res1 <- lapply(rows,function(r) dat[r,]))
   user  system elapsed 
 69.207   5.545  74.787 

# match
> rn <- rownames(dat)
> system.time(res1 <- lapply(rows,function(r) dat[match(r,rn),]))
   user  system elapsed 
 36.810  10.003  47.082 

# fastmatch (exact matching with a cached hash table)
> library(fastmatch)
> rn <- rownames(dat)
> system.time(res1 <- lapply(rows,function(r) dat[fmatch(r,rn),]))
   user  system elapsed 
 19.145   3.012  22.226 

You can get a further speedup by not using `[` (it is slow for data frames) but splitting the data frame with `split`, provided your `rows` are non-overlapping and cover all rows (so that each row maps to exactly one entry in `rows`); a sketch follows below.
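
A minimal sketch of that idea, assuming (as the caveat above says) that the groups in `rows` are non-overlapping and together cover every row of `dat` (`nm`, `gid`, and `f` are throwaway names, not anything from the question):

nm  <- unlist(rows)                               # all grouped row names, in group order
gid <- rep(seq_along(rows), sapply(rows, length)) # group id for each name
f   <- gid[match(rownames(dat), nm)]              # group id for each row of dat
res <- split(dat, f)                              # every group extracted in one pass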

Depending on your actual data you may be better off with matrices, which have by far faster subsetting operators since they are native; see the sketch below.
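
For illustration, a sketch of the matrix route (with `fastmatch` loaded as above; `m`, `rn`, and `res2` are throwaway names). It only makes sense if all 50 columns share a single type, since a matrix cannot mix numeric and character columns -- check that `as.matrix` hasn't coerced anything unexpectedly:

m  <- as.matrix(dat)   # sensible only if every column has the same type
rn <- rownames(m)
res2 <- lapply(rows, function(r) m[fmatch(r, rn), , drop = FALSE])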

Simon Urbanek
  • 13,842
  • 45
  • 45
  • fmatch is really quite magical. I now see these timings (compare to those in the question): `user system elapsed 11.48 0.02 11.64` – Jack Tanner Jan 20 '12 at 16:41
  • ...but fmatch doesn't handle updates to the `table` arg: `s<-'a';fmatch('a',s);s[1]<-'b';fmatch('a',s)` The second time, fmatch should not find a match, but it does... In general, it seems dangerous to change NAMED objects, and also to rely on that they are never modified again... – Tommy Jan 20 '12 at 20:09
  • Yes, and the docs warn about that. That's the price you pay for speed ;) Unfortunately there is no way in R to get notified on update. And the issue is not `NAMED` at all but the fact that R will happily copy the attribute onto a new object. – Simon Urbanek Jan 20 '12 at 23:58
  • Actually, I found a way to identify the case that you illustrated (by storing the parent object in the hash), so the next release of `fastmatch` should be able to detect copied attributes that are out of sync. Thanks for the example :). – Simon Urbanek Jan 21 '12 at 01:26
  • Great that you could address that case. And the NAMED issue is that you are "breaking the rules" by modifying a NAMED object by assigning a new attribute to it. Perhaps the benefits outweigh the risks in your case. A rule-abiding solution would be a bit more cumbersome to use: `m <- fmakeMap(table); fmatch(x, m)` – Tommy Jan 22 '12 at 08:17

Update

My original post started with this erroneous statement:

The problem with indexing via rownames and colnames is that you are running a vector/linear scan for each element, eg. you are hunting through each row to see which is named "36", then starting from the beginning to do it again for "34".

Simon pointed out in the comments here that R apparently uses a hash table for indexing. Sorry for the mistake.

Original Answer

Note that the suggestions in this answer assume that you have non-overlapping subsets of data.

If you want to keep your list-lookup strategy, I'd suggest storing the actual row indices instead of string names, e.g. as sketched below.
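
A minimal sketch of that, assuming every name in `rows` actually occurs in `rownames(dat)` (`rows_idx` is a throwaway name):

rows_idx <- lapply(rows, match, table = rownames(dat)) # translate names to indices once
res <- lapply(rows_idx, function(i) dat[i, ])          # then index numerically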

An alternative is to store your "group" information as another column of your data.frame, then split your data.frame on that group. E.g., let's say your recoded data.frame looks like this:

dat <- data.frame(a=sample(100, 10),
                  b=rnorm(10),
                  group=sample(c('a', 'b', 'c'), 10, replace=TRUE))

You could then do:

split(dat, dat$group)
$a
   a           b group
2 66 -0.08721261     a
9 62 -1.34114792     a

$b
    a          b group
1  32  0.9719442     b
5  79 -1.0204179     b
6  83 -1.7645829     b
7  73  0.4261097     b
10 44 -0.1160913     b

$c
   a          b group
3 77  0.2313654     c
4 74 -0.8637770     c
8 29  1.0046095     c

Or, depending on what you really want to do with your "splits", you can convert your data.frame to a data.table and set its key to your new group column:

library(data.table)
dat <- data.table(dat, key="group")

Now do your list thing -- this will give you the same result as the `split` above:

 x <- lapply(unique(dat$group), function(g) dat[J(g),])

But you probably want to "work over your splits", and you can do that inline, e.g.:

ans <- dat[, {
  ## do some code over the data in each split
  ## and return a list of results, eg:
  list(nrow=length(a), mean.a=mean(a), mean.b=mean(b))
}, by="group"]

ans
     group nrow mean.a     mean.b
[1,]     a    2   64.0 -0.7141803
[2,]     b    5   62.2 -0.3006076
[3,]     c    3   60.0  0.1240660

You can do the last step in "a similar fashion" with `plyr`, e.g.:

library(plyr)
ddply(dat, "group", summarize, nrow=length(a), mean.a=mean(a),
      mean.b=mean(b))
  group nrow mean.a     mean.b
1     a    2   64.0 -0.7141803
2     b    5   62.2 -0.3006076
3     c    3   60.0  0.1240660

But since you mention your dataset is quite large, I think you'd like the speed boost data.table will provide.

Steve Lianoglou
  • 21
  • "The problem with indexing via rownames and colnames is that you are running a vector/linear scan for each element" - is plain wrong, R is not that stupid - it uses hash tables for indexing. However, since partial matching is the default you may do better by using `match` to avoid that (or preferably `fastmatch`, since you'd like to re-use the hash table) - for examples I've added a response. – Simon Urbanek Jan 20 '12 at 05:34
  • @SteveLianoglou, despite the error on the linear scan comment, thank you for demonstrating the use of `split`. – Jack Tanner Jan 20 '12 at 16:39

Here's one attempt at a speedup. It hinges on the fact that it is faster to look up a row index than a row name, so it tries to build a mapping from row name to row number in `dat`.

First create some data of the same size as yours and assign some numeric rownames:

> dat <- data.frame(matrix(runif(30000*50),ncol=50))
> rownames(dat) <- as.character(sample.int(nrow(dat)))
> rownames(dat)[1:5]
[1] "21889" "3050"  "22570" "28140" "9576" 

Now generate a random `rows` list with 15000 elements, each containing up to 50 random numbers from 1 to 30000 (treated as row *names* in this case):

# 15000 groups of up to 50 rows each
> rows <- sapply(1:15000, function(i) as.character(sample.int(30000,size=sample.int(50,size=1))))

For timing purposes, try the method in your question (ouch!):

# method 1
> system.time((res1 <- lapply(rows,function(r) dat[r,])))
   user  system elapsed 
182.306   0.877 188.362 

Now, try to make a mapping from row name to row number: `map[i]` should give the row number of the row named `i`.

FIRST, if your row names are a permutation of `1:nrow(dat)`, you're in luck! All you have to do is sort the rownames and return the indices:

> map <- sort(as.numeric(rownames(dat)), index.return=T)$ix
# NOTE: map[ as.numeric(rowname) ] -> rownumber into dat for that rowname.

Now look up row indices instead of row names:

> system.time((res2 <- lapply(rows,function(r) dat[map[as.numeric(r)],])))
   user  system elapsed
 32.424   0.060  33.050

Check we didn't screw anything up; since rownames are unique in R, it is enough to check that each extracted group has the same row names under both methods:

> all(mapply(function(a, b) identical(rownames(a), rownames(b)), res1, res2))
[1] TRUE

So, a ~6x speedup. Still not amazing though...

SECOND, if you're unlucky and your rownames are not related to `1:nrow(dat)` at all, you could try this, but only if `max(as.numeric(rownames(dat)))` is not too much bigger than `nrow(dat)`. It basically builds `map` with `map[rowname]` giving the row number, but since the rownames are no longer necessarily contiguous there can be heaps of gaps in `map`, which wastes a bit of memory:

map <- rep(-1,max(as.numeric(rownames(dat))))
obj <- sort(as.numeric(rownames(dat)), index.return=T)
map[obj$x] <- obj$ix

Then use `map` as before (`dat[map[as.numeric(r)], ]`).

mathematical.coffee

You could try this modification:

system.time(lapply(rows, function(r) {dat[ rownames(dat) %in% r, ]}))
IRTFM

Like mathematical.coffee, I too get fast times for this.

I don't know if it's possible in your case, but by unlisting to a vector and then converting to numeric you can get a speed boost:

dat <- data.frame(matrix(rnorm(30000*50), 30000, 50 ))
rows <- as.numeric(unlist(list(c("34", "36", "39"), c("45", "46"))))
system.time(lapply(rows, function(r) {dat[r, ]}))

EDIT:

# keep the character row names in their own column, and use plain
# integer row names instead (numeric indexing is fast)
dat$observ <- rownames(dat)
rownames(dat) <- 1:nrow(dat)
Tyler Rinker
  • However, the rows are row names, not row indices, so `as.numeric` will cause the wrong rows to be extracted. – mathematical.coffee Jan 20 '12 at 04:35
  • Could the row names be converted to numeric, or are they actual character vectors? If they're character vectors, I'd suggest making that another variable/column and having numeric row names; this retains your information. I'll demonstrate. – Tyler Rinker Jan 20 '12 at 04:38
  • Yeah I get the character rownames problem now. It's possible a hash table could be used here but I'd see how DWin and mathematical coffee's solutions pan out first. – Tyler Rinker Jan 20 '12 at 04:58