Update
My original post started with this erroneous statement:
The problem with indexing via rownames and colnames is that you are running a vector/linear scan for each element, e.g. you are hunting through each row to see which is named "36", then starting from the beginning to do it again for "34".
Simon pointed out in the comments here that R apparently uses a hash table for indexing. Sorry for the mistake.
Original Answer
Note that the suggestions in this answer assume that you have non-overlapping subsets of data.
If you want to keep your list-lookup strategy, I'd suggest storing the actual row indices instead of string names.
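A minimal sketch of that conversion, assuming your lookup list holds row names as strings. The names dat, groups and idx, the row names 31 to 40, and the seed are all made up for illustration:

```r
# Hypothetical data: 'dat', 'groups' and the row names are illustrative only.
set.seed(1)
dat <- data.frame(a = sample(100, 10), b = rnorm(10))
rownames(dat) <- as.character(31:40)

# The list-lookup strategy, keyed by row name ...
groups <- list(g1 = c("36", "34"), g2 = c("31", "40", "38"))

# ... converted once into integer row positions with match()
idx <- lapply(groups, function(nm) match(nm, rownames(dat)))

# Subsetting by position picks out the same rows as subsetting by name
by_name <- lapply(groups, function(nm) dat[nm, ])
by_pos  <- lapply(idx, function(i) dat[i, ])
```

The conversion is done once up front, so every later subset can use plain integer indexing.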
An alternative is to store your "group" information as another column in your data.frame, then split your data.frame on its group, e.g. let's say your recoded data.frame looks like this:
dat <- data.frame(a=sample(100, 10),
                  b=rnorm(10),
                  group=sample(c('a', 'b', 'c'), 10, replace=TRUE))
You could then do:
split(dat, dat$group)
$a
   a           b group
2 66 -0.08721261     a
9 62 -1.34114792     a

$b
    a          b group
1  32  0.9719442     b
5  79 -1.0204179     b
6  83 -1.7645829     b
7  73  0.4261097     b
10 44 -0.1160913     b

$c
   a          b group
3 77  0.2313654     c
4 74 -0.8637770     c
8 29  1.0046095     c
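Each element of that list is an ordinary data.frame, so you can loop over the pieces with sapply or lapply. A rough sketch, where the seed and the names pieces, group_sizes and group_means are my own and only there for illustration:

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible
dat <- data.frame(a = sample(100, 10),
                  b = rnorm(10),
                  group = sample(c("a", "b", "c"), 10, replace = TRUE))

pieces <- split(dat, dat$group)

# Each element of 'pieces' holds the rows of one group
group_sizes <- sapply(pieces, nrow)
group_means <- sapply(pieces, function(d) mean(d$a))
```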
Or, depending on what you really want to do with your "splits", you can convert your data.frame to a data.table and set its key to your new group column:
library(data.table)
dat <- data.table(dat, key="group")
Now do your list thing -- which will give you the same result as the split above:
x <- lapply(unique(dat$group), function(g) dat[J(g),])
But you probably want to "work over your splits", and you can do that inline, e.g.:
ans <- dat[, {
  ## do some code over the data in each split
  ## and return a list of results, eg:
  list(nrow=length(a), mean.a=mean(a), mean.b=mean(b))
}, by="group"]
ans
     group nrow mean.a     mean.b
[1,]     a    2   64.0 -0.7141803
[2,]     b    5   62.2 -0.3006076
[3,]     c    3   60.0  0.1240660
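The same kind of per-group summary can also be written with data.table's built-in .N symbol, which holds the number of rows in the current group. This is only a sketch on freshly generated data (the seed and the name dt are my own), so the numbers will not match the output above:

```r
library(data.table)

set.seed(42)  # arbitrary seed, only so the sketch is reproducible
dt <- data.table(a = sample(100, 10),
                 b = rnorm(10),
                 group = sample(c("a", "b", "c"), 10, replace = TRUE),
                 key = "group")

# .N is the row count of each group, replacing length(a)
ans <- dt[, list(nrow = .N, mean.a = mean(a), mean.b = mean(b)), by = "group"]
```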
You can do the last step in "a similar fashion" with plyr, e.g.:
library(plyr)
ddply(dat, "group", summarize, nrow=length(a), mean.a=mean(a),
      mean.b=mean(b))
  group nrow mean.a     mean.b
1     a    2   64.0 -0.7141803
2     b    5   62.2 -0.3006076
3     c    3   60.0  0.1240660
But since you mention your dataset is quite large, I think you'd like the speed boost data.table will provide.