I have a vector of values, call it X, and a data frame, call it dat.fram. I want to use something like "grep" or "which" to find, for each element of X, all the indices of dat.fram[,3] that match it.

Below is the very inefficient for loop I currently use. Note that X has many observations and each element of "match.ind" can contain zero or more matches. Also, dat.fram has over 1 million observations. Is there a vectorized function in R that would make this process more efficient?

Ultimately, I need a list since I will pass the list to another function that will retrieve the appropriate values from dat.fram .

Code:

match.ind=list()

for(i in 1:150000){
    match.ind[[i]]=which(dat.fram[,3]==X[i])
}
  • Is it possible that the function you need to pass the list to ought to be vectorized as well? Why do you need to examine each individual item in match.ind separately? Do you really just need all of the appropriate values from dat.fram at once? – John Jun 07 '12 at 03:01
  • Yes, looping will be really slow here - if I've understood the question correctly though, the code I've posted should do what you require :) – Tim P Jun 07 '12 at 03:59

1 Answer

UPDATE:

Ok, wow, I just found an awesome way of doing this... it's really slick. Wondering if it's useful in other contexts...?!

### define v as a sample column of data - in your case v should be
### the column of the data frame you mentioned (dat.fram[,3])

v = sample(1:150000, 1500000, replace=TRUE)

### now here's the trick: concatenate the indices for each possible value of v,
### to form mybiglist - the rownames of mybiglist give you the possible values
### of v, and the values in mybiglist give you the index points

mybiglist = tapply(seq_along(v),v,c)

### now you just want the parts of this that intersect with X... again I'll
### generate a random X but use whatever X you need to

X = sample(1:200000, 150000)
mylist = mybiglist[names(mybiglist) %in% X]
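
For anyone who wants to see the idea on data small enough to check by eye, here is a minimal sketch. It swaps in split() for tapply(..., c): as far as I know split() builds the same grouping of indices but always returns a plain list, whereas tapply() can simplify its result when every group has exactly one element. The toy values of v and X are made up for illustration.

```r
# Toy data (hypothetical values, small enough to verify by hand)
v <- c(2, 9, 2, 4, 9, 2)   # stands in for dat.fram[,3]
X <- c(2, 4, 6)            # 6 never occurs in v

# One integer vector of positions per distinct value of v;
# the list names are the values themselves, as character strings
by_value <- split(seq_along(v), v)

# Keep only the values that also appear in X
mylist <- by_value[names(by_value) %in% X]

print(mylist)
```

Here mylist has entries named "2" and "4" only; 9 is dropped because it is not in X, and 6 has no entry because it never occurs in v.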

And that's it! As a check, let's look at the first 3 elements of mylist:

> mylist[1:3]

$`1`
[1]  401143  494448  703954  757808 1364904 1485811

$`2`
[1]  230769  332970  389601  582724  804046  997184 1080412 1169588 1310105

$`4`
[1]  149021  282361  289661  456147  774672  944760  969734 1043875 1226377

There's a gap at 3, as 3 doesn't appear in X (even though it occurs in v). And the numbers listed against 4 are the index points in v where 4 appears:

> which(X==3)
integer(0)

> which(v==3)
[1]  102194  424873  468660  593570  713547  769309  786156  828021  870796  883932 1036943 1246745 1381907 1437148

> which(v==4)
[1]  149021  282361  289661  456147  774672  944760  969734 1043875 1226377

Finally, it's worth noting that values that appear in X but not in v won't have an entry in the list. This is presumably what you want anyway, since looking up a missing name in the list simply returns NULL!

Extra note: You can use the code below to create an NA entry for each member of X not in v...

blanks = sort(setdiff(X,names(mylist)))
mylist_extras = rep(list(NA),length(blanks))
names(mylist_extras) = blanks
mylist_all = c(mylist,mylist_extras)
mylist_all = mylist_all[order(as.numeric(names(mylist_all)))]

Fairly self-explanatory: mylist_extras is a list holding the extra entries you need (its names are the values of X that do not appear in names(mylist), and each entry is simply NA). The final two lines first merge mylist and mylist_extras, then reorder the result so that the names in mylist_all are in numeric order. These names should then match exactly the (unique) values in the vector X.
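
If you would rather get the empty entries in one step, a possible alternative (a sketch, not what the code above does) is to turn v into a factor whose levels are the unique values of X and then split on it. Values of X with no match come back as integer(0), like the output of which() in the original loop, instead of NA. The toy data below is hypothetical.

```r
# Toy data (hypothetical values)
v <- c(5, 3, 5, 7, 3, 5)   # stands in for dat.fram[,3]
X <- c(3, 4, 5)            # 4 never occurs in v

# Using the unique values of X as factor levels keeps one entry per value of X.
# Values of v that are not in X (here 7) become NA and are dropped by split().
idx <- split(seq_along(v), factor(v, levels = unique(X)))

print(idx)
```

The list comes out in the order of unique(X), so idx[as.character(X)] should line the entries up with X itself, duplicates included.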

Cheers! :)


ORIGINAL POST BELOW... superseded by the above, obviously!

Here's a toy example with tapply that might well run significantly quicker... I made X and d relatively small so you could see what's going on:

X = 3:7
n = 100
d = data.frame(a = sample(1:10,n,replace=TRUE), b = sample(1:10,n,replace=TRUE), 
               c = sample(1:10,n,replace=TRUE), stringsAsFactors = FALSE)

tapply(X,X,function(x) {which(d[,3]==x)})
Tim P
  • I actually got a similar run time from this method to the for loop method I included originally. EDIT: _Was in response to original post._ – interested_in_the_world Jun 07 '12 at 04:11
  • Works very nicely AND quickly! Thank you! I'm pretty sure I can reinsert the rows with null values via some other script, because if no matches exist for a certain value in `X` I will still need to have a row for it. – interested_in_the_world Jun 07 '12 at 04:14
  • I've included code (under "Extra note") that fills in those additional positions as NA. Change the NA to NULL if you fancy... the main thing is that the names of the final list you create don't miss anything contained in X! :) – Tim P Jun 07 '12 at 06:24