
I have a vector

a <- c("there and", "walk and", "and see", "go there", "was i", "and see", 
"i walk", "to go", "to was")

and a data frame bg where

bg <- data.frame(term=c("there and", "walk and", "and see", "go there", "was i", "and see",
"i walk", "to go", "to was"), freq=c(1,1,2,1,1,2,1,1,1))

I need to create a vectorized version of the following code, using sapply, tapply, vapply, apply, etc.:

 library(dplyr)                        # filter() below is dplyr::filter
 d <- NULL
 for(i in 1:length(a)){
     temp <- filter(bg, term == a[i])  # rows of bg whose term matches a[i]
     d <- rbind(d, temp)               # grows d on every iteration
 }

The goal is to search bg for rows where term == a[i], for each element of a, and collect the matching rows into a data frame d.

I need a vectorized version, as the for loop above is excruciatingly slow in R.

Here is the sample data

> bg
       term freq
1 there and    1
2  walk and    1
3   and see    2
4  go there    1
5     was i    1
6   and see    2
7    i walk    1
8     to go    1
9    to was    1

and

> d
       term freq
1 there and    1
2  walk and    1
3   and see    2
4   and see    2
5  go there    1
6     was i    1
7   and see    2
8   and see    2
9    i walk    1
10    to go    1
11   to was    1

Thanks

  • That for loop is excruciatingly slow because you are building the structure inside the loop instead of allocating the memory for the vector beforehand and then binding the vectors after the loop has ended. Please show what you want the desired result to look like – Rich Scriven Aug 25 '15 at 04:17
  • Your initial statement about `for` loops is not totally true: http://stackoverflow.com/a/7142982/3710546 –  Aug 25 '15 at 04:19
  • @RichardScriven - `dplyr::filter` i imagine. – thelatemail Aug 25 '15 at 04:22
  • @RichardScriven yes I am using dplyr filter as seen above. dplyr::filter is fast but the for loop is murder. My data frame has 300K rows and the computation is taking 'for'ever. – Tinniam V. Ganesh Aug 25 '15 at 04:44
  • @Pascal I managed to vectoriize other versions and the performance improvement is almost logarithmic, I think. – Tinniam V. Ganesh Aug 25 '15 at 04:48
  • `merge(data.frame(table(term=a)), bg, by="term")` – thelatemail Aug 25 '15 at 04:53
  • @latemail - Looks good. May need to massage the output. Let me check. Will get back to you later today. – Tinniam V. Ganesh Aug 25 '15 at 04:58
  • @TinniamV.Ganesh - maybe just `merge(data.frame(term=a), bg, by="term", sort=FALSE)` going by your updated data. – thelatemail Aug 25 '15 at 05:07
  • Or using the devel version of `data.table`: `data.table(term=a)[bg, on='term']` – akrun Aug 25 '15 at 05:09
  • @akrun - how new does data.table have to be to use that code? No go over here on 1.9.4 – thelatemail Aug 25 '15 at 05:19
  • @thelatemail I meant the `1.9.5`. For `1.9.4`, we have to set the key, instead of the `on` – akrun Aug 25 '15 at 05:20
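
As an aside on Rich Scriven's first comment: the loop is slow mainly because d is grown with rbind() on every iteration. A minimal sketch of the usual workaround (my own illustration, not from the thread) keeps the dplyr filter but collects the pieces in a list and binds them once:

library(dplyr)
pieces <- lapply(a, function(x) filter(bg, term == x))  # one small data frame per element of a
d <- do.call(rbind, pieces)                             # single bind at the end (or dplyr::bind_rows(pieces))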

1 Answer


This essentially becomes a merge operation, with a little twist to make sure that the row order follows the order in a:

out <- merge(bg, list(term=a, sortid=seq_along(a)), by="term")
out[order(out$sortid),]

#        term freq sortid
#7  there and    1      1
#10  walk and    1      2
#1    and see    2      3
#3    and see    2      3
#5   go there    1      4
#11     was i    1      5
#2    and see    2      6
#4    and see    2      6
#6     i walk    1      7
#8      to go    1      8
#9     to was    1      9
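
If the helper column isn't wanted in the final result, it can be dropped once the rows are ordered (a small addition of mine, not part of the original answer):

d <- out[order(out$sortid), c("term", "freq")]  # reorder by sortid, keep only the original columns
rownames(d) <- NULL                             # optional: reset row names to 1..n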

Or in data.table 1.9.5, with a nod to @akrun:

library(data.table)
out <- data.table(term=a, sortid=seq_along(a))[setDT(bg), on='term']
out[order(out$sortid)]
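
If you are still on data.table 1.9.4, which predates the on= argument (as akrun notes in the comments above), roughly the same join should work with keys instead; this is my own sketch and untested on that version:

library(data.table)
lookup <- data.table(term = a, sortid = seq_along(a))
setkey(lookup, term)        # a keyed join stands in for on='term' in 1.9.4
setkey(setDT(bg), term)     # note: setkey() reorders bg by term
out <- lookup[bg]           # join bg against the keyed lookup table
out[order(sortid)]          # restore the original order of a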

Or in dplyr:

left_join(data.frame(term=a), bg)
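
A slightly more explicit variant of the same dplyr call (my addition): spelling out the join column silences the "Joining by" message, and left_join() preserves the row order of its left-hand table, i.e. the order of a:

library(dplyr)
d <- left_join(data.frame(term = a, stringsAsFactors = FALSE), bg, by = "term")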