I would like to query a database using the biomaRt package. I have a set of loci and want to retrieve some related information for each of them, say the description.

I first tried lapply but was surprised by how long the task took. I then tried a plain for loop and got a faster result.

Is that expected, or is something wrong with my code or with my understanding of apply? I have read other posts dealing with *apply vs for-loop performance (here, for example), so I knew not to expect better performance from lapply, but I don't understand why it is actually worse here.

Here is a reproducible example.

1) Loading the library and selecting the database:

library("biomaRt")
athaliana <- useMart("plants_mart_14")
athaliana <- useDataset("athaliana_eg_gene",mart=athaliana)

2) Querying the database:

loci <- c("at1g01300", "at1g01800", "at1g01900", "at1g02335", "at1g02790", 
"at1g03220", "at1g03230", "at1g04040", "at1g04110", "at1g05240"
)

I created a function for use with lapply:

foo <- function(loci) {
  getBM("description","tair_locus",loci,athaliana)
}

When I use this function on the first element:

> system.time(foo(loci[1]))
   user  system elapsed 
  0.020   0.004   1.599

When I use lapply to retrieve the data for all values:

> system.time(lapply(loci, foo))
   user  system elapsed 
  0.220   0.000  16.376

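I then created a new function, using a for loop:

foo2 <- function(loci) {
  # i iterates over the locus IDs themselves, so pass i directly;
  # loci[i] would index an unnamed vector by name and yield NA
  for (i in loci) {
    getBM("description","tair_locus",i,athaliana)
  }
}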

Here is the result:

> system.time(foo2(loci))
   user  system elapsed 
  0.204   0.004  10.919

Of course, this will be applied to a large list of loci, so I need the best-performing option. Thank you for your help.

EDIT: Following the recommendation of @MartinMorgan

Simply passing the vector loci to getBM greatly improves the query efficiency. Simpler is better.

> system.time(lapply(loci, foo))
   user  system elapsed 
  0.236   0.024 110.512 

> system.time(foo2(loci))
   user  system elapsed 
  0.208   0.040 116.099 

> system.time(foo(loci))
   user  system elapsed 
  0.028   0.000   6.193 
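
For reference, the single vectorized query could look like the sketch below. It assumes the `athaliana` mart object created above; `describe_loci` and its `query` argument are hypothetical names introduced here, and the `query` hook exists only so the function can be tried offline with a stand-in. Requesting `tair_locus` alongside `description` lets the rows be matched back to the input, since, as far as I know, biomaRt does not promise to return rows in input order. The case-insensitive match is likewise an assumption about how the IDs come back.

```r
# Sketch: one vectorized biomaRt query, re-aligned with the input vector.
# `describe_loci` and `query` are illustrative names, not part of biomaRt.
describe_loci <- function(loci, mart, query = biomaRt::getBM) {
  # A single call carries every locus, so there is only one round trip.
  res <- query(
    attributes = c("tair_locus", "description"),
    filters    = "tair_locus",
    values     = loci,
    mart       = mart
  )
  # Match returned rows to the requested loci (ignoring case);
  # loci that were not returned get an NA description.
  idx <- match(toupper(loci), toupper(res$tair_locus))
  data.frame(
    tair_locus  = loci,
    description = res$description[idx],
    stringsAsFactors = FALSE
  )
}
```

Called as `describe_loci(loci, athaliana)`, this makes one trip to the server for the whole vector instead of one per locus.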
ptocquin
  • [This](http://stackoverflow.com/q/2275896/324364) question may be of interest to you. – joran Sep 05 '12 at 20:54
  • @joran, but in this case it is just calling a function that takes quite long, the same number of times for both the for loop and the lapply loop. I would not expect there to be a difference. – Paul Hiemstra Sep 05 '12 at 21:00
  • @ptocquin, if you repeat the procedure 10 times, is the result reproducible? Just to rule out that the connection speed or how busy the database is plays a role. In general, these kinds of questions are hard to answer because your example involves a database we do not have. – Paul Hiemstra Sep 05 '12 at 21:02
  • @PaulHiemstra I had no idea what the function was doing, but I still think that link is helpful for understanding this kind of issue. – joran Sep 05 '12 at 21:16
  • @joran I do not think the issue is related to apply vs for loop. There are far too few function calls for it to matter. – Paul Hiemstra Sep 05 '12 at 21:20
  • More than database access, the query is to a web-based resource, so one would expect very large variation independent of any R coding differences. Also, biomaRt queries are generally vectorized; this `getBM("description", "tair_locus", loci, athaliana)` for your vector of 10 loci returns a vector of 10 descriptions. – Martin Morgan Sep 05 '12 at 21:20
  • Thank you all for the answers and comments. I tried again today and no longer see significant differences when running the code repeatedly. So I guess you are right that the problem comes from outside of R. – ptocquin Sep 06 '12 at 07:15
  • @MartinMorgan Thank you for the comment, but I do not really understand whether vectorizing the query can affect performance in any way. What should I take from your comment? – ptocquin Sep 06 '12 at 07:43
  • @ptocquin instead of using a for loop or an lapply, just do a single query, with `loci` containing all the values you want. This will be fast, because there will only be one trip over the internet to the biomaRt server. – Martin Morgan Sep 06 '12 at 13:03
  • @MartinMorgan... Great! Thank you so much! The performance is improved almost 20x by simply passing my vector `loci` directly to getBM... So stupid of me... I have edited my question accordingly. – ptocquin Sep 06 '12 at 21:31

2 Answers

I suspect this has nothing to do with lapply versus a for loop. Each call to the database takes about 1.6 seconds, and lapply makes 10 such calls, for a total of about 16 seconds. The for loop does nothing different: it also calls the database 10 times. However, it repeats the queries that lapply already ran, so the database may return cached results instead of running the queries again. You can check this by reversing the order of the calls and seeing whether the timings then favor lapply. Thanks to @JoshuaUlrich for suggesting database caching in the chat. In addition to database caching (or caching by the web server), the web-based access that @MartinMorgan points out adds a fair amount of randomness to how long getBM takes.

Bottom line: the differences most likely originate outside R and are unrelated to the choice between lapply and a for loop.
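
The argument can be tried out without any database. The sketch below is only a simulation under stated assumptions: `fake_query` is a hypothetical stand-in that charges a fixed, made-up latency per call, and the run order is flipped on every other repetition so that a warm-up or caching effect cannot systematically favor one variant.

```r
# Hypothetical stand-in for a remote query: each call pays one fixed
# round-trip cost, no matter how many values it carries.
fake_query <- function(values, latency = 0.02) {
  Sys.sleep(latency)                # simulated network round trip
  paste("description of", values)   # one result per input value
}

loci <- sprintf("at1g%05d", 1:10)   # made-up locus IDs

with_lapply <- function() lapply(loci, fake_query)
with_for <- function() {
  out <- vector("list", length(loci))
  for (i in seq_along(loci)) out[[i]] <- fake_query(loci[i])
  out
}
vectorized <- function() fake_query(loci)

# Elapsed wall-clock time of one call to f, in seconds.
time_it <- function(f) unname(system.time(f())["elapsed"])

variants <- list(lapply = with_lapply, for_loop = with_for,
                 vectorized = vectorized)

# Four repetitions; even-numbered ones run the variants in reverse order.
runs <- sapply(1:4, function(r) {
  ord <- if (r %% 2 == 0) rev(names(variants)) else names(variants)
  t <- sapply(ord, function(n) time_it(variants[[n]]))
  t[names(variants)]                # restore a fixed row order
})

rowMeans(runs)                      # mean elapsed time per variant
```

In this simulation the two looping variants should take about the same time, since both make 10 calls, while the single vectorized call pays the latency only once.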

Paul Hiemstra

The apply family of functions, while idiomatic, are just syntactic sugar for for loops. They're often slower. As joran has pointed out, this has been asked on SO before.

If you'd like tips on speeding up a loop in R, see page 46 of this volume of R News (a pdf), called "R Help Desk--How Can I Avoid This Loop or Make It Faster?"

Wilduck
  • But I would not expect there to be such a big difference when the majority of the time is spent inside the function, with only a small number of function calls. – Paul Hiemstra Sep 05 '12 at 21:05
  • That's a fair assertion. Like you said, database access could be a factor. It's also possible that the fact that `lapply` does not require you to specify a structure for your data (and therefore has to guess) is part of the slowdown. See http://www.r-bloggers.com/the-performance-cost-of-a-for-loop-and-some-alternatives/. All in all, I think the OP's best bet is to do a little reading about vectorization in the link I provided. – Wilduck Sep 05 '12 at 21:12