I would like to know if and how I could make my code more efficient by using vectorized functions instead of for loops.

I am working on a dataset with around 1.6 million observations. I want to adjust the prices for inflation so I need to match the month of the observation with the month of the corresponding CPI index. I have a main data frame (the one with 1.6 million observations) and a data frame with the CPI index I need (this only has 12 observations, one for each month in the year my analysis is taking place).

Here is how I tried to "match" each observation with its corresponding CPI index:

```r
for (i in 1:nrow(large.data.frame)) {
  for (j in 1:nrow(CPI)) {
    if (months(large.data.frame[i, "Date"]) == months(CPI[j, "Date"])) {
      CPImatch[i] <- CPI[j, 2]
    }
    else next
  }
}
```

NOTE: CPImatch is a separate data frame that I planned to fill with the matched values and then cbind to my initial data frame. I also realize there is probably a better way to do this...

Since my code is still running, I know that this is an incredibly inefficient (and maybe even wrong) way of doing what I want to do. Is there a way of vectorizing this loop, possibly with a function from the apply family?

Any feedback is greatly appreciated!

Philipp HB
aMarsh
  • It would be good if you could edit your question with a small example of both your large.data.frame and the CPI data, and the expected outcome, please. It looks like a loop will not be required, perhaps just matching. [Info on making a reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – user20650 Apr 18 '16 at 21:47
  • One of the biggest things you can do to improve your speed would be to preallocate CPImatch above your loops: `CPImatch <- numeric(nrow(large.data.frame))` – lmo Apr 18 '16 at 22:18
  • This article on functionals helped me get started: http://adv-r.had.co.nz/Functionals.html. – AllanT Apr 18 '16 at 22:47
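
The preallocation suggestion in the comments can be combined with computing the month names once, outside the loops. The sketch below is only an illustration: the toy values and the `Price` and `Index` column names are made up, and `obs.months` and `cpi.months` are hypothetical helper names. It assumes the CPI values sit in the second column of `CPI`, as in the question. It uses `NA` rather than `0` as the preallocated value so that unmatched rows are easy to spot.

```r
# Toy stand-ins for the real data (hypothetical values, for illustration only)
large.data.frame <- data.frame(
  Date  = as.Date(c("2015-01-15", "2015-03-02", "2015-01-30", "2015-12-24")),
  Price = c(10, 20, 30, 40)
)
CPI <- data.frame(
  Date  = seq(as.Date("2015-01-01"), by = "month", length.out = 12),
  Index = 100 + 0:11
)

# Compute the month names once instead of inside the loops
obs.months <- months(large.data.frame$Date)
cpi.months <- months(CPI$Date)

# Preallocate the result vector (NA marks observations with no match)
CPImatch <- rep(NA_real_, nrow(large.data.frame))

# Loop over the 12 CPI rows only; each pass fills every matching observation
for (j in seq_along(cpi.months)) {
  CPImatch[obs.months == cpi.months[j]] <- CPI[j, 2]
}
```

This still uses a loop, but it runs 12 times instead of 1.6 million times 12, and each pass is a vectorised comparison over the whole column.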

1 Answer

Your code can certainly be made much faster. One simple step would be to pre-calculate the months rather than calculating them many, many times. Vectorisation will make it even faster. I think the following code should work, mapping the months to CPI, although it is difficult to test without some test data.

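```r
library(plyr)
# Map each observation's month name to the CPI value for that month.
# mapvalues() returns a character vector here, so convert back to numeric.
CPImatch <- as.numeric(mapvalues(months(large.data.frame$Date), from = months(CPI$Date), to = CPI[, 2]))
```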
Richard Telford
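
For comparison, the same lookup can be sketched in base R with `match()`, without plyr. This is a minimal illustration on made-up data, not a tested solution for the real data set: the toy values, the `Price` and `Index` column names, and the `idx` helper are assumptions, and the CPI values are again assumed to be in the second column of `CPI`.

```r
# Toy stand-ins for the real data (hypothetical values, for illustration only)
large.data.frame <- data.frame(
  Date  = as.Date(c("2015-02-10", "2015-07-04", "2015-02-28")),
  Price = c(5, 15, 25)
)
CPI <- data.frame(
  Date  = seq(as.Date("2015-01-01"), by = "month", length.out = 12),
  Index = 100 + 0:11
)

# For each observation, find the row of CPI that has the same month name
idx <- match(months(large.data.frame$Date), months(CPI$Date))

# Use those row positions to pull the CPI values; unmatched months give NA
large.data.frame$CPImatch <- CPI[idx, 2]
```

Because `match()` processes the whole column in one call, no explicit loop or `cbind` step is needed, and the new column keeps the numeric type of the CPI values.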