
Question

How can I fix the code below so it runs fast enough to process a 3.2 million row query without choking in the calculation section like it does now? "Fast" here means hours at most, not days, and I suspect minutes is reasonable on the PC it's running on.

Problem

I have a local SQL database, and I'm trying to pull down data, calculate a new value, and write it back to a different table. However, the calculation code is currently super slow. I am processing approximately 3.2 million records, so this is a problem. All of this is running in R inside Visual Studio.

I've never used R before, so I'm not sure what is what. In fact, I had a lot of trouble finding good documentation to get this far. From the lack of speed, and the way it seems to slow down as it runs, I suspect that in the loop I am either reading data via an index into a list or appending to a list without a link to its end.

library(RODBC)

db <- odbcConnect("LocalDB")
results <- sqlQuery(db,'Select * from stockAppData;')

print("Completed loading data!")

results$pDate = as.Date(results$pDate)
results$Symbol = as.character(results$Symbol)
results$Exchange = as.character(results$Exchange)

print("Completed formating data")

#Table to temporarily hold results that will be uploaded to sql db.
table <- data.frame(Symbol = character(),
                    Exchange = character(),
                    pDate = as.Date(character()),
                    pCloseChange = double(),
                    stringsAsFactors = FALSE)
print("MADE TABLE")

#This is the loop that seems to be super slow
for (i in 2:nrow(results)) {
    s <- results$Symbol[i]
    e <- results$Exchange[i]
    d <- results$pDate[i]
    if (results$Symbol[i] == results$Symbol[i - 1]) {
        pcc <- (results$pClose[i] / results$pClose[i - 1]) - 1
        table <- rbind(table, c(s,e,d,pcc))
    } else {
        cat("Calculated Pcgain for: ", results$Symbol[i-1] , "\n" ,sep = "")
    }
}
#Never been here because the loop takes forever
cat("Finished Calculations: returning results to DB")

columnTypes <- list(Symbol = "VARCHAR(10)", Exchange = "VARCHAR(5)", pDate = "DATE", pCloseChange = "DOUBLE PRECISION")
sqlSave(db, table, varTypes = columnTypes, rownames = FALSE, colnames = FALSE)

odbcClose(db)

Additional background

Other things I suspect are slow, though not too slow, are the character comparisons; I had a lot of trouble comparing factors and I don't know how to do it properly. I also don't strictly need double precision, but I prefer it since it will be required in later calculations.
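
For reference, here is a minimal standalone sketch of the factor-comparison pitfall I kept running into (hypothetical values, not my real data):

a <- factor("APP", levels = c("APP", "APPL"))
b <- factor("APPL", levels = c("APP", "APPL"))
a == b                               # FALSE: works because the level sets match
as.character(a) == as.character(b)   # FALSE: safe even when level sets differ

Converting with as.character() before comparing sidesteps the level-set issue entirely, which is why I convert Symbol and Exchange right after loading.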

Loop Logic

Each time the loop runs, we evaluate the variable pcc, which stands for "Percent Change of Close": the current day's close divided by the previous day's close, minus 1. The loop simply adds this value, along with the symbol, exchange, and date of the record being evaluated, to the results table. As explained in an answer, I used rbind in a loop, which is definitely one reason it's slow.

The if statement makes sure we skip the first record for a given symbol/exchange combination. We skip the first day of any security because pcc relies on the previous day's close, and the previous row belongs to a different security. See the example below; note that the input and output are simplified, and a quick sanity check of the formula follows the sample output.

Example input

Symbol Exchange pDate      pClose
APP    TSX      2018-01-13 1.00
APP    TSX      2018-01-14 2.00
APP    NYSE     2018-01-13 2.00
APP    NYSE     2018-01-14 3.00
APPL   TSX      2018-01-13 2.00
APPL   TSX      2018-01-14 3.00

Sample Output

Symbol Exchange pDate      pcc
APP    TSX      2018-01-14 1.0
APP    NYSE     2018-01-14 0.5
APPL   TSX      2018-01-14 0.5
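
As a quick sanity check of the formula against the sample input (a standalone snippet, not part of my actual script):

pClose_prev <- c(1.00, 2.00, 2.00)  # 2018-01-13 closes: APP/TSX, APP/NYSE, APPL/TSX
pClose_curr <- c(2.00, 3.00, 3.00)  # 2018-01-14 closes for the same securities
pClose_curr / pClose_prev - 1       # 1.0 0.5 0.5, matching the sample output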
Sarzorus
  • I'm also open to any resources on multi-threading in R, since these calculations can be broken up easily. I know it's off topic, so if you're inclined to send me resources for that, please inbox me. – Sarzorus Jun 06 '18 at 15:28
  • It's very likely that your code could be made much faster, but it would help if you stated your issue with a reproducible example and a desired output. At least state the logic of what you are trying to do. – nicola Jun 06 '18 at 15:29
    Does `results$pCloseChange<-c(NA,results$pClose[-1]/results$pClose[-nrow(results)])` get you close to what you want? – nicola Jun 06 '18 at 15:37
    Please (a) share a small, reproducible sample input and desired output (4 dates, 2 symbols, 1 exchange should be plenty), and (b) describe what your loop does in words. See [this excellent FAQ](https://stackoverflow.com/a/5963610/903061) for tips on making reproducible examples in R. If you use `dput()`, it will be copy/pasteable, and questions with copy/pasteable examples tend to get working answers very quickly. – Gregor Thomas Jun 06 '18 at 15:52
  • @Gregor thanks for the suggestions and tips; I have edited in your suggestions. Thanks again! – Sarzorus Jun 06 '18 at 19:03
  • @nicola thanks. I've never asked a question about R before, so I didn't think about the data, and didn't realize the answer would have to take it into account; I've edited the question a bit to add it in. The current calculation gives the right answer, it's just slow. See the accepted answer for details on why, if you're interested. Thank you! – Sarzorus Jun 06 '18 at 19:04
  • There is also a logic error in my code where I forgot to also compare exchanges, so asking for proper input/output actually helped me a lot! – Sarzorus Jun 06 '18 at 19:14

2 Answers


There are two major factors that make this code slow: the for loop, and the rbind within the loop. R is a vectorized language, so instead of looping through a list one item at a time, you can operate on an entire vector at once.
Binding involves making multiple copies of the data in R, and the object creation, deletion, and clean-up is time consuming.
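
As a rough illustration of that cost, here is a small standalone benchmark sketch (toy size, not your real data). Growing a data frame with rbind recopies it on every pass, while a preallocated vector is written in place:

n <- 5000
system.time({
    grown <- data.frame(x = numeric())
    for (i in 1:n) grown <- rbind(grown, data.frame(x = i))  # copies grown every pass
})
system.time({
    prealloc <- numeric(n)      # allocated once up front
    for (i in 1:n) prealloc[i] <- i
})

On most machines the first loop takes seconds while the second is nearly instant, and the gap widens quadratically with n.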

# make sample data
Symbol <- rep(c("A", "B", "C"), each = 10)
set.seed(1)
pClose <- rnorm(30, 100, 5)
results <- data.frame(Symbol, pClose, stringsAsFactors = FALSE)


library(dplyr)  # needed for dplyr's lag function

# vectorize the math: perform all of the calculations, including the
# ones that cross a symbol boundary (those get filtered out below)
pcc <- (results$pClose / lag(results$pClose)) - 1

# find the rows whose symbol matches the previous row's
matchingSymbol <- results$Symbol == lag(results$Symbol, default = "")

# create the final data frame from the filtered data, binding only once
answertable <- cbind(results[matchingSymbol, ], pcc[matchingSymbol])

In this solution the calculations are vectorized and the binding is done only once. This code can be further improved, but it should already provide a speed-up over your code of roughly 1000x.
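
One possible refinement, sketched here under the assumption that the real data also carries the Exchange column from the question: dplyr's group_by handles the security boundaries directly, so the manual lag comparison disappears:

library(dplyr)

answertable <- results %>%
    group_by(Symbol) %>%            # with the full data, group_by(Symbol, Exchange)
    mutate(pcc = pClose / lag(pClose) - 1) %>%
    filter(!is.na(pcc)) %>%         # drops the first row of each security
    ungroup()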

A good reference is the "R inferno" http://www.burns-stat.com/pages/Tutor/R_inferno.pdf

Dave2e
    beyond making multiple copies, when binding, R has to evaluate the type of _every_ element in a vector to make sure it uses the correct storage mode. Each time you add a row, it reevaluates all of the previous rows. For 3.2 million records, it will evaluate the type of (3,200,000 * 3,200,001) / 2 elements. – Benjamin Jun 06 '18 at 16:34
  • 1
    Just to add to what @Benjamin said: algorithmically speaking, the `rbind` inside a loop is O(n^2), while a vectorized solution is O(n). So the speedup factor increases with n, and might well be above 1000x if n is big enough – nicola Jun 07 '18 at 10:14

You can get an even bigger performance boost by using data.table, as follows:

library(data.table)
setDT(results)   # convert the data frame to a data.table in place
output <- results[,
    # within each Symbol/Exchange group, shift() lags pClose by one row;
    # [-1L] drops each group's first row, which has no previous close
    .(pDate = pDate[-1L], pcc = (pClose / shift(pClose) - 1)[-1L]),
    by = .(Symbol, Exchange)]
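
A quick check against the question's sample input (assuming the column names shown there):

library(data.table)
results <- data.table(
    Symbol   = c("APP", "APP", "APP", "APP", "APPL", "APPL"),
    Exchange = c("TSX", "TSX", "NYSE", "NYSE", "TSX", "TSX"),
    pDate    = as.Date(c("2018-01-13", "2018-01-14", "2018-01-13",
                         "2018-01-14", "2018-01-13", "2018-01-14")),
    pClose   = c(1, 2, 2, 3, 2, 3)
)
output <- results[,
    .(pDate = pDate[-1L], pcc = (pClose / shift(pClose) - 1)[-1L]),
    by = .(Symbol, Exchange)]
output
#    Symbol Exchange      pDate pcc
# 1:    APP      TSX 2018-01-14 1.0
# 2:    APP     NYSE 2018-01-14 0.5
# 3:   APPL      TSX 2018-01-14 0.5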
chinsoon12