
I'm looking into converting an R script to C, both for speed and so that it can be packaged as an .exe. I'm new to C.

My question is: will it be significantly faster in C? The rate-limiting step is a sort algorithm that has to be applied many times to big vectors. I'm not sure whether R's vectorized functions will help here or slow things down. I've also read that for-loops are inefficient in R.

If I should do this in C, what libraries might help me mimic some of R's data-processing functions, such as basic matrix manipulations? Where should I look to get started? Right now I don't even know how to read my data (a comma-delimited text file) into C.

JoshDG
  • What sort of speed-up do you hope to achieve? Why do you think that R will be significantly slower than C for these operations? Have you studied http://stackoverflow.com/questions/1330944/speed-of-r-programming-language ? – High Performance Mark Apr 13 '12 at 15:14
  • in particular, `rowSums` and `colSums` are already pretty fast -- you probably won't be able to squeeze out a huge amount of performance there (and you should consider the fairly rich sparse-matrix support in the `Matrix` package, if your matrices are sparse ...). Can you be a little more specific about "the ability to have it run other users' machines"? – Ben Bolker Apr 13 '12 at 15:18
  • That will be a lot of work, especially since you are starting from scratch with C programming. You could [install R](http://cran.r-project.org/bin/windows/rw-FAQ.html) on the users' machines in less time than it would take to convert the R code by hand. – hardmath Apr 13 '12 at 15:18
  • @Jack I'm not asking anyone to write code for me. I'm asking if a sort algorithm in a C script will be significantly more efficient than in an R script. I think that's a fair question. I'll rephrase it to make it sound less like I'm asking to have my code written for me. – JoshDG Apr 13 '12 at 15:30
  • Following up on @BenBolker's comment, new in R-2.15.0 are the "‘bare-bones’ functions `.colSums()`, `.rowSums()`, `.colMeans()` and `.rowMeans()` for use in programming where ultimate speed is required." (Quoted from the R-2.15.0 `NEWS` file.) I think they just do less checking and require you to supply the matrix dimensions before performing their calculations. – Josh O'Brien Apr 13 '12 at 15:34
  • @JoshO'Brien and Ben Bolker I wasn't so direct in my original post. I think where C might have an edge is with both for-loops (which are apparently slow in R) and with the sort algorithm (if I can optimize it even a little in C, it will have a big effect because it runs so many times). – JoshDG Apr 13 '12 at 15:43
  • R's `sort` on vectors uses sensible algorithms implemented in C, so there wouldn't be much scope for speed-up. Maybe you've mis-diagnosed where your bottleneck is, e.g., (doing anything to) a large number of small vectors? But then there's likely a way to recast your problem to make fewer iterations. – Martin Morgan Apr 13 '12 at 15:52
  • Nobody has mentioned _profile, profile, profile_ before doing redesign. – Dirk Eddelbuettel Apr 13 '12 at 15:53
  • @DirkEddelbuettel: what? Why not just spend a few days/weeks learning and re-implementing an algorithm in a new language? Then you can test if it's really faster or not in your specific case, _and_ you get to learn a new language! It's not like there's anything better to do... – Joshua Ulrich Apr 13 '12 at 15:58
  • @MartinMorgan - I think the default `sort` (shell sort) is actually pretty bad in R. See my answer for some comparisons. – Tommy Apr 13 '12 at 18:30

1 Answer


I'll try to answer this question as well as I can.

...but the question you're NOT asking is perhaps more relevant: Can the R algorithm be made faster in R? The answer here is usually "yes". Can it be "fast enough"? Well, that is impossible to answer without trying (and seeing the current R code).

Q: Will my R algorithm be faster in C?

A: Yes! If you write the "best" C code for the algorithm, it will most likely be faster. It will most likely also be a lot more work to do so.

Q: Can sorting of large vectors be done faster in C?

A: Yes. Using multi-threading, you can improve the speed quite a lot. But start by calling `sort(x, method='quick')` in R and see if that improves things! The default method isn't very fast for random data.

    x <- runif(1e7)
    system.time( sort(x) )                   # 2.50 secs
    system.time( sort(x, method='quick') )   # 1.37 secs
    #system.time( tommysort(x) )             # 0.51 secs (4 threads)
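
For comparison, here is a minimal sketch of sorting a vector of doubles in plain C with the standard library's `qsort`. It is single-threaded and is not the multi-threaded `tommysort` timed above. The helper name `sort_doubles` is made up for illustration, and NaN values are assumed not to occur. Whether it beats R's `sort` on your data is something to measure rather than assume.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Comparison callback for qsort: orders doubles ascending.
 * Two comparisons are used instead of a subtraction so the result
 * is always -1, 0 or 1. NaN values are not handled (assumption). */
static int compare_doubles(const void *a, const void *b)
{
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Sort n doubles in place, ascending.
 * sort_doubles is a hypothetical helper name, not a library function. */
void sort_doubles(double *v, size_t n)
{
    assert(v != NULL || n == 0);
    if (n > 1)
        qsort(v, n, sizeof v[0], compare_doubles);
}
```

Calling `sort_doubles(v, n)` on an array of `n` doubles reorders it in place. Note that `qsort` calls the comparison function through a pointer for every comparison, so a hand-written sort specialized for doubles may be faster still.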

Q: What libraries mimic basic R functions?

A: LAPACK/BLAS handles matrix math in R. If that's all you need, you can find libraries that are much faster than the vanilla ones in R (you can use some of them in R too to improve performance!).

More info on BLAS
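
For operations simpler than what BLAS covers, such as the `rowSums` and `colSums` mentioned in the comments, plain loops may be all you need. The sketch below assumes the matrix is stored column-major in a flat array, which is how R lays out matrices. The function names `col_sums` and `row_sums` are invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Element (i, j) of an nrow-by-ncol column-major matrix
 * lives at m[i + j * nrow]. */

/* Write the sum of each column into out[0..ncol-1]. */
void col_sums(const double *m, size_t nrow, size_t ncol, double *out)
{
    assert(m != NULL && out != NULL);
    for (size_t j = 0; j < ncol; j++) {
        double s = 0.0;
        for (size_t i = 0; i < nrow; i++)
            s += m[i + j * nrow];
        out[j] = s;
    }
}

/* Write the sum of each row into out[0..nrow-1].
 * The loops still walk memory column by column, which keeps
 * the accesses to m contiguous. */
void row_sums(const double *m, size_t nrow, size_t ncol, double *out)
{
    assert(m != NULL && out != NULL);
    for (size_t i = 0; i < nrow; i++)
        out[i] = 0.0;
    for (size_t j = 0; j < ncol; j++)
        for (size_t i = 0; i < nrow; i++)
            out[i] += m[i + j * nrow];
}
```

With the array `{1, 2, 3, 4, 5, 6}` and `nrow = 2`, `ncol = 3` (the same layout as `matrix(1:6, nrow = 2)` in R), the column sums are 3, 7, 11 and the row sums are 9, 12.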

Another way is to make a `.Call` from R to C and from there you have access to all of R's functionality! The `inline` package and the `Rcpp` package can help make it easier.

A third way is to embed R in your application. RInside can help make that easier.

Q: How do I read CSV data into C?

A: Look at the `fopen` and `fscanf` functions, and use them to write a data import function.
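
As a rough sketch of such an import function: the code below assumes the file contains only numbers separated by commas and newlines (no header row, no quoted or text fields) and reads them all into one flat, growable array. The name `read_csv_doubles` is hypothetical. Reading stops at the first field that is not a number, and row and column counts are not tracked, so you would need to add that if you want a matrix.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Read every number from a comma-delimited text file into a
 * malloc'd array, in file order. On success, returns the array and
 * stores the number of values in *count; the caller must free() it.
 * Returns NULL if the file cannot be opened or memory runs out.
 * read_csv_doubles is a hypothetical name used for illustration. */
double *read_csv_doubles(const char *path, size_t *count)
{
    assert(path != NULL && count != NULL);
    *count = 0;

    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return NULL;

    size_t cap = 1024, n = 0;
    double *data = malloc(cap * sizeof *data);
    if (data == NULL) {
        fclose(fp);
        return NULL;
    }

    double value;
    /* "%lf" skips leading whitespace, including newlines. */
    while (fscanf(fp, "%lf", &value) == 1) {
        if (n == cap) {                      /* grow the buffer */
            cap *= 2;
            double *tmp = realloc(data, cap * sizeof *tmp);
            if (tmp == NULL) {
                free(data);
                fclose(fp);
                return NULL;
            }
            data = tmp;
        }
        data[n++] = value;

        /* Skip optional whitespace and one comma, if present. */
        if (fscanf(fp, " ,") == EOF)
            break;
    }

    fclose(fp);
    *count = n;
    return data;
}
```

A call such as `double *x = read_csv_doubles("data.csv", &n);` (with `data.csv` standing in for your own file) gives you the values in `x[0]` through `x[n - 1]`. Check the result for `NULL` before using it and `free(x)` when done.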

Tommy
  • The target platform must be Windows (.exe is mentioned), and probably AMD64 (32-bit is getting old, and almost nobody owns an Itanium). A good BLAS will use AVX in AMD64, which can speed up the code dramatically (the alternative is a loop, of course). And even in C, Dirk's comment above applies: profile, profile, profile. – Matthew Lundberg Apr 14 '12 at 00:51