
I'd like to improve the performance of my Dynamic-Link Library (DLL).

For that I want to use lookup tables for cos() and sin(), as I call them a lot.

As I want maximum performance, I want to create a table covering 0 to 2*PI that contains the precomputed cos and sin values.

For a good result in terms of precision, I think a 1 MB table for each function is a good trade-off between size and accuracy.

I would like to know how to create and use these tables without using an external file (as it is a DLL): I want to keep everything within one file.

Also, I don't want to compute the sin and cos functions when the plugin starts: they have to be computed once and stored in a standard vector.

But how do I do that in C++?

EDIT1: the code from jons34yp works very well for creating the vector files.

I did a small benchmark and found that if you need good precision and good speed, you can use a 250,000-entry vector and linearly interpolate between entries: the maximum error is 7.89E-11 (!), and it is the fastest of all the approximations I tried, more than 12x faster than sin() (13.296x faster, exactly). A sketch of this kind of lookup follows.
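For reference, the kind of table-plus-linear-interpolation lookup I benchmarked could look roughly like this (a minimal sketch; the names fast_sin and build_sin_table and the table size are illustrative, not actual code from this post):

#include <cmath>
#include <vector>

const unsigned N = 250000;
const double TWO_PI = 6.28318530717958647692;

// Build the table once; one extra entry so t[i + 1] is always valid.
std::vector<double> build_sin_table()
{
    std::vector<double> t(N + 1);
    for (unsigned i = 0; i <= N; i++)
        t[i] = std::sin(TWO_PI * i / N);
    return t;
}

const std::vector<double> sin_table = build_sin_table();

// Linear interpolation between the two nearest entries; x must be in [0, 2*pi).
inline double fast_sin(double x)
{
    double pos = x * (N / TWO_PI);
    unsigned i = static_cast<unsigned>(pos);
    double frac = pos - i;
    return sin_table[i] + frac * (sin_table[i + 1] - sin_table[i]);
}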

IonOne
  • http://stackoverflow.com/questions/4864866/c-c-with-gcc-statically-add-resource-files-to-executable-library – fatihk Aug 30 '13 at 11:23
  • You could write a helper program/script which will generate the table in C++ source code syntax. – Angew is no longer proud of SO Aug 30 '13 at 11:31
  • Have you profiled and determined that these two functions are indeed hotspots in your code? On modern processors those are generally pretty fast, and you may find that the table-based solution ends up being slower and less cache-friendly. – Retired Ninja Aug 30 '13 at 11:32
  • @thomas: it's for Linux, not Windows. Also, I'd like to know how to build a vector from the data and in which format to store the data. – IonOne Aug 30 '13 at 11:32
  • @ninja: yes, they are hotspots in my code. – IonOne Aug 30 '13 at 11:33

3 Answers


The easiest solution is to write a separate program that creates a .cc file with the definition of your vector.

For example:

#include <fstream>
#include <iomanip>
#include <cmath>

int main()
{
    std::ofstream out("values.cc");

    out << "#include \"static_values.h\"\n";
    out << "#include <vector>\n";

    out << "std::vector<float> pi_values = {\n";
    out << std::setprecision(10);

    // We only need to compute the range from 0 to PI/2, and use trigonometric
    // transformations for values outside this range.
    double range = 3.14159265358979323846 / 2;
    unsigned num_results = 250000;

    for (unsigned i = 0; i < num_results; i++) {
        double value = (range / num_results) * i;
        double res = std::sin(value);

        out << "    " << res << ",\n";
    }
    out << "};\n";
    out.close();
}
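A lookup that exploits the quarter-wave symmetry mentioned in the code comment could look roughly like this (a sketch, not part of the original answer; pi_values and NUM_ENTRIES stand for the generated table and its size, and lookup_sin is an illustrative name):

#include <vector>

extern std::vector<float> pi_values;    // generated by the program above
const unsigned NUM_ENTRIES = 250000;    // must match num_results in the generator

// Reduce x in [0, 2*pi) to the [0, pi/2] quarter covered by the table.
inline float lookup_sin(double x)
{
    const double pi = 3.14159265358979323846;
    double sign = 1.0;
    if (x >= pi) {              // sin(x) = -sin(x - pi)
        x -= pi;
        sign = -1.0;
    }
    if (x > pi / 2)             // sin(x) = sin(pi - x)
        x = pi - x;
    unsigned i = static_cast<unsigned>(x / (pi / 2) * NUM_ENTRIES);
    if (i >= NUM_ENTRIES) i = NUM_ENTRIES - 1;
    return sign * pi_values[i];
}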

Note that this is unlikely to improve performance, since a table of this size probably won't fit in your L2 cache. This means a large percentage of trigonometric computations will need to access RAM; each such access costs roughly several hundred CPU cycles.

By the way, have you looked at approximate SSE SIMD trigonometric libraries? This looks like a good use case for them.

  • This is an interesting question: a cache miss is expensive, but is it more or less expensive than calculating a `sin` or a `cos`? Those functions aren't cheap either. – James Kanze Aug 30 '13 at 11:45
  • And are you sure that his compiler can handle tables with 250000 entries? (I've blown up compilers before with machine-generated code.) – James Kanze Aug 30 '13 at 11:45
  • SSE SIMD libraries that I've used compute one value of sin in roughly several tens of cycles. For reduced accuracy the performance can even be increased. –  Aug 30 '13 at 11:46
  • @jons34yp: thanks. Are you sure accessing RAM is this slow? It seems to me that a cycle is a computational cost, but RAM is accessed all the time, even with small tables (I used a small table of 4096 values to compute windowing functions like Hanning or Hamming and it was so much faster), so why would it be so slow? – IonOne Aug 30 '13 at 11:47
  • @JamesKanze At least GCC handles tables of this size just fine. –  Aug 30 '13 at 11:54
  • @IonOne Yep, the internal speed of DRAM chips essentially hasn't changed much for twenty years. What has been increasing is the bandwidth between the CPU and RAM, but this doesn't reduce latency. Small tables are OK, since they reside in L1 or L2 caches. Even large tables can be OK if you access them sequentially. Large tables which are accessed randomly, which I assume is the case here, are indeed slow. –  Aug 30 '13 at 11:55
  • @jons34yp It almost certainly depends on the amount of memory available to the compiler. I know that I've had problems with g++ with tables whose total size was around 1 MB, but that was some time ago, the machine didn't have a lot of memory, and while the total was around 1 MB, it wasn't in a single table, but in a lot of smaller tables. (Which meant symbols, etc. for each.) Since then, I've learned to be leery. But in his case, the simplest solution is to give it a try. If the compiler accepts it, he can then measure the speeds, and see which is faster. – James Kanze Aug 30 '13 at 12:07
  • @jons34yp Just a note on your generator: I had the impression that he also wanted accuracy. In which case, you should declare the table `double`, output at least 17 digits precision, and use more precision for pi. – James Kanze Aug 30 '13 at 12:09
  • @jons34yp And you probably want to generate an `std::array`, or even a C style array, rather than a vector. – James Kanze Aug 30 '13 at 12:12
  • @JamesKanze `float` is probably fine for his use case, because this is a look-up table and the look-up error will be larger anyway. `array` indeed would be better, but the OP asked for a `vector`, so I didn't think too much. Feel free to edit my answer. –  Aug 30 '13 at 14:08
  • I tested rapidly with some LUTs and, even with linear interpolation, this is so much faster than sin()! Even for N=250000 tables. – IonOne Aug 30 '13 at 17:54

You can compute the table once at startup instead of storing the precomputed values in the executable:

#include <cmath>

double precomputed_sin[65536];

// Fill the table once, before main() runs, via a global object's constructor.
struct table_filler {
    table_filler() {
        for (int i = 0; i < 65536; i++) {
            precomputed_sin[i] = std::sin(i * 2 * 3.141592654 / 65536);
        }
    }
} table_filler_instance;

This way the table is computed just once at program startup and it's still at a fixed memory address. After that tsin and tcos can be implemented inline as

// sin via direct lookup; the mask wraps the index to one full period.
inline double tsin(int x) { return precomputed_sin[x & 65535]; }
// cos(x) = sin(x + pi/2); pi/2 is a quarter of the table (16384 entries).
inline double tcos(int x) { return precomputed_sin[(x + 16384) & 65535]; }
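A caller working in radians would first map the angle to a table index, along these lines (the to_index helper is illustrative, not part of the answer):

// Map a non-negative angle in radians to a table index;
// one full turn (2*pi) spans all 65536 entries.
inline int to_index(double radians)
{
    return static_cast<int>(radians * (65536.0 / (2 * 3.141592654)));
}

// Example: double s = tsin(to_index(1.5)); // approximately sin(1.5)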
6502
  • 65536 sin computations is almost the number of computations I have to do for one image, so it's not an option (btw, I wrote that I didn't want on-the-fly computed tables). – IonOne Aug 30 '13 at 12:12
  • @IonOne: I think you made some mistake in your measurements. Precomputing a sine table of 65536 elements takes a lot less than loading and initializing a DLL. If you don't reinitialize the DLL for each image why do you need to do this computation each image? BTW computation of a 65536-entries table for sine takes about 0.3ms on my PC (using 4-symmetry): you can really load and initialize your DLL more than 3000 times per second? – 6502 Aug 30 '13 at 16:12
  • okay, maybe you're right, but 65536 entries isn't enough for the precision I need (maybe with linear interpolation ...) – IonOne Aug 30 '13 at 17:26
  • @IonOne: You **always** use interpolation for such table lookups. The loss of accuracy is staggering otherwise. The only thing you'd consider is whether to use linear or better interpolation. – MSalters Aug 31 '13 at 07:42
  • @MSalters : if you have enough elements in the LUT, you won't need to interpolate, so you'll have maximum speed of computation. – IonOne Aug 31 '13 at 11:14
  • @IonOne: Nope. You're talking about a LUT which for the same accuracy is literally a thousand times bigger. That might even end up being swapped to disk. After all, since the non-interpolated LUT has so many entries, most won't be used that often. – MSalters Sep 01 '13 at 19:59
  • I liked the way you initialized it. – fuadj Aug 04 '17 at 19:46

The usual answer to this sort of question is to write a small program which generates a C++ source file with the values in a table, and compile it into your DLL. If you're thinking of tables with 128000 entries (128000 doubles are 1 MB), however, you might run up against some internal limits in your compiler. In that case, you might consider writing the values out to a file as a memory dump, and mmap-ing this file when you load the DLL. (Under Windows, I think you could even put this second file into a second stream of your DLL file, so you wouldn't have to distribute a second file.)
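On Linux, the mmap variant could look roughly like this (a minimal POSIX sketch, not code from the answer; the file name and the layout, raw doubles written back to back, are assumptions):

#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Map a file of raw doubles into memory; returns nullptr on failure.
const double* map_sin_table(const char* path, std::size_t count)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;
    void* p = mmap(nullptr, count * sizeof(double), PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);   // the mapping remains valid after closing the descriptor
    return p == MAP_FAILED ? nullptr : static_cast<const double*>(p);
}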

James Kanze