16

I have a piece of code that traces 4 sines at a time.

My original code was making roughly 12000 sin() function calls per frame and was running at 30 fps.

I tried optimizing it by generating lookup tables. I ended up with 16 different lookup tables, which I declare and load in a separate header file at the top of my program. Each table is declared like so:

static const float d4_lookup[800] = { ... };

Now, with this new method I actually lost fps?! I'm running at 20 fps now instead of 30. Each frame now only has to do 8 sinf()/cosf() calls and 19200 table lookups vs 12000 sin() calls. I compile using gcc with the -O3 flag on. At the moment, the lookup tables are included at the top and are part of the global scope of the program.

I assume I'm not loading them in the right memory or something to that effect. How can I speed up the lookup time?

** EDIT 1 **

As requested, here's the function that uses the lookup calls, it is called once per frame:

void
update_sines(void)
{
    static float c1_sin, c1_cos;
    static float c2_sin, c2_cos;
    static float c3_sin, c3_cos;
    static float c4_sin, c4_cos;

    /* spec, s, ms and etime are globals defined elsewhere in the program;
       concatenate() combines whole seconds and centiseconds into etime */
    clock_gettime(CLOCK_MONOTONIC, &spec);
    s = spec.tv_sec;
    ms = spec.tv_nsec * 0.0000001;
    etime = concatenate((long)s, ms);

    c1_sin = sinf(etime * 0.00525);
    c1_cos = cosf(etime * 0.00525);
    c2_sin = sinf(etime * 0.007326);
    c2_cos = cosf(etime * 0.007326);
    c3_sin = sinf(etime * 0.0046);
    c3_cos = cosf(etime * 0.0046);
    c4_sin = sinf(etime * 0.007992);
    c4_cos = cosf(etime * 0.007992);

    int k;
    for (k = 0; k < 800; ++k)
    {       
        sine1[k] = a1_lookup[k] * ((bx1_sin_lookup[k] * c1_cos) + (c1_sin * bx1_cos_lookup[k])) + d1_lookup[k];
        sine2[k] = a2_lookup[k] * ((bx2_sin_lookup[k] * c2_cos) + (c2_sin * bx2_cos_lookup[k])) + d2_lookup[k] + 50;
        sine3[k] = a3_lookup[k] * ((bx3_sin_lookup[k] * c3_cos) + (c3_sin * bx3_cos_lookup[k])) + d3_lookup[k];
        sine4[k] = a4_lookup[k] * ((bx4_sin_lookup[k] * c4_cos) + (c4_sin * bx4_cos_lookup[k])) + d4_lookup[k] + 50;
    }

}
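
For clarity: each line in that loop evaluates sine[x] = A * sin(B * x + C) + D using the angle-addition identity

    sin(B * x + C) = sin(B * x) * cos(C) + cos(B * x) * sin(C)

where sin(B * x) and cos(B * x) are precomputed per pixel in the bx*_sin_lookup / bx*_cos_lookup tables, so only cos(C) and sin(C) (the 8 sinf()/cosf() calls above) change from frame to frame.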

** UPDATE **

For anyone reading this thread, I gave up on this problem. I tried using OpenCL kernels, structs, SIMD instructions, as well as all the solutions shown here. In the end, the original code that computed sinf() 12,800 times per frame worked faster than the lookup tables, since the lookup tables didn't fit into the cache. Yet it was still only doing 30 fps; it just had too much going on to keep up with my 60 fps expectations. I've decided to take a different direction. Thanks to everyone who contributed to this thread. Most of these solutions would probably work to get some half-decent speed improvements, but nothing like the 2x speed-up I needed here to have the lookup tables work the way I wanted.

ReX357
  • Why do you have 19200 lookup calls as opposed to 12000 sin() calls? – rohit89 Jan 03 '14 at 09:37
  • How do you access lookup tables and how are you calculating your index? – Martin Perry Jan 03 '14 at 09:37
  • If you expect us to figure out how you went wrong, show your code. – Barmar Jan 03 '14 at 09:38
  • @MartinPerry simply call the table like a1_lookup[i] wherever I need the value. – ReX357 Jan 03 '14 at 09:38
  • @Barmar Original post was updated with code as requested. – ReX357 Jan 03 '14 at 09:40
  • @rohit89 I originally had a function of the type sine[x] = A * sin(Bx + C) + D. Now, I was trying to cut the sin() calls down so I used a trig identity to calculate all the frames from x = 0. The equivalence is sin(A + B) = sin(A)cos(B) + sin(B)cos(A). Therefore, in the middle of the equation, I have now 2 lookup calls instead of 1 sin call. Hope I'm making sense. – ReX357 Jan 03 '14 at 09:42
  • Did you actually profile the code before doing all that? – Dariusz Jan 03 '14 at 09:43
  • We don't need to see the code that creates the lookup tables, we need to see the code that USES it. That's the bottleneck, isn't it? – Barmar Jan 03 '14 at 09:44
  • @Dariusz Yes. My bottleneck was 100.7% (lol) coming from that function and since I can't see additions and multiplications eating that much memory, the sin() calls were the only thing left creating a bottleneck. – ReX357 Jan 03 '14 at 09:44
  • BTW, if lookup tables were an efficient way to compute this, don't you think the built-in functions would use them? – Barmar Jan 03 '14 at 09:44
  • @Barmar that is the code that uses it! I'm calculating the sines from the values pulled from the lookup tables. – ReX357 Jan 03 '14 at 09:45
  • It makes 8 calls to `sinf` and `cosf`, how can that be faster than calling `sin` once? – Barmar Jan 03 '14 at 09:46
  • @Barmar Yes. Look at the for loop at the bottom. There is only a few values that I compute right before I calculate the sines because x is measured according to the current system time in centiseconds. – ReX357 Jan 03 '14 at 09:47
  • The total size of your lookup tables is 50kB. This exceeds the L1 cache... – Oliver Charlesworth Jan 03 '14 at 09:48
  • @Barmar 8 calls to sin/cos per frame vs 12000 sin/cos calls per frame? – ReX357 Jan 03 '14 at 09:48
  • @OliCharlesworth Hey man, glad you found me again haha. What's my best bet? My tables use up 125kb of space split into 16 tables. – ReX357 Jan 03 '14 at 09:49
  • Why are your lookups in the 0..799 range? Seems weird for a 3.14-periodic function. Maybe make 157- or 314-sized lookups if the precision is satisfactory? Besides, why not a single lookup function for `sin` itself and sin only? Less memory throughput (not loading all 16 lookups...). Also remember that `sin a = cos (a-PI/2)`. It's periodic, so a single lookup table and parameter modification will be enough. – Dariusz Jan 03 '14 at 09:51
  • I guess I don't understand what this code is replacing. I was expecting to see a function that was a drop-in replacement for `sin`, but used a lookup table instead of computing the sin. How does the above code compute the same thing that calling `sin` 12000 times does? – Barmar Jan 03 '14 at 09:51
  • @Dariusz I plot this line in an 800-pixel-wide window. The amplitude of the sine is a sinusoidal function itself. I am randomizing sines over time based on time elapsed. Think of the PlayStation 3 OS background. That's basically what I'm doing here. A, B and D are sinusoidal functions as well and C is driven by elapsed time. So the final equation for 1 sine looks like sin * sin(sin + C) + sin. – ReX357 Jan 03 '14 at 09:56
  • "BTW, if lookup tables were an efficient way to compute this, don't you think the built-in functions would use them?" -- different domains, of course. – Jim Balter Jan 03 '14 at 09:57
  • @Barmar Originally, the for loop was computing A * sin(B * x + C) + D 16 times per for loop times 800 times (so 12800 times per frame). I simply precomputed all the sin calls I could precompute and managed to take my sin calls down to 8 per frame. – ReX357 Jan 03 '14 at 10:00
  • @Barmar If you wanna look at the original function, I think Oli edited my post at the top and linked the words "original code" to my other question where I was asking about optimization. The original code is there for you to see. – ReX357 Jan 03 '14 at 10:03
  • just try and replace sin: http://lab.polygonal.de/?p=205 – Dariusz Jan 03 '14 at 10:03
  • Lookup tables give the best performance for trigonometric functions if they are kept small enough that you can take advantage of the cache memory. Once your table gets too big, you actually suffer because of cache locality. One nice compromise is the CORDIC algorithm (it uses smaller lookup tables and is not that computationally intensive). You can find the code online, so it should be straightforward to test (a rough sketch follows this comment thread). – Pandrei Jan 03 '14 at 10:50
  • @OliCharlesworth Is there any way that I could store these lookup tables in VBOs and use the GPU to process the arithmetic? – ReX357 Jan 04 '14 at 06:07
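
To make the CORDIC suggestion above concrete, here is a minimal rotation-mode sketch in C. This is an illustration only, not code from the thread: the iteration count and names are arbitrary, and real CORDIC implementations usually use fixed-point rather than float arithmetic.

#include <math.h>

#define CORDIC_ITERS 16

static float cordic_angles[CORDIC_ITERS]; /* atan(2^-i) */
static float cordic_gain;                 /* product of cos(atan(2^-i)) */

static void cordic_init(void)
{
    cordic_gain = 1.0f;
    for (int i = 0; i < CORDIC_ITERS; ++i) {
        cordic_angles[i] = atanf(ldexpf(1.0f, -i));
        cordic_gain *= cosf(cordic_angles[i]);
    }
}

/* Rotation-mode CORDIC: approximates sinf(a) for |a| <= ~1.74 rad;
   larger angles need the same range reduction as the lookup approaches. */
static float cordic_sinf(float a)
{
    float x = cordic_gain, y = 0.0f; /* pre-scaled, so no final multiply */
    for (int i = 0; i < CORDIC_ITERS; ++i) {
        float dx = ldexpf(x, -i); /* x * 2^-i */
        float dy = ldexpf(y, -i); /* y * 2^-i */
        if (a >= 0.0f) { x -= dy; y += dx; a -= cordic_angles[i]; }
        else           { x += dy; y -= dx; a += cordic_angles[i]; }
    }
    return y; /* x now approximates the cosine of the original angle */
}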

4 Answers

7

Sometimes it's hard to know what's slowing you down, but potentially you are going to ruin your cache hits; you could try a lookup table of structs:

typedef struct
{
  float a1, a2, a3, a4;
  float bx1_sin, bx2_sin, bx3_sin, bx4_sin;
  float bx1_cos, bx2_cos, bx3_cos, bx4_cos;
  float d1, d2, d3, d4;
  float sine1, sine2, sine3, sine4; /* the results live here too */
} lookup_table;

then

lookup_table lookup[800];

Now everything needed at the kth lookup will be in the same small chunk of memory.
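
A minimal sketch of the per-frame loop against that layout (assuming the field names above and the c*_sin / c*_cos values from the question's update_sines()):

for (k = 0; k < 800; ++k)
{
  lookup_table *t = &lookup[k]; /* everything for index k in one ~80-byte record */
  t->sine1 = t->a1 * ((t->bx1_sin * c1_cos) + (c1_sin * t->bx1_cos)) + t->d1;
  t->sine2 = t->a2 * ((t->bx2_sin * c2_cos) + (c2_sin * t->bx2_cos)) + t->d2 + 50;
  t->sine3 = t->a3 * ((t->bx3_sin * c3_cos) + (c3_sin * t->bx3_cos)) + t->d3;
  t->sine4 = t->a4 * ((t->bx4_sin * c4_cos) + (c4_sin * t->bx4_cos)) + t->d4 + 50;
}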

Also, if you use a macro that takes k as a parameter to do the contents of the loop, let's say SINE_CALC(k), or an inline function...

you can do

for (k = 0; k < 800; )
{
  SINE_CALC(k); k++;
  SINE_CALC(k); k++;
  SINE_CALC(k); k++;
  SINE_CALC(k); k++;
  SINE_CALC(k); k++;
}

If you do a macro, make sure the k++ is outside the macro call, like shown. (Note the increment is now only inside the body; the original version also incremented k in the for header, which would have skipped every sixth element.)
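
For illustration, SINE_CALC could be defined over the question's original tables like this (a sketch; the do { ... } while (0) wrapper just makes the macro behave like a single statement):

#define SINE_CALC(k) do {                                            \
  sine1[k] = a1_lookup[k] * ((bx1_sin_lookup[k] * c1_cos) +          \
             (c1_sin * bx1_cos_lookup[k])) + d1_lookup[k];           \
  sine2[k] = a2_lookup[k] * ((bx2_sin_lookup[k] * c2_cos) +          \
             (c2_sin * bx2_cos_lookup[k])) + d2_lookup[k] + 50;      \
  sine3[k] = a3_lookup[k] * ((bx3_sin_lookup[k] * c3_cos) +          \
             (c3_sin * bx3_cos_lookup[k])) + d3_lookup[k];           \
  sine4[k] = a4_lookup[k] * ((bx4_sin_lookup[k] * c4_cos) +          \
             (c4_sin * bx4_cos_lookup[k])) + d4_lookup[k] + 50;      \
} while (0)

Passing plain k (never k++) as the argument also avoids the multiple-evaluation trap, since the macro expands k several times.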

Keith Nicholas
  • This is an interesting concept. So I would basically create this struct and have a function inside of it that would return the k'th position of each table as a 16 value array? Still though, the lookup calls still end up being made just in a different part of the program no? I'm sorry I'm not exactly familiar with the way C manages memory at the low level and to me it just looks like an extra call, and an extra array saved up in memory. – ReX357 Jan 03 '14 at 10:36
  • C memory is very simple -- arrays are linear in memory. What more do you need to know about it? Everything else is up to the processor, not C. – Barmar Jan 03 '14 at 10:38
  • when you make several different arrays, they are in different chunks of memory. Cache hits come when you fetch something already cached; the struct is basically making sure that all the info for the kth iteration will be in the cache. – Keith Nicholas Jan 03 '14 at 10:39
  • Automatic variables are on the stack, global and static variables are in the data segment, and dynamically allocated data is in the heap. But they're all just different areas of memory, implemented exactly the same by the processor. – Barmar Jan 03 '14 at 10:39
  • I think you should go to the Wikipedia pages on l1 and l2 cache, as well as googling them. You'll find lots of university lecture notes about how memory cache hierarchies work. – Barmar Jan 03 '14 at 10:40
  • @Barmar Do the stack, the heap and the data segment all have the same speed of access? – ReX357 Jan 03 '14 at 10:41
  • @ReX357 You ignored the important statement in the answer: "now everything at the kth lookup will be in the same small chunk of memory". The result is that you will have no cache misses. – Jim Balter Jan 03 '14 at 10:41
  • @ReX357 RAM is RAM ... they all have the same access speed. – Jim Balter Jan 03 '14 at 10:42
  • also sine1,2,3,4 should be in the cache..... I'm not sure what will happen with c1_sin c2_sin etc – Keith Nicholas Jan 03 '14 at 10:44
  • Keith, you're on the right track but you really need to work this out more. k shouldn't be bumped inside the loop; there should be another index that ranges over 1,2,3,4. – Jim Balter Jan 03 '14 at 10:47
  • stack, heap and data segment are sections of RAM; however, the CPU does its own thing when actually executing. Access to RAM is slow compared to the blinding speed the CPU can operate at, so it has caches between the RAM and the processor that are super fast. The CPU also has instruction pipelines... some data ends up in registers, some ends up in cache. Now you can kill your cache by asking for something that's not in cache, and you can kill your pipelining and cache by any branching – Keith Nicholas Jan 03 '14 at 10:48
  • operative word being *can*. It's hard to predict what's actually going to happen. In my experience, loop unrolling ( eliminating branching ) often gets the most gains. – Keith Nicholas Jan 03 '14 at 10:53
  • Also, the struct should contain result (sine), a, bx_sin, bx_cos, and d, with 4 separate tables. Two nested loops, with the 1-4 (or 0-3) loop outside and the 0-799 loop inside. – Jim Balter Jan 03 '14 at 10:53
  • I'm not sure this approach will make a great deal of difference to the cache hit ratio. Sure, you minimise the cache misses for a *single* iteration (compared to the OP's original layout), but on the next iteration, you'll get another cache miss (whereas the OP's layout won't). So it probably works out about the same, AFAICS. (The only difference would be if the OP's arrays were a power-of-two and thus causing aliasing, but they aren't...) – Oliver Charlesworth Jan 04 '14 at 13:46
  • @OliCharlesworth Would I benefit from increasing my array size to 1024 padding the end with zeros? – ReX357 Jan 04 '14 at 22:16
5

Try splitting your loop up like this:

for (k = 0; k < 800; ++k)
{       
    sine1[k] = a1_lookup[k];
    sine2[k] = a2_lookup[k];
    sine3[k] = a3_lookup[k];
    sine4[k] = a4_lookup[k];
}
for (k = 0; k < 800; ++k)
{       
    sine1[k] *= ((bx1_sin_lookup[k] * c1_cos) + (c1_sin * bx1_cos_lookup[k]));
    sine2[k] *= ((bx2_sin_lookup[k] * c2_cos) + (c2_sin * bx2_cos_lookup[k]));
    sine3[k] *= ((bx3_sin_lookup[k] * c3_cos) + (c3_sin * bx3_cos_lookup[k]));
    sine4[k] *= ((bx4_sin_lookup[k] * c4_cos) + (c4_sin * bx4_cos_lookup[k]));
}
for (k = 0; k < 800; ++k)
{       
    sine1[k] += d1_lookup[k];
    sine2[k] += d2_lookup[k] + 50;
    sine3[k] += d3_lookup[k];
    sine4[k] += d4_lookup[k] + 50;
}

By accessing fewer lookup tables in each loop, you should be able to stay in the cache. The middle loop could be split up as well, but you'll need to create an intermediate table for one of the sub-expressions.
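
For the first sine, that further split might look like this (tmp1 is a hypothetical scratch array, not something from the question; the other three sines would follow the same pattern):

static float tmp1[800];

for (k = 0; k < 800; ++k)
    tmp1[k] = (bx1_sin_lookup[k] * c1_cos) + (c1_sin * bx1_cos_lookup[k]);
for (k = 0; k < 800; ++k)
    sine1[k] *= tmp1[k];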

Barmar
  • Unfortunately I got no improvement from this. I wonder if I could include the required tables within the scope of the for loops they will be used in instead of loading all 16 of them at once at the top? – ReX357 Jan 03 '14 at 10:20
  • I don't see how that could help. Stack memory is no faster to access than static data, and you'll have to copy the tables to the stack every time. – Barmar Jan 03 '14 at 10:23
  • I think you may need to use some low-level diagnostics to see your cache effectiveness during this code. – Barmar Jan 03 '14 at 10:24
  • Any chance you could guide me in the right direction for the low level diagnosing? I've used valgrind and gprof but with limited insight into my problem. I do not have a growing memory leak on valgrind and gprof tells me that update_sines() uses 100% of the cpu time. I'm unfamiliar with the tools you can use. I would assume something that could show me how much cache is used / requested in real time perhaps? – ReX357 Jan 03 '14 at 10:26
  • Could I perhaps allocate more cache to my program? – ReX357 Jan 03 '14 at 10:28
  • The only thing I can find is: http://stackoverflow.com/questions/2486840/linux-c-how-to-profile-time-wasted-due-to-cache-misses – Barmar Jan 03 '14 at 10:29
  • Caches are hardware, built into the CPU. – Barmar Jan 03 '14 at 10:30
  • " loading all 16 of them at once at the top?" -- Nothing's getting loaded there. You have a fundamental misunderstanding of the memory model. – Jim Balter Jan 03 '14 at 10:35
  • @JimBalter Not gonna argue that statement 1 second. I'm a hobbyist programmer and this is teaching me a great deal right now. – ReX357 Jan 03 '14 at 10:37
  • @ReX357 you can use the `perf` tool to get stats on cache performance if you're on Linux (example invocation below). – rohit89 Jan 03 '14 at 10:41
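
For reference, a typical perf invocation for cache statistics might look like this (generic event names; ./myprog stands in for the actual binary):

perf stat -e cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses ./myprog
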
1

Intel processors can predict serial access (and perform prefetch) for up to 4 arrays, for both forward and backward traversal. At least this was true in Core 2 Duo days. Split your loop into:

for (k = 0; k < 800; ++k)
    sine1[k] = a1_lookup[k] * ((bx1_sin_lookup[k] * c1_cos) + (c1_sin * bx1_cos_lookup[k])) + d1_lookup[k];
for (k = 0; k < 800; ++k)
    sine2[k] = a2_lookup[k] * ((bx2_sin_lookup[k] * c2_cos) + (c2_sin * bx2_cos_lookup[k])) + d2_lookup[k] + 50;
for (k = 0; k < 800; ++k)
    sine3[k] = a3_lookup[k] * ((bx3_sin_lookup[k] * c3_cos) + (c3_sin * bx3_cos_lookup[k])) + d3_lookup[k];
for (k = 0; k < 800; ++k)
    sine4[k] = a4_lookup[k] * ((bx4_sin_lookup[k] * c4_cos) + (c4_sin * bx4_cos_lookup[k])) + d4_lookup[k] + 50;

I guess you have more cache load than the benchmarks in the other answers assume, so this matters. I recommend not unrolling loops by hand; compilers do it well.

Aleksandr Pakhomov
0

Using a simple sin lookup table yields a >20% speed increase on my Linux machine (VM, gcc, 64-bit). Interestingly, the size of the lookup table (within reasonable values smaller than the L1 cache) does not influence the speed of execution.

Using a simple fastsin implementation from here I got a >45% improvement.

Code:

#include <math.h>
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <sys/time.h>
#include <time.h>

#define LOOKUP_SIZE 628


/* wall-clock time in microseconds, shifted to local time;
   only differences are used below, so the offset cancels out */
uint64_t currentTimestampUs( void )
{
  struct timeval tv;
  time_t localTimeRet;
  uint64_t timestamp = 0;
  //time_t tzDiff = 0;
  struct tm when;
  int64_t localeOffset = 0;

  {
    localTimeRet = time(NULL);
    localtime_r ( &localTimeRet, &when );
    localeOffset = when.tm_gmtoff * 1000000ll;
  }

  gettimeofday ( &tv, NULL );
  timestamp = ((uint64_t)((tv.tv_sec) * 1000000ll) ) + ( (uint64_t)(tv.tv_usec) );
  timestamp+=localeOffset;

  return timestamp;
}


const double PI   = 3.141592653589793238462;
const double PI2  = 3.141592653589793238462 * 2;

/* one extra slot so sinarr[LOOKUP_SIZE] holds sin(pi/2) exactly;
   without it, an index of exactly LOOKUP_SIZE reads out of bounds */
static float sinarr[LOOKUP_SIZE + 1];

void initSinArr() {
  int a = 0;
  for (a = 0; a <= LOOKUP_SIZE; a++) {
    double arg = (1.0*a/LOOKUP_SIZE)*((double)PI * 0.5);
    float sinval_f = sin(arg); // compute in double first to avoid losing precision
    sinarr[a] = sinval_f;
  }
}

float sinlookup(float val) {
  float normval = val;
  while (normval < 0) {
    normval += PI2;
  }
  while (normval > PI2) {
    normval -= PI2;
  }
  /* index in [0, 4*LOOKUP_SIZE]: LOOKUP_SIZE steps per quarter wave */
  int index = LOOKUP_SIZE*(2*normval/PI);

  /* fold quadrants II-IV back onto the first-quadrant table */
  if (index > 3*LOOKUP_SIZE) {
    index = 4*LOOKUP_SIZE - index;
    return -sinarr[index];
  } else if (index > 2*LOOKUP_SIZE) {
    index = index - 2*LOOKUP_SIZE;
    return -sinarr[index];
  } else if (index > LOOKUP_SIZE) {
    index = 2*LOOKUP_SIZE - index;
    return sinarr[index];
  } else {
    return sinarr[index];
  }
}


/* low-precision parabolic sine approximation (from the article linked above) */
float sin_fast(float x) {
  while (x < -PI)
      x += PI2;

  while (x >  PI)
      x -= PI2;

  //compute sine
  if (x < 0)
      return 1.27323954 * x + .405284735 * x * x;
  else
      return 1.27323954 * x - 0.405284735 * x * x;

}

int main(void) {
  initSinArr();
  int a = 0;
  float val = 0;
  const int num_tries = 100000;

  uint64_t startLookup = currentTimestampUs();

  for (a=0; a<num_tries; a++) {
    for (val=0; val<PI2; val+=0.01) {
      float compval = sinlookup(val);
      (void)compval;
    }
  }

  uint64_t startSin = currentTimestampUs();

  for (a=0; a<num_tries; a++) {
    for (val=0; val<PI2; val+=0.01) {
      float compval = sin(val);
      (void)compval;
    }
  }

  uint64_t startFastSin = currentTimestampUs();

  for (a=0; a<num_tries; a++) {
    for (val=0; val<PI2; val+=0.01) {
      float compval = sin_fast(val);
      (void)compval;
    }
  }
  uint64_t end = currentTimestampUs();

  int64_t lookupMs = (startSin - startLookup)/1000;
  int64_t sinMs = (startFastSin - startSin)/1000;
  int64_t fastSinMs = (end - startFastSin)/1000;
  printf(" lookup: %lld ms\n", lookupMs );
  printf("    sin: %lld ms\n", sinMs );
  printf("   diff: %lld ms\n", sinMs-lookupMs);
  printf("  diff%: %lld %\n", 100*(sinMs-lookupMs)/sinMs);

  printf("fastsin: %lld ms\n", fastSinMs );
  printf("    sin: %lld ms\n", sinMs );
  printf("   diff: %lld ms\n", sinMs-fastSinMs);
  printf("  diff%: %lld %\n", 100*(sinMs-fastSinMs)/sinMs);
}

Sample result:

 lookup: 2276 ms
    sin: 3004 ms
   diff: 728 ms
  diff%: 24 %
fastsin: 1500 ms
    sin: 3004 ms
   diff: 1504 ms
  diff%: 50 %
Dariusz
  • I just tried implementing this and it's still not going faster. Lost a bit of framerate with this implementation. – ReX357 Jan 03 '14 at 11:56
  • @ReX357 lookup or fast_sin? Either way, in relation to your original implementation, it should be faster. – Dariusz Jan 03 '14 at 12:15
  • Implemented the fast sin function. Put it within the for loop scope. Maintained the exact same frame rate. – ReX357 Jan 03 '14 at 12:39