Consider the following code snippet:

double *x, *id;
int i, n; // = vector size

// allocate and zero x
// set id to 0:n-1

for(i=0; i<n; i++) {  
  long iid = (long)id[i];
  if(iid>=0 && iid<n && (double)iid==id[i]){
    x[iid] = 1;
  } else break;
}

The code uses the values in vector id, of type double, as indices into vector x. For the indices to be valid I verify that they are greater than or equal to 0, less than the vector size n, and that the doubles stored in id are in fact integers. In this example id stores the integers from 0 to n-1, so all vectors are accessed linearly and the branch in the if statement should always be predicted correctly.

For n=1e8 the code takes 0.21s on my computer. Since this seems to me a computationally lightweight loop, I expect it to be memory bandwidth bound. Based on the benchmarked memory bandwidth I expect it to run in 0.15s. I calculate the memory footprint as 8 bytes per id value and 16 bytes per x value (x has to be both written and read from memory, since I assume SSE streaming stores are not used), so 24 bytes per vector entry in total.
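
(To spell the estimate out: 24 bytes × 1e8 entries = 2.4 GB of memory traffic, and moving 2.4 GB in 0.15 s corresponds to a sustained bandwidth of 16 GB/s, which is presumably close to the benchmarked figure.)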

The questions:

  • Am I wrong in saying that this code should be memory bandwidth bound, and that it can be improved?
  • If not, do you know a way in which I could improve the performance so that it works at the speed of the memory?
  • Or maybe everything is fine and I cannot easily improve it other than by running it in parallel?

Changing the type of id is not an option - it must be double. Also, in the general case id and x have different sizes and must be kept as separate arrays, since they come from different parts of the program. In short, I wonder whether it is possible to write the bound checks and the cast/integer validation in a more efficient manner.
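
One micro-optimization worth sketching here (my addition, not from the original post, and not benchmarked): since n > 0, casting the index to an unsigned type folds the two range checks into a single comparison, because a negative iid wraps around to a huge unsigned value:

long iid = (long)id[i];
// equivalent to iid >= 0 && iid < n when n > 0
if ((unsigned long)iid < (unsigned long)n && (double)iid == id[i]) {
  x[iid] = 1;
} else break;

Whether this actually saves anything depends on the compiler, which may already merge the two comparisons.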

For convenience, the entire code:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h> // for gettimeofday() and struct timeval

static struct timeval tb, te;

void tic()
{
  gettimeofday(&tb, NULL);
}

void toc(const char *idtxt)
{
  long s,u;
  gettimeofday(&te, NULL);
  s=te.tv_sec-tb.tv_sec;
  u=te.tv_usec-tb.tv_usec;
  printf("%-30s%10li.%.6li\n", idtxt, 
     (s*1000000+u)/1000000, (s*1000000+u)%1000000);
}

int main(int argc, char *argv[])
{
  double *x  = NULL;
  double *id = NULL;
  int i, n;

  // vector size is a command line parameter
  n = atoi(argv[1]);
  printf("x size %i\n", n);

  // allocate and zero x (not included in the timing, as in MATLAB)
  x = calloc(n, sizeof(double));
  memset(x, 0, sizeof(double)*n); // redundant after calloc, but touches the pages before timing

  // create index vector
  tic();
  id  = malloc(sizeof(double)*n);
  for(i=0; i<n; i++) id[i] = i;
  toc("id = 1:n");

  // use id to index x and set all entries to 1
  tic();
  for(i=0; i<n; i++) {  
    long iid = (long)id[i];
    if(iid>=0 && iid<n && (double)iid==id[i]){
      x[iid] = 1;
    } else break;
  }
  toc("x(id) = 1");
  return 0;
}
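
For reference, I build and run it along these lines (the file name and flags here are illustrative, not from the original post):

gcc -O2 -o bench bench.c
./bench 100000000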
angainor
    @WilliamMorris why not? The integers I talk about are small enough to be represented exactly by doubles. What could be wrong here? – angainor Nov 26 '12 at 12:23
  • @WilliamMorris Why wouldn't it? `(double)iid == id[i]` returns true iff `id[i]` has an integral value, reliably. – Daniel Fischer Nov 26 '12 at 12:23
  • Sometimes, converting an int of 1 to a double gives you 1.00000001. Don't take it for granted, that's a tough bug to track down until you step through line-by-line and question "why didn't that condition trip??" – rutgersmike Nov 26 '12 at 12:47
  • Try `floor (id[i]) == id[i]`; it might be faster since you save one conversion from integer to double. – JohnB Nov 26 '12 at 12:52
  • Re: representing integer with double; http://stackoverflow.com/questions/759201/representing-integers-in-doubles – antak Nov 26 '12 at 12:52
  • @RutgersMike No it doesn't. 1 is completely and exactly representable in a double so if it converts it to 1.0000001 something is badly wrong. – jcoder Nov 26 '12 at 12:58
  • @JohnB No go. Three times slower. And, I have to perform additional cast of `id[i]` to long to use it as index into `x`. – angainor Nov 26 '12 at 12:58
  • @angainor On my machine, the loop takes about 5.5 cycles per iteration, that doesn't look like it's memory bandwidth bounded. Removing the `(double)iid == id[i]` test reduces the running time by about a third, so it looks CPU bounded to me. – Daniel Fischer Nov 26 '12 at 13:53
  • @DanielFischer exactly! just removing this test makes it ~ memory bandwidth bounded. This is essentially what I was wondering: can I do that part better? I can not jump over the memory bandwidth limitation, but maybe I could optimize the cast/bound check. – angainor Nov 26 '12 at 14:25
  • I don't think so. Doing only the `(double)iid == id[i]` check reduces the time by about 15% compared to all three checks. You have the expensive conversion from `long` to `double`, together with the `double` to `long` conversion, and the loop control, that simply takes so much time that a fast memory bus can keep up with it. – Daniel Fischer Nov 26 '12 at 14:37
  • @DanielFischer I guess you are right. Memory access per iteration requires ~4 clock cycles. Assuming latencies are all overlapped, just the two double-long conversions use 4 cycles. On top of that comes the loop control and pointer arithmetic, which I guess sums up to ~5.5 cycles. So I would need to get rid of one conversion, which I cannot do. Alternatively, maybe I could use SSE/AVX to speed up the conversions? – angainor Nov 27 '12 at 09:21
  • Maybe, I have no idea. That's lower-level than I know of, I have to admit. – Daniel Fischer Nov 27 '12 at 09:30
  • A pleasure, it's interesting. I wish I knew more about that stuff, but I started too old, I'm learning slow now. – Daniel Fischer Nov 27 '12 at 10:15

2 Answers


EDIT: Disregard if you can't split the arrays!

I think it can be improved by taking advantage of a common caching principle: make data accesses close together in time or in location. With tight for-loops you can achieve a better cache hit rate by shaping your data structures like your loop. In this case you access two different arrays, usually at the same index in each. The machine loads a cache line from each array on every iteration of that loop. To get more use out of each load, create a structure that holds one element of each array, and build a single array of that struct:

struct my_arrays
{
    double x;
    double id; // kept as double, per the constraint in the question
};

struct my_arrays *arr = malloc(sizeof(struct my_arrays) * n);

Now, each time you load data into cache, you'll hit everything you load because the arrays are close together.
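
A sketch of how the timed loop might then look (my addition; like the rest of this answer, it assumes the two arrays can actually be merged and share one length):

for (i = 0; i < n; i++) {
  long iid = (long)arr[i].id;
  if (iid >= 0 && iid < n && (double)iid == arr[i].id) {
    arr[iid].x = 1; // the write is only sequential when id is
  } else break;
}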

EDIT: Since your intent is to check for an integer value, and you make the explicit assumption that the values are small enough to be represented exactly in a double, I think your comparison is fine.

My previous answer warned against comparing large doubles after implicit casting, referencing this: What is the most effective way for float and double comparison?

rutgersmike
  • The code is of course a simplified version. In general, `id` is not the same size as `x`. I can not store those two in the same data structure. I have updated my question. Also, is the hardware prefetching not able to prefetch two streams? That might be the case, I do not know. Finally, I do not compare doubles, I check that a double is in fact an integer. I do not see how comparing to epsilon helps me. – angainor Nov 26 '12 at 13:17
  • Ok, I didn't realize that was the intent of the check. If that's the case it won't help. For the cache thing, it's certainly HW dependent, but when I run out of things to optimize from the profiler, I turn to caching. If you can't split the arrays, though, it's moot. – rutgersmike Nov 26 '12 at 14:43

It might be worth examining the bit-level representation of the double type.

For example, the following code checks whether a double greater than 1 is an integral value less than 999:

#include <stdint.h>
#include <stdbool.h>

// Assumes a little-endian machine: y[1] holds the sign bit, the exponent
// and the top 20 fraction bits; y[0] holds the low 32 fraction bits.
bool check(double x)
{
    union
    {
        double d;
        uint32_t y[2];
    } u;
    u.d = x;
    bool answer;
    uint32_t exp = (u.y[1] >> 20) & 0x3ff;     // for x >= 2 this is the unbiased exponent minus 1
    uint32_t fraction1 = u.y[1] << (13 + exp); // upper bits of the fractional part
    uint32_t fraction2 = u.y[0];               // lower 32 bits of the fractional part
    if (fraction2 != 0 || fraction1 != 0)
        answer = false;                 // a fraction bit below the integer part is set
    else if (exp > 8)
        answer = false;                 // magnitude is 1024 or more
    else if (exp == 8)
        answer = (u.y[1] < 0x408f3800); // the high word of the representation of 999.0
    else
        answer = true;
    return answer;
}

This looks like a lot of code, but it might be vectorized easily (using e.g. SSE), and if your bound is a power of 2, the code might simplify further.
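
A quick sanity test of check() (my addition; the expected results follow from the logic above, on a little-endian machine):

#include <stdio.h>

int main(void)
{
    // integers in [2, 999) should pass; non-integers and values >= 999 should not
    printf("%d %d %d %d\n", check(2.0), check(998.0), check(2.5), check(1000.0));
    // expected output: 1 1 0 0
    return 0;
}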

anatolyg
  • `uint32_t fraction1 = y[1] << (13 + exp);` will cause undefined behaviour if `exp > 18`. Since exponents are stored biased (with a bias of `-1023` in IEEE754 double precision format), every `x` with an absolute value of at least `1/2^1004` causes UB. Even if you adjust for the bias, the shift will cause UB for `x` of large absolute value (or small, when the shift distance becomes negative). Endianness is another problem; the exponent could land in `y[0]`. – Daniel Fischer Nov 26 '12 at 19:48