
I have double x and double y. I need to turn that into int boxnum, defined as the (floored) index of the square in which (x,y) falls in the WIDTH x HEIGHT grid of squares of size BOX_SIZE. Coordinates in excess of WIDTH*BOX_SIZE are wrapped back around; ditto for HEIGHT*BOX_SIZE.

I am currently using:

( (((int)(x))/BOX_SIZE)%WIDTH+ WIDTH*((((int)(y))/BOX_SIZE)%HEIGHT) )

This statement currently eats something like 20% of my execution time, and that gets even worse (on the order of 40-50%) if I make it completely safe for negative coordinates:

( (( ((int)(x)) /BOX_SIZE)%WIDTH+WIDTH)%WIDTH
    +WIDTH*(( (((int)(y)) /BOX_SIZE)%HEIGHT+HEIGHT)%HEIGHT) )

I'm actually considering completely converting the application to fixed-point, just to avoid this, so that I can bitmask out the part I want instead of having this horrible conversion.
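
Roughly, the idea would be something like this sketch, assuming 16.16 fixed-point coordinates, BOX_SIZE=1, and power-of-two WIDTH and HEIGHT (FRAC_BITS, xfix, and yfix are hypothetical names):

    #include <stdint.h>

    /* Hypothetical fixed-point layout: 16 integer bits, 16 fraction bits. */
    #define FRAC_BITS 16

    int32_t xfix, yfix;   /* coordinates pre-scaled by 1 << FRAC_BITS */

    /* Arithmetic right shift floors, and the mask wraps negative values,
     * on typical two's-complement targets. */
    int boxnum = ((xfix >> FRAC_BITS) & (WIDTH - 1))
               + WIDTH * ((yfix >> FRAC_BITS) & (HEIGHT - 1));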

Is there a better way of doing a double->int conversion of this kind? Would it be worth it to ensure that 0<x<WIDTH*BOX_SIZE and 0<y<HEIGHT*BOX_SIZE so I can drop the two remainder operations, as in the sketch below? (Doing so is difficult enough to not be worthwhile for the benchmark, unless it is likely to be a significant improvement.)
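
For reference, with those bounds guaranteed, the expression would presumably collapse to:

    boxnum = (int)(x) / BOX_SIZE + WIDTH * ((int)(y) / BOX_SIZE);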

EDIT: After appropriate chastisement in the comments, more details:

x and y are the coordinates of a set of (as many as 10^6) particles. I am using an algorithm which requires me to, at every time step, do some simple sums across all particles within a box. Thus, I loop across particles, calculate which box each particle is in, and then use that as the array index for adding to that box (roughly as in the sketch below). Particles often move far enough that their past locations are no indication of their future ones. They are also unordered, which means I can make no assumptions about this.
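
For concreteness, the inner loop looks roughly like this (a sketch; nparticles, px, py, and boxsum are placeholder names):

    for (int i = 0; i < nparticles; i++) {
        int boxnum = ( ((int)(px[i]) / BOX_SIZE) % WIDTH
                     + WIDTH * (((int)(py[i]) / BOX_SIZE) % HEIGHT) );
        boxsum[boxnum] += 1.0;   /* accumulate the per-particle quantity */
    }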

WIDTH, HEIGHT, and BOX_SIZE are technically free, as long as WIDTH and HEIGHT are even multiples of BOX_SIZE. In practice they are all specified at compile time, and are integers with BOX_SIZE=1. I have run everything from WIDTH=HEIGHT=4 to WIDTH=HEIGHT=512, and while I'm usually on a square power of 2 (because why not?), WIDTH=37; HEIGHT=193 should work without issues.

This calculation is unavoidably performed once per particle per timestep; in the current implementation it is performed twice. I tried caching the value to avoid recalculation, but the end benchmark performed worse, so I went back to calculating it twice.

A basic test run with 10 particles/box * 100 WIDTH * 100 HEIGHT * 10000 steps = 1 billion particle*timesteps runs in a shade over a minute.

These coordinates are of ordinary magnitude (roughly 1-1000), so I'm nowhere near any kind of bound on double.

zebediah49
  • What's the larger algorithm that this is a part of? How are x and y generated or specified? Do you have any control over the (presumably) constant values of BOX_SIZE, WIDTH, and HEIGHT? Note that converting to fixed point won't necessarily allow you to change this to a masking operation – jerry May 09 '13 at 20:31
  • I think this is a similar question as @jerry's first question, just worded slightly differently: Are you performing the expensive operations on every number handed to the function? Or do you only do the double remainder for x when x is negative etc? – n0741337 May 09 '13 at 20:42
  • Performance issues like this cannot be efficiently diagnosed in the dark. Relevant factors include whether WIDTH and HEIGHT are compile-time constants, their specific values, the frequency with which you do these calculations relative to other code, whether there are any patterns in your indexing (such as traversing columns or diagonals) that could be used to hoist calculations out of a loop, and the range of potential values of x and y relative to their type (Is there enough room to add a multiple of WIDTH so that the first `%` can be eliminated?). – Eric Postpischil May 09 '13 at 21:24
  • Sorry if my comment came across as chastising, that wasn't my intent. I'm guessing you're compiling with a high optimization level, so if you could save cycles by picking powers of two and shifting/masking, your compiler probably would have already done that. Just in case, though, have you tested out explicitly shifting and masking? – jerry May 10 '13 at 09:47
  • What do you mean by “… they… are integers with `BOX_SIZE=1`”? – Eric Postpischil May 10 '13 at 13:20
  • You may find [this](http://stackoverflow.com/questions/17035464/fast-method-to-round-double-to-32bit-int-explained) interesting –  Jun 11 '13 at 11:22

2 Answers


The problem with your code is that an (int) cast causes the floating-point unit's rounding mode to change from the IEEE 754 default, round to nearest, to the round-toward-zero ('truncate') behaviour that the C standard specifies for the conversion.

See the gcc docs here for more information on IEEE 754 rounding modes.

On a modern, deeply pipelined processor, the whole pipeline must be flushed when the rounding mode is changed, causing a massive slowdown as the pipeline is emptied on every (int) cast. When you are doing this in a loop, the slowdowns you are experiencing are typical.

There is a very interesting article on this issue by Erik de Castro Lopo (author of libsndfile and Secret Rabbit Code). In his audio conversion routines, floating-point rounding performance is critical, and he provides an interesting set of solutions to this problem using the POSIX lrintf() call, as well as some x86 assembly for non-POSIX platforms.

The article can be found here.

The short answer is to use the C99/POSIX lrintf() function, or use some inline assembly to perform the integer truncation without changing the floating-point rounding mode.
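
For example, a minimal sketch of the suggested replacement, using lrint() from <math.h> since the coordinates are doubles. Note that lrint() converts using the current rounding mode (round to nearest by default), not truncation, so floor() is applied first to keep the flooring semantics the question needs:

    #include <math.h>

    /* Convert without an (int) truncation cast.  floor() gives the
     * correct box for non-integral and negative x; lrint() then
     * converts in the current rounding mode.  Wrapping of negative
     * coordinates still needs the extra remainder from the question. */
    int ix = (int)lrint(floor(x / BOX_SIZE)) % WIDTH;
    int iy = (int)lrint(floor(y / BOX_SIZE)) % HEIGHT;
    int boxnum = ix + WIDTH * iy;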

  • WOW I did not know that caused so many issues; I just saw a huge penalty there when profiled with `callgrind`, which increased a lot when I switched to doing more modular arithmetic, so I assumed that was the problem. I'll try looking into this, thanks. – zebediah49 May 10 '13 at 02:45
  • This answer is incorrect. Processors typically have instructions for converting floating-point to integer without using the floating-point rounding mode, such as the IA-32 instruction CVTTSD2SI or the older FISTTP. – Eric Postpischil May 10 '13 at 03:01
  • I just discovered that Eric is correct. As cool as that was to read, I found that it was originally using `cvttsd2si`, and when I changed it to use `lrint()`, the produced asm used `cvtsd2si`. The difference appears to be how signed numbers work, with no real difference in performance. – zebediah49 May 10 '13 at 03:25
  • I wasn't aware of `cvttsd2si`. On reading the ASM docs it seems that @EricPostpischil is on the right track. I'd be very interested to see a SSCCE, and investigate this on my own machine. –  May 10 '13 at 10:46

Division and Remainder

An issue that has been hinted at in the comments is that division (and/or the remainder operation) can be expensive. It is not uncommon for division to take some tens of processor cycles, in comparison to a single cycle for addition and multiplication.

The easiest way to avoid this expense may be to make WIDTH and HEIGHT compile-time constants that are powers of two. This allows the compiler to change the remainder operations using % WIDTH or % HEIGHT to fast bit-mask operations. Similarly, if BOX_SIZE is a compile-time constant that is a power of two, it allows the compiler to change the division to a bit shift.
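
For illustration, a sketch of what the compiler can reduce the expression to when the constants are powers of two (BOX_SHIFT is a hypothetical name for log2(BOX_SIZE)):

    /* With WIDTH, HEIGHT, and BOX_SIZE as power-of-two compile-time
     * constants, division and remainder become shift and mask.  On
     * typical two's-complement targets the mask also wraps negative
     * values, though the truncation issue with the (int) cast noted
     * below still applies. */
    int boxnum = (((int)(x) >> BOX_SHIFT) & (WIDTH - 1))
               + WIDTH * (((int)(y) >> BOX_SHIFT) & (HEIGHT - 1));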

It is also the reason for my comment alluding to changing ((int) x / BOX_SIZE % WIDTH + WIDTH) % WIDTH to ((int) x / BOX_SIZE + Number) % WIDTH, where Number is some multiple of WIDTH such that the sum is guaranteed to be non-negative. This eliminates a remainder operation. (However, you were proposing this expression to handle negative coordinates, and it may have a flaw: (int) x / BOX_SIZE rounds the quotient toward zero, which may give the wrong box number for negative x. So you might need to fix this expression before we consider the optimization aspects.)
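
A sketch of that one-remainder variant, with the flooring fixed (OFFSET is a hypothetical compile-time constant: any multiple of WIDTH large enough to keep the sum non-negative for the coordinate range in use):

    #include <math.h>

    /* Hypothetical: 1024 * WIDTH covers the coordinate ranges described
     * in the question; any sufficiently large multiple of WIDTH works. */
    #define OFFSET (1024 * WIDTH)

    /* floor() rounds toward negative infinity, so negative x lands in
     * the correct box; one remainder then wraps into [0, WIDTH). */
    int ix = ((int)floor(x / BOX_SIZE) + OFFSET) % WIDTH;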

Other

Ordinarily, I would suspect cache effects and imprecise attribution of processor time to source code as reasons the index calculation appears to take 20% of execution time. The index calculations you show have no cache repercussions, as they do not access memory. However, compiled code often results in interleaved instructions: each source statement generates multiple instructions, and the instructions of different statements are interleaved for various reasons, rather than appearing as all the instructions for one statement, then all the instructions for another, and so on. This interleaving makes it difficult for performance-reporting software to indicate precisely where processor time is being consumed.

There are other effects that interfere with such measurement. Some measurement is performed by sampling: the processor is interrupted at intervals, and the value of the program counter at the time of the interrupt is recorded. This shows where the processor was waiting for something, but not what it was waiting for. For example, if the conversion of x to int was waiting for a floating-point unit to be available, but no unit was available because a previous instruction was performing an addition of completely unrelated data, then the fact that 20% of the samples appear to be in (int) x is misleading.

The fact that you are operating on a million particles is consistent with some data access causing cache thrashing and slowing performance. On the other hand, the fact that adding more remainder operations (to support negative coordinates) makes the index calculation appear to consume more time argues against a cache problem.

However, it would be unusual for these index calculations to consume a large portion of a program’s time unless the program is doing little other work.

It might help if you could show self-contained compilable code that demonstrates the issue.

Eric Postpischil