3

I'm working on a physics engine and feel it would help to have a better understanding of the speed and performance effects of performing many simple or complex math operations.

  1. A large part of a physics engine is weeding out unnecessary computations, but at what point are the computations small enough that comparative checks aren't necessary?

    • e.g.: Testing whether two line segments intersect. Should there be a check on whether they're near each other before going straight into the simple math (see the sketch after this list), or would the extra operation slow down the process in the long run?
  2. How much time do different mathematical calculations take?

    • e.g.: (3+8) vs (5x4) vs (log(8)), etc.
  3. How much time do inequality checks take?

    • e.g.: >, <, =
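For concreteness, here is roughly the kind of "simple math" I mean in point 1 - a minimal sketch using the standard cross-product orientation test; the names are just placeholders and the collinear-overlap corner case is ignored for brevity.

```cpp
struct Point { double x, y; };

// Cross product of (p2 - p1) and (p3 - p1):
// > 0 means a counter-clockwise turn, < 0 clockwise, 0 collinear.
double cross(const Point& p1, const Point& p2, const Point& p3) {
    return (p2.x - p1.x) * (p3.y - p1.y) - (p2.y - p1.y) * (p3.x - p1.x);
}

// True if segment ab properly crosses segment cd: the endpoints of each
// segment must lie on opposite sides of the other segment's line.
bool segmentsIntersect(const Point& a, const Point& b,
                       const Point& c, const Point& d) {
    double d1 = cross(c, d, a);
    double d2 = cross(c, d, b);
    double d3 = cross(a, b, c);
    double d4 = cross(a, b, d);
    return ((d1 > 0) != (d2 > 0)) && ((d3 > 0) != (d4 > 0));
}
```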
Griffin
    Profile, *then* optimize. In that order. – n0rd Dec 14 '11 at 08:19
  • It probably depends very much on problem complexity whether it makes sense to "weed out" unnecessary operations. For example, if you have only a few lines you want to check for intersection, YOUR CODING TIME will be much longer if you try to figure out which lines are close together. – arne Dec 14 '11 at 08:23
  • This depends entirely on the compiler and the target architecture. – Mankarse Dec 14 '11 at 08:27
  • Take a look at this related question: http://stackoverflow.com/questions/2702407/approximate-number-of-cpu-cycles-for-various-operations. – OSH Dec 14 '11 at 08:27

6 Answers

4
  1. You'll have to do profiling.

  2. Basic operations, like additions or multiplications, should take only one asm instruction.

    EDIT: As per the comments, although they take one asm instruction, multiplications can expand into micro-instructions.

    Logarithms take longer.

  3. Also one asm instruction.

Unless you profile your code, there's no way to tell where your bottlenecks are.

Unless you call math operations millions of times (and probably even if you do), a good choice of algorithms or some other high-level optimization will result in a bigger speed gain than optimizing the small stuff.

You should write code that is easy to read and easy to modify, and only if you're not satisfied with the performance should you start optimizing - first high-level, and only afterwards low-level.

You might also want to try dynamic programming or caching.
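To make the caching idea concrete, here is a rough memoization sketch; the function and names are hypothetical, and whether it pays off depends entirely on how often the same inputs recur versus the cost of the hash lookup.

```cpp
#include <cmath>
#include <unordered_map>

// Memoization sketch: cache the result of an expensive computation,
// keyed by its argument, so repeated calls with the same input skip
// the work. The lookup itself has a cost, so profile before and after.
double expensiveFunction(double x) {
    static std::unordered_map<double, double> cache;
    auto it = cache.find(x);
    if (it != cache.end())
        return it->second;            // cache hit
    double result = std::log(x);      // stand-in for a costly computation
    cache.emplace(x, result);
    return result;
}
```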

Luchian Grigore
  • Not all `asm` instructions take the same time (clock ticks). For example, + and * usually don't take the same time. For one operation you won't see the difference because it's too fast, but a million instructions of each may take a noticeably different time. – Roee Gavirel Dec 14 '11 at 08:25
  • Multiplications are very commonly expanded into additions and bit shifts. Therefore the one operation = one asm definitely doesn't apply. – Šimon Tóth Dec 14 '11 at 08:26
  • Actually intel has the imul operation, so it does only take one instruction. – Luchian Grigore Dec 14 '11 at 08:28
  • Multiplication can be an order of magnitude slower than addition. Division goes into two orders of magnitude (up to 100 times slower) – Sam Dec 14 '11 at 08:36
  • @vasile I wasn't talking about division. Also, how can it be an order of magnitude slower if it takes one asm instruction? – Luchian Grigore Dec 14 '11 at 08:37
  • Some instructions expand to microcode functions. Check `wiki microcode` for that. Also check http://www.agner.org/optimize/instruction_tables.pdf and look for IMUL. It takes 11 clocks, while add takes 1 clock. – Sam Dec 14 '11 at 08:41
  • I came here because I was curious about OpenGL; I imagine it needs to multiply every single pixel in a texture by all the transformations, and a 1000x1000 pixel texture has 1 million pixels, so at 60 fps it would need to do 60 * 1,000,000 vector multiplications per second. But I imagine that since the instructions are already cached, the only slow part of doing this is loading different sections of the bitmap (since I hear 64k is a lot of cache memory, and a 1 MB image won't fit). O.o CPUs are fast. – Dmytro Jan 05 '17 at 20:51
2

Well, this depends on your hardware. Very nice tables with instruction latencies can be found at http://www.agner.org/optimize/instruction_tables.pdf

1. It depends on the code a lot. Also don't forget it doesn't depend only on the computations, but also on how well the comparison results can be predicted.

2. Generally addition/subtraction is very fast, multiplication of floats is a bit slower. Float division is rather slow (if you need to divide by a constant c, it's often better to precompute 1/c and multiply by it - a small sketch of this follows below). The library functions are usually (I'd dare to say always) slower than simple operators, unless the compiler decides to use SSE. For example, sqrt() and 1/sqrt() can be computed using one SSE instruction.

3. From about one cycle to several dozen cycles. Current processors do branch prediction on conditions. If the prediction is right, it will be fast. However, if the prediction is wrong, the processor has to throw away all the preloaded instructions (IIRC Sandy Bridge preloads up to 30 instructions) and start processing new instructions.

That means if you have code where a condition is met most of the time, it will be fast. Similarly, if you have code where the condition is not met most of the time, it will be fast. Simple alternating conditions (TFTFTF…) are usually fast too.
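Regarding the divide-by-a-constant remark in point 2, a minimal sketch (the names are mine; it assumes the same divisor is reused many times):

```cpp
#include <vector>

// Dividing many values by the same constant: pay for one division up
// front and multiply by the reciprocal in the hot loop, since float
// multiplication is typically much cheaper than division. Note that
// the results can differ in the last bits.
void scaleAll(std::vector<float>& values, float c) {
    const float inv = 1.0f / c;
    for (float& v : values)
        v *= inv;
}
```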

stativ
2

As regards 2 and 3, I could refer you to the Intel® 64 and IA-32 Architectures Optimization Reference Manual. Appendix C presents the latencies and the throughput of various instructions. However, unless you hand-code assembly code, your compiler will apply its own optimizations, so using this information directly would be rather difficult.

More importantly, you could use SIMD to vectorize your code and run computations in parallel. Also, memory performance can be a bottleneck if your memory layout is not ideal. The document I linked to has chapters on both issues.
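As a rough illustration of the SIMD point (not taken from the manual; a sketch that assumes SSE is available and ignores any tail when the length isn't a multiple of 4):

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Add two float arrays four elements at a time with SSE.
// A real version would handle the remaining elements when n isn't a
// multiple of 4 and might care about alignment.
void addArrays(const float* a, const float* b, float* out, int n) {
    for (int i = 0; i + 3 < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
}
```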

However, as @Ph0en1x said, the first step would be choosing (or writing) an efficient algorithm, making it work for your problem. Only then should you start wondering about low-level optimizations.

As for 1, in the general case I'd say that if your algorithm has some adjustable thresholds for when to execute certain tests, you could do some profiling, print out a performance graph of some kind, and determine the optimal values for those thresholds.

Vlad
1
  1. This depends on the scenario you are trying to simulate. How many objects do you have and how close are they? Are they clustered or distributed evenly? Do your objects move around a lot, or are they static? You will have to run tests. Possible data structures for fast checking of proximity are kd-trees or locality-sensitive hashes (there may be others; a simpler grid-based alternative is sketched after this list). I am not sure whether these are appropriate for your application; you'd have to check whether the maintenance of the data structure and the lookup cost are OK for you.
  2. You will have to run tests. Consider checking whether you can use vectorization, or whether you can even run some of the computations on a GPU using CUDA or something like that.
  3. Same as above - you have to test.
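As a simpler alternative to the structures mentioned in point 1, here is a uniform-grid (spatial hash) sketch; the names are mine, and it assumes the objects are of roughly similar size.

```cpp
#include <cmath>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Uniform-grid broad phase: hash each object's position into a cell and
// only test objects that share a cell (neighbouring cells are omitted
// here for brevity). Works best when objects have similar sizes.
struct SpatialGrid {
    float cellSize;
    std::unordered_map<std::int64_t, std::vector<int>> cells;

    std::int64_t key(float x, float y) const {
        std::int64_t cx = static_cast<std::int64_t>(std::floor(x / cellSize));
        std::int64_t cy = static_cast<std::int64_t>(std::floor(y / cellSize));
        return cx * 73856093 + cy * 19349663;   // combine the two cell indices
    }

    void insert(int objectId, float x, float y) {
        cells[key(x, y)].push_back(objectId);
    }

    // Objects that fell into the same cell as (x, y), or nullptr if none.
    const std::vector<int>* candidates(float x, float y) const {
        auto it = cells.find(key(x, y));
        return it == cells.end() ? nullptr : &it->second;
    }
};
```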
Björn Pollex
1

You can generally consider inequality checks, increment, decrement, bit shifts, addition and subtraction to be really cheap. Multiplication and division are generally a little more expensive. Complex math operations like logarithms are much more expensive.

Benchmark on your platform to be sure. Be careful about benchmarking using artificial tests with tight loops -- that tends to give you misleading results. Try to benchmark in code that's as realistic as possible. Ideally, profile the actual code under realistic conditions.

As for the optimizations for things like line intersection, it depends on the data set. If you do a lot of checks and most of your lines are short, it may be worth a quick check to rule out cases where the X or Y ranges don't overlap.
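For the range-overlap idea, a minimal sketch (my own naming; the full intersection test would only run when this returns true):

```cpp
#include <algorithm>

// Cheap rejection: if the X ranges or the Y ranges of two segments
// don't overlap, the segments can't intersect, so the full test can be
// skipped. Worth it mainly when most pairs of segments are far apart.
bool rangesOverlap(float ax1, float ay1, float ax2, float ay2,
                   float bx1, float by1, float bx2, float by2) {
    return std::max(ax1, ax2) >= std::min(bx1, bx2) &&
           std::max(bx1, bx2) >= std::min(ax1, ax2) &&
           std::max(ay1, ay2) >= std::min(by1, by2) &&
           std::max(by1, by2) >= std::min(ay1, ay2);
}
```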

David Schwartz
0

As far as I know, all "inequality checks" take the same time.
Regarding the rest of the calculations, I would advise you to run some tests like:

  1. Take time stamp A.
  2. Make 1,000,000 "+" calculations (or any other operation).
  3. Take time stamp B.
  4. Calculate the difference between A and B.

Then you can compare the calculations.
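A minimal sketch of those steps using std::chrono (the volatile accumulator only keeps the compiler from optimizing the loop away; the exact numbers will vary a lot with compiler flags and hardware):

```cpp
#include <chrono>
#include <iostream>

int main() {
    volatile double sink = 0.0;                    // keeps the loop from being removed
    auto a = std::chrono::steady_clock::now();     // 1. time stamp A
    for (int i = 0; i < 1000000; ++i)
        sink = sink + 1.5;                         // 2. the operation under test
    auto b = std::chrono::steady_clock::now();     // 3. time stamp B
    auto diff = std::chrono::duration_cast<std::chrono::microseconds>(b - a);
    std::cout << "1,000,000 additions took " << diff.count() << " us\n";  // 4. the diff
}
```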

Keep in mind:

  1. Using a different math library may change the results (some math libraries are more performance-oriented and some more precision-oriented).
  2. Compiler optimizations may change it.
  3. Each processor does it differently.
Roee Gavirel