0

i'm writing a graphics engine as an assignment for university and recently tried to optimize a part of my code, however the optimization seems to slow it down instead.

This specific part of code processes 2D Lindenmayer systems and turns them into a list of "line2D" objects that can be processed into an image by another part of the program.

In doing that it uses a sin and cos to calculate the coordinates of the next point, and because sin and cos are floating point operations i figured these would be time intensive, especially in more complex lindenmayer systems. so I created an object class "cossinlist" which imports the valus of cos and sin from a .txt file for every integer angle between 0 and 359 degrees (transformed to rad) to two maps called "coslist" and "sinlist" with the angle as key. That way I'd only have to do the actual flops when dealing with an angle that contains a decimal part.

Then i decided to measure the execution time with the optimization and without it (by commenting it out) on a relatively intensive system: with it the engine generated the image in 33.4016 seconds and without it only took 25.3686 seconds. this is a substantial difference, however not in the expected way. I did more tests and all of them gave similiar proportions of difference, so now im wondering... What causes this difference?

The function:

img::EasyImage LSystem2D(const unsigned int size, const ini::DoubleTuple & backgroundcolor, LParser::LSystem2D & System, const ini::DoubleTuple & color)
{
    CosSinList cossinlist;
    std::string string;
    Lines2D Lines;
    double origin = 0;
    Point2D currentpos(origin, origin);
    Point2D newpos(origin, origin);
    std::stack<Point2D> savedpositions;
    double currentangle = System.get_starting_angle();
    std::stack<double> savedangles;
    const img::Color linecolor(color.at(0)*255,color.at(1)*255,color.at(2)*255);
    const img::Color BGcolor(backgroundcolor.at(0)*255,backgroundcolor.at(1)*255,backgroundcolor.at(2)*255);
    string = ReplaceLsystem(System, (System.get_initiator()), (System.get_nr_iterations()));
    bool optimizedangle = false;
    if(System.get_angle() == rint(System.get_angle()) && (System.get_starting_angle() == rint(System.get_starting_angle()))
    {
        optimizedangle = true;
    }
    for(char& c : string)
    {
        if(currentangle > 359){currentangle -= 360;}
        if(currentangle < -359){currentangle += 360;}
        if(System.get_alphabet().count(c) != 0)
        {
            /*if(optimizedangle == true)
            {
                if(currentangle >= 0)
                {
                    newpos.X = currentpos.X+(cossinlist.coslist[currentangle]);
                    newpos.Y = currentpos.Y+(cossinlist.sinlist[currentangle]);
                }
                else
                {
                    newpos.X = currentpos.X+(cossinlist.coslist[360+currentangle]);
                    newpos.Y = currentpos.Y+(cossinlist.sinlist[360+currentangle]);
                }
            }
            else
            {*/
                newpos.X = currentpos.X+cos(currentangle*PI/180);
                newpos.Y = currentpos.Y+sin(currentangle*PI/180);
            //}
            if(System.draw(c))
            {
                Lines.push_back(Line2D(currentpos,newpos,linecolor));
                currentpos = newpos;
            }
            else
            {
                currentpos = newpos;
            }

        }
        else if(c=='-')
        {
            currentangle -= System.get_angle();
        }
        else if(c=='+')
        {
            currentangle += System.get_angle();
        }
        else if(c=='[')
        {
            savedpositions.push(currentpos);
            savedangles.push(currentangle);
        }
        else if(c==']')
        {
            currentpos = savedpositions.top();
            savedpositions.pop();
            currentangle = savedangles.top();
            savedangles.pop();

        }
    }
    return Drawlines2D(Lines, size, BGcolor);
}

The SinCosList class:

#include <fstream>
#include <iostream>
#include <map>
#include "CosSinList.h"
using namespace std;

CosSinList::CosSinList()
{
    string line;
    std::fstream cosstream("coslist.txt", std::ios_base::in);
    double a;
    double i = 0;
    while (cosstream >> a)
    {
        coslist[i] = a;
        i += 1;
    }
    std::fstream sinstream("sinlist.txt", std::ios_base::in);
    i = 0;
    while (sinstream >> a)
    {
        sinlist[i] = a;
        i += 1;
    }
};

CosSinList::~CosSinList(){};

The "optimization" is commented out in the same way i commented it out during the speed test, only the actual use of the object is commented out (the SinCosList is still being initialized and the boolean that checks if it can be used is also still being initialized)

horns
  • 1,843
  • 1
  • 19
  • 26
  • 1
    What compiler did you use, and what optimization settings did you give it, and did you warm up your caches, and did you run the benchmark more than once? – Dietrich Epp Mar 14 '15 at 11:52
  • 4
    You're making the rather naive assumption that floating point operations are slow, and that accessing more memory (putting more pressure on the CPU caches) would be faster than just asking the CPU to compute cos or sin. Your code isn't faster because it's asking the CPU to do something that's less efficient on a modern CPU. – jalf Mar 14 '15 at 11:59
  • @Ditrich Epp I used the g++ compiler with these flags: -c -fmessage-length=0 -std=c++11, i don't really know what you mean with warming up my caches and yes i ran it more than once. – Felix Neijzen Mar 14 '15 at 12:07
  • why can't we just use an array to do LUT? – user3528438 Mar 14 '15 at 12:09
  • This could be something to do with the pipeline. With the if statement instructions can't be pipelined so easily. (There's a good explanation on SO for this but I don't remember where.) – QuentinUK Mar 14 '15 at 12:25
  • See See http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array – QuentinUK Mar 14 '15 at 12:32
  • So @QuentinUK if i understand correctly, the added code for the optimization is causing the compiler to create a pipeline where it continually guesses wrong, and the decreased performance isnt reading the map versus flop, but rather the pipelining? That's very interesting, i'll try to rewrite the if statements. – Felix Neijzen Mar 14 '15 at 12:49
  • 1
    I'm going to hazard a guess that modern CPUs are pretty well optimized for doing cos and sin – M.M Mar 14 '15 at 12:59
  • 3
    @FelixNeijzen Your compiler flags do not include any optimization flags like `-O3`? – dyp Mar 14 '15 at 13:36
  • 1
    A `std::map` is typically implemented as a rb-tree, which requires several indirections to find a value. Additionally, a rb-tree requires about 3 words overhead for navigation per node, making it rather memory-inefficient to store floating point key-value pairs. Try using an array. – dyp Mar 14 '15 at 13:42
  • 5
    IMHO all questions about optimisation should be closed immediately if they are being compiled without optimisations (i.e. for GCC -O2 or similar). It's a complete waste of everyone's time to talk about performance of C++ code if you don't even enable function inlining, you're just measuring the abstraction penalty. – Jonathan Wakely Mar 14 '15 at 14:18
  • 1
    @FelixNeijzen: Check whether your platform offers a `sincos()` function for the simultaneous computation of sine and cosine of the same angle. This would likely be faster than separate calls to `sin()` and `cos()`. Since your angle contains the factor π, you may also want to check whether your platform offers any of the functions `sinpi()`, `cospi()`, `sincospi()` for the computation of the sine and cosine of πx. Note this is functionality beyond standard C++ so its use may not be indicated if portability is important. – njuffa Mar 14 '15 at 14:29
  • 1
    Optimization question mentions actual time +1. Provides enough code to reproduce -2 (what is your map type?) Provides code +1 is simplified code -2. Includes flags +1 optimizes -2. Look: almost every optimization C++ question here asks for the same things, and they are not here. – Yakk - Adam Nevraumont Mar 14 '15 at 16:43

1 Answers1

5

(I'm assuming coslist and sinlist are ordinary arrays or similar)

Some things:

  • You really should turn optimizations on

With optimizations off, you're measuring stuff that doesn't matter. Performance of unoptimized code correlates poorly with the performance once optimizations are turned on.

  • optimzedangle should be a compile-time constant.

The optimizer is likely to be able to simplify code if it knows that optimizedangle doesn't change throughout the run of this program. With this particular code snippet it can probably figure it out, but you shouldn't rely on that if you don't have to, and in general it's very easy to accidentally write code where you think it's obvious that a variable remains constant, but the compiler is smarter than you and realizes you've opened a loophole that could potentially allow the variable to change, and so it has to write slower loop to account for that.

  • Branching can be bad

Branching in an inner loop -- especially unpredictable branching -- can kill performance. Try to write your loop so there aren't any branches; e.g. ensure that currentangle is always positive, or maybe make the lookup table 720 entries long so you can always just index 360 + currentangle.

  • Floating point <--> integer conversions can be slow

I tend to avoid these and consequently I've never been good at predicting when it really is an issue, but it's possible that this is what's really killing you.

  • Your table is consuming cache

You don't post your data structure, but I'm imagining around 6k bytes. That's a nontrivial percentage of your L1 cache. It's nonobvious to me whether that's an important effect in this loop.