
I used _rdtsc() to time atoi() and atof(), and I noticed they were taking surprisingly long. I therefore wrote my own versions of these functions, which were much quicker from the first call.

I am using Windows 7 with the VS2012 IDE, but with the Intel C/C++ compiler v13. I have /O3 enabled and also /Ot ("favour fast code"). My CPU is an Ivy Bridge (mobile).

Upon further investigation, it seemed that the more times atoi() and atof() were called, the quicker they executed?? I am talking orders of magnitude faster:

When I call atoi() from outside my loop, just the once, it takes 5,892 CPU cycles, but after thousands of iterations this reduces to 300-600 CPU cycles (quite a large execution-time range).

atof() initially takes 20,000 to 30,000 CPU cycles and then, after a few thousand iterations, it was taking 18-28 CPU cycles (which is the speed at which my custom function runs the first time it is called).

Could someone please explain this effect?

EDIT: I forgot to say: the basic setup of my program is a loop parsing bytes from a file. Inside the loop I use my atoi and atof alongside the standard ones, which is where I noticed the above. However, I also noticed that when I did this investigation before the loop, just calling atoi and atof twice along with my user-written equivalents twice, the loop itself seemed to execute faster. The loop processes 150,000 lines of data, each line requiring 3x atof() or atoi() calls. Once again, I cannot understand why calling these functions before my main loop affects the speed of a program that calls them 500,000 times?!

#include <ia32intrin.h>  // _rdtsc() with the Intel compiler
#include <cstdlib>       // atoi, atof
#include <iostream>
using namespace std;

int main(){

    //call myatoi() and time it
    //call atoi() and time it
    //call myatoi() and time it
    //call atoi() and time it

    const char* bytes2 = "45632";
    __int64 start2 = _rdtsc();
    unsigned int a2 = atoi(bytes2);
    __int64 finish2 = _rdtsc();
    cout << (finish2 - start2) << " CPU cycles for atoi()" << endl;

    //call myatof() and time it
    //call atof() and time it
    //call myatof() and time it
    //call atof() and time it

    //Placeholder declarations (elided in the real program):
    char* bytes = nullptr;   // the file contents
    size_t i = 0, file_size = 0;
    char message_type = 0;
    short offset_to_price = 0, offset_to_qty = 0;
    double price = 0.0;
    int qty = 0;

    //Iterate through 150,000 lines, each line about 25 characters.
    //The below executes slower if the above debugging is NOT done.
    while(i < file_size){
        //Loop through my data, call atoi() or atof() 1 or 2 times per line
        switch(bytes[i]){
            case ' ':
                //I have an array of shorts which records the distance from the
                //beginning of the line to each of the tokens in the line. In the
                //below switch statement offset_to_price and offset_to_qty refer
                //to this array.
                break;

            case '\n':
                switch(message_type){
                    case 'A': {
                        //Get the price token using atof()
                        char* temp = bytes + offset_to_price;
                        __int64 start = _rdtsc();
                        price = atof(temp);
                        __int64 finish = _rdtsc();
                        cout << (finish - start) << " CPU cycles" << endl;
                        //Other processing with the tokens
                        break;
                    }
                    case 'R': {
                        //Get the quantity token using atoi() as above
                        char* temp = bytes + offset_to_qty;
                        __int64 start = _rdtsc();
                        qty = atoi(temp);
                        __int64 finish = _rdtsc();
                        cout << (finish - start) << " CPU cycles" << endl;
                        //Other processing with the tokens
                        break;
                    }
                }
                break;
        }
        ++i;
    }
}

The lines in the file are like this (with no blank lines in between):

34605792 R dacb 100
34605794 A racb S 44.17 100
34605797 R kacb 100
34605799 A sacb S 44.18 100
34605800 R nacb 100
34605800 A tacb B 44.16 100
34605801 R gacb 100

I am using atoi() on the quantity (the final token) in both the 'R' and 'A' messages, and atof() on the price (the second-to-last token) in the 'A' messages.

  • This is interesting, but would be even more so if you could provide sample code that demonstrates this. – SirDarius Oct 12 '13 at 13:13
  • @SirDarius the code is rather extensive. I have just started to add a template so that you guys can visually see what I am describing, will try to add some more. – user997112 Oct 12 '13 at 13:15
  • Also if this is related to the standard library implementation, could you please provide the compiler and the platform where this happens? – SirDarius Oct 12 '13 at 13:17
  • @SirDarius added details in bold at the top – user997112 Oct 12 '13 at 13:20
  • Maybe they're being optimized as pure functions? GCC allows you to mark functions as [pure](http://gcc.gnu.org/onlinedocs/gcc/Function-Attributes.html) for optimization purposes for example. – tangrs Oct 12 '13 at 13:21
  • Maybe related to something like this: http://stackoverflow.com/questions/16533572/does-gcc-use-intels-sse-4-2-instructions-for-text-processing-if-available ? If this is the intel compiler here, there might be specific optimisations in recent CPU architectures... What is your CPU? – SirDarius Oct 12 '13 at 13:25
  • @SirDarius I am using an Ivy Bridge mobile CPU. – user997112 Oct 12 '13 at 13:28
  • Uhm, in general executing the same piece of code many times tends to speed it up, since its code gets stored in the cache, the branch predictor learns to predict its branches correctly, and so on, but I don't know if this could account for such a speedup. – Matteo Italia Oct 12 '13 at 13:30
  • The first time you call it, it won't be in the code cache (or any other cache) yet. But the difference looks bigger than that. – harold Oct 12 '13 at 13:31
  • Does the timing include the file reads? – doctorlove Oct 12 '13 at 13:33
  • You haven't shown us how you're doing the measurements in the loop, and what you're using for a "control", so it's hard to say if what you're seeing is real. I have difficulty believing that there could be any advantage to caching within atoi itself. – Hot Licks Oct 12 '13 at 13:33
  • @everyone I have tried to add some more code to explain what I am doing. – user997112 Oct 12 '13 at 13:42
  • @doctorlove No I am literally timing either side of the atof/atoi call (see above example code). – user997112 Oct 12 '13 at 13:42
  • If your implementation of `atoi` or `atof` is much faster than the standard library versions (which have been around for decades and intensively used and refined), chances are that yours are doing less work than the standard library versions; often this kind of thing comes from not handling edge cases correctly. – Pete Becker Oct 12 '13 at 13:43
  • @PeteBecker yes, mine are customised to my scenario; I am fine with that. But what I don't get is why the execution speed of atoi/atof changes, and why the range of execution times varies so wildly once it has "optimized". The data is very consistent (same number of decimal places etc.), so surely after a few thousand lines atoi() should not be executing in anywhere from 300 to 600 cycles (atof does seem to stabilise around 28 cycles). – user997112 Oct 12 '13 at 13:44
  • The short answer is: Of course they don't. You just wrote a flawed microbenchmark, and like thousands of programmers before you who also didn't appreciate the subtleties of microbenchmarking, got completely the wrong idea. – Boann Oct 12 '13 at 14:17
  • @Boann yeah, of course: 30,000 CPU cycles vs 28 CPU cycles is just a flawed micro-benchmarking subtlety... – user997112 Oct 12 '13 at 18:25

4 Answers


I'm guessing the reason why you see such a drastic improvement for atoi and atof, but not for your own, simpler function, is that the former have a large number of branches in order to handle all the edge cases. The first few times, this leads to a large number of incorrect branch predictions, which are costly. But after a few times, the predictions get more accurate. A correctly predicted branch is almost free, which would then make them competitive with your simpler version which doesn't include the branches to begin with.

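For illustration, here is a minimal sketch of what such a special-case converter might look like (hypothetical, since the custom function was not posted): it accepts only unsigned decimal digits, so the loop condition is essentially the only branch, and it quickly becomes perfectly predicted.

// Hypothetical sketch of a special-case atoi: unsigned decimal digits only,
// no whitespace skipping, no sign handling, no overflow or error checks.
unsigned int myatoi(const char* s){
    unsigned int n = 0;
    while(*s >= '0' && *s <= '9'){   // the single, easily predicted branch
        n = n * 10 + (unsigned int)(*s - '0');
        ++s;
    }
    return n;
}
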
Caching is surely also important, but I don't think it explains why your own function was fast from the beginning and did not see any relevant improvement from repeated execution (if I understand you correctly).

amaurea
    In a sense, branch prediction is a form of caching. (It is caching behavior rather than caching data.) – Raymond Chen Oct 12 '13 at 14:37
  • Yes, although it's far more efficient than simple caching as it inherently relies on program context and history. The same branch can have multiple stored predictions based on that. Oh, and cache replacement policy is also a form of behavioral caching :) – Leeor Oct 12 '13 at 14:53

Using RDTSC for profiling is dangerous. From the Intel processor manual:

The RDTSC instruction is not a serializing instruction. It does not necessarily wait until all previous instructions have been executed before reading the counter. Similarly, subsequent instructions may begin execution before the read operation is performed. If software requires RDTSC to be executed only after all previous instructions have completed locally, it can either use RDTSCP (if the processor supports that instruction) or execute the sequence LFENCE;RDTSC.

With the inevitable Heisenberg effect that causes, you'll now measure the cost of RDTSCP or LFENCE. Consider measuring a loop instead.
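
If you do want per-call timestamps anyway, here is a sketch of the fenced sequence using common MSVC/ICC intrinsics (_mm_lfence from emmintrin.h, __rdtsc/__rdtscp from intrin.h; whether these exact headers apply to your toolchain is an assumption):

#include <emmintrin.h>  // _mm_lfence
#include <intrin.h>     // __rdtsc, __rdtscp

// LFENCE;RDTSC at the start keeps the counter read from executing before
// earlier instructions have finished; RDTSCP;LFENCE at the end keeps later
// instructions from starting before the read completes.
unsigned __int64 tsc_begin(){
    _mm_lfence();
    return __rdtsc();
}

unsigned __int64 tsc_end(){
    unsigned int aux;                     // receives IA32_TSC_AUX (ignored here)
    unsigned __int64 t = __rdtscp(&aux);
    _mm_lfence();
    return t;
}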

Hans Passant
  • Interesting that an lfence is considered enough to mitigate that; it's not really serializing execution, only loads (and even that only applies to observability, not actual memory latency). But measuring multiple instances should indeed be enough to handle that. – Leeor Oct 12 '13 at 15:00

Measuring the performance of a single call like this isn't advisable. You'd get too much variance due to power throttling, interrupts and other OS/system interference, measurement overhead, and, as said above, cold/warm variance. On top of that, rdtsc is no longer considered a reliable measurement since your CPU may throttle its own frequency, but for the sake of this simple check we can say it's good enough.

You should run your code at least several thousand times, discard some portion at the beginning, and then divide to get the average. That would give you the "warm" performance, which would include (as mentioned in the comments above) the hit latency of the close caches for both code and data (and also the TLBs), good branch prediction, and might also negate some of the external impacts (such as your CPU having only recently woken up from a powerdown state).
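
A minimal sketch of that approach (the warm-up and iteration counts are arbitrary placeholders):

#include <intrin.h>   // __rdtsc (MSVC/ICC)
#include <cstdlib>    // atoi
#include <iostream>

int main(){
    const int WARMUP = 1000, ITERS = 100000;  // placeholder counts
    unsigned __int64 total = 0;
    for(int n = 0; n < WARMUP + ITERS; ++n){
        unsigned __int64 t0 = __rdtsc();
        volatile int v = std::atoi("45632");  // volatile keeps the call from being optimized away
        (void)v;
        unsigned __int64 t1 = __rdtsc();
        if(n >= WARMUP)
            total += t1 - t0;                 // discard the cold iterations
    }
    std::cout << total / ITERS << " cycles per call (warm average)" << std::endl;
}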

Of course, you may argue that this performance is too optimistic because in real scenarios you won't always hit the L1 cache, etc. It may still be fine for comparing two different methods (such as competing against the library ato* functions); just don't count on the results to reflect real life. You can also make the test slightly harder and call the function with a more elaborate pattern of inputs that stresses the caches a bit better.

As for your question regarding the 20k-30k cycles: that's exactly why you should discard the first few iterations. This isn't just cache-miss latency; you're actually waiting for the first instructions to do a code fetch, which may also wait for the code-page translation to do a page walk (a long process that may involve multiple memory accesses) and, if you're really unlucky, for a page to be swapped in from disk, which requires OS assistance and lots of IO latency. And all of this is before you've started executing the first instruction.

Leeor
  • But what is strange is that my versions of atoi() and atof() are fast from the very first execution and their execution time is constant (it's always 21 clock cycles for atoi and 28 clock cycles for atof). – user997112 Oct 12 '13 at 14:45
  • It's hard to tell without debugging this scenario (using vtune or a simulator), but quite probably some other instruction (that's not being measured) "paid" the starting penalty for them. Either way - it's only the steady state that can tell you anything meaningful. – Leeor Oct 12 '13 at 14:55

The most likely explanation is that because you are calling atoi/atof so often, it is being identified as a hot spot and thus kept in the Level 1 or Level 2 processor code cache. The CPU's replacement policy (the microcode that determines which cache lines can be purged when a cache miss occurs) would tag such a hot spot to be kept in cache. There's a decent write-up of CPU caching technologies on Wikipedia, if you are interested.

Your initial calls were slow because your code wasn't in the CPU's most performant cache yet, but once invoked some number of times, it was.

Jeff Paquette
  • Cache replacement isn't done by microcode, it's a pure HW mechanism. It also doesn't have to be identified explicitly; it's just promoted on every use so it never gets evicted. – Leeor Oct 12 '13 at 13:43