Confused by profiler performance results 32-bit vs 64-bit

Question

I have an application that calls a DLL, which in turn may make calls to another DLL.

My problem is performance degradation when these binaries are 64-bit vs. 32-bit.

I have been profiling (AQtime v8.24) using Elapsed Time and CPU Cache Misses counters, and I do not understand the results in a way that helps me know what to do.

So I wrote a test .exe that calls a test DLL, simplifying the code. Initially, the performance degradation existed for these tools (64-bit versions were four times slower than 32-bit), and the CPU Cache Misses test pointed to this routine:

const char* TSimple::get_schema_name( const int schema_number )
{
    char* t_ptr = 0;
    int t_idx;
    for (t_idx = 0; t_idx < 153; t_idx++)
    {
        // THIS ASSIGNMENT IS WHAT WAS SHOWN TO BE A PROBLEM
        bogus_SchemaDef t_def = schema_list[t_idx];
        if (t_def.SchemaNumber == schema_number)
        {
            return (const char*)schema_list[t_idx].SchemaName;
            break;
        }
    }

    return t_ptr;
}

// THIS IS THE bogus_SchemaDef struct:
typedef struct
{
    int SchemaNumber;
    char SchemaName[100];
    char SiteList[100];
} bogus_SchemaDef;

// THIS IS THE schema_list ARRAY (portion):
static bogus_SchemaDef schema_list[] = {
{ 1, "LipUpper", "C000;C003" },
{ 153, "IllDefinedOther", "C420-C424;C760-C765;C767-C768;C770-C775;C778-C779;C809" }
};

So I changed the code to this (eliminated the assignment to an instance of the struct):

const char* TSimple::get_schema_name( const int schema_number )
{
    char* t_ptr = 0;
    int t_idx;
    for (t_idx = 0; t_idx < 153; t_idx++)
    {
        //bogus_SchemaDef t_def = schema_list[t_idx];
        //if (t_def.SchemaNumber == schema_number)
        if (schema_list[t_idx].SchemaNumber == schema_number)
        {
            return (const char*)schema_list[t_idx].SchemaName;
            break;
        }
    }

    return t_ptr;
}

Re-ran the tests, and this time the 64-bit version was 36% faster than 32-bit. Great! Although I don't understand WHY this change made such a difference.

But according to AQtime, the 64-bit version still performs worse than the 32-bit version.

CPU Cache Misses/% Misses
32-bit: 25.79%
64-bit: 83.34%

Elapsed Time/% Time
32-bit: 10.99%
64-bit: 33.95%

I really need to understand what AQtime is telling me, because when I plug this revised test DLL into the environment where my app calls my DLL which then calls this DLL, the overall performance degrades by 30-40%.

I should note that when I test my app+DLL where I am not making the call into the second DLL, the 64-bit builds run as fast or faster than the 32-bit builds. Everything points to this call to any second DLL.

I am overwhelmed by chasing through documentation... confusing myself... and ultimately guessing at code changes that may or may not make any difference.

Hoping for guidance.

I assume that `schema_list` actually contains 153 entries, and not just the two that you show? — Some programmer dude, Sep 01 '15 at 15:18
And remember that when you do the assignment, you *copy* the structure, and of course copying a little over 200 bytes is going to take a few processor cycles. — Some programmer dude, Sep 01 '15 at 15:19
You can't really compare tiny apps. The differences could be completely inconsequential. — Mike Dunlavey, Sep 01 '15 at 16:05
@JoachimPileborg: Yes, there are actually 153 entries in schema_list. Regarding the advice about copying the structure, I get that, but why isn't the problem represented in the 32-bit build as well? — Kathleen, Sep 01 '15 at 16:25
@MikeDunlavey: If testing/comparing with tiny apps is inappropriate, how should I proceed? — Kathleen, Sep 01 '15 at 16:29
@Kathleen: You could try making programs that take more time, from seconds to minutes, and that test different things, like compute-bound vs. I/O-bound, or whatever you want to test. Frankly, I've found the difference between 64 and 32 bit code is inconsequential in performance. What 64 bit gets you is random access to files bigger than 4g (or monster memory). While a faster CPU should give you faster software (if it is CPU-bound), performance tuning is mainly a post-hoc activity, as in [*this example*](http://stackoverflow.com/a/927773/23771). — Mike Dunlavey, Sep 01 '15 at 16:47
Those cache-miss rates are *far* too high. Code like this can only affect program perf when it is executed many times. But it looks like you only profiled a single time. The first one, when nothing is in the cache yet. Run it a thousand times instead. That's not a good indicator either, the cache-miss rate will be too low. But at least it is closer to the truth. — Hans Passant, Sep 01 '15 at 16:49
@MikeDunlavey: I do have a much larger example, but simplified it so I could use it to post narrow questions. Bad idea in this case, I guess. Thanks for the link; I'll study it and keep digging. — Kathleen, Sep 01 '15 at 17:54
@HansPassant: The Hit Count for the function call and results shown in the example is 10,000. I recognize that I'm not adept at profiling (although I've learned a lot using the Elapsed Time counter on the 32-bit build), and I'm a C++ programmer who never learned assembly, so the disassembler page of the profiler is beyond my comprehension. But that's my problem. Thanks for the tips. — Kathleen, Sep 01 '15 at 18:00
The link @MikeDunlavey posted gave me some very good ideas. I'm going to rework a bunch of code and see if I can start shaving off some time. If I figure out anything definitive, I'll come back and update this thread. Thanks, everybody! — Kathleen, Sep 01 '15 at 19:01
What type of cache misses are you counting? Is it instruction cache or data cache misses? It's possible that 64-bit code should miss more into the instruction cache, especially when running your code only a single time. If your entries list has only 153 elements they should easily fit in a 32KB L1 data cache. — VAndrei, Sep 01 '15 at 20:08
@Kathleen: Be careful how you use that technique. The correct sequence is 1) measure the current speed (like with a stopwatch), 2) get samples, 3) revise the code, 4) measure the new speed. What people often do is mix up steps 2 and 3, and then it just doesn't work. I call it "ready-fire-aim" :) — Mike Dunlavey, Sep 01 '15 at 21:08
@MikeDunlavey: Both my real app and the test app wrap an elapsed-time stopwatch around the process. This is a business-rules engine evaluating multi-million record patient cases, and users get impatient. I used AQtime, lots of books and articles about effective use of the STL, and time-sample-revise-repeat to get this code to beat a benchmark for the 32-bit build. What I still wonder, and don't know how to find out, is why compiling my code for 64-bit would make an almost consistent 40% difference. Still digging/exploring. — Kathleen, Sep 01 '15 at 23:34
@Kathleen: Just try the technique I said. It costs nothing. [*Here's a short video of it.*](https://www.youtube.com/watch?v=xPg3sRpdW1U) Books and articles will not find your speedups, and I hate to say it, but even the best profilers easily miss speedups, because they have narrow presuppositions about what the problems consist of. The method also has a [*solid mathematical foundation.*](http://scicomp.stackexchange.com/a/2719/1262) — Mike Dunlavey, Sep 02 '15 at 01:00
@MikeDunlavey: All right, I can afford to burn up to a day trying the random pausing. — Kathleen, Sep 02 '15 at 13:25
@MikeDunlavey: Epic fail: My C++ Builder XE4 IDE always pauses in the CPU window at ntdll.RtlUserThreadStart, nothing shows in the call stack. I cannot figure out how to make my environment look like what is in your video. — Kathleen, Sep 02 '15 at 13:57
@Kathleen: Do you have multiple threads? You may be able to switch to the Worker thread or whatever it's called, and see the call stack there. There must be a way to pause/interrupt it to see what it's doing. What if you have an infinite loop? — Mike Dunlavey, Sep 02 '15 at 14:03
@MikeDunlavey: I'm debugging a DLL, but it is not multi-threaded. When I pause, the Event Log and Thread Status pages in IDE report threads in the calling app (.exe), but none in the DLL. — Kathleen, Sep 02 '15 at 16:12
@Kathleen: Do you have source code for the DLL, and did you build it in debug mode so the debugger can see inside it? I tend to assume you are in the role of a developer, which you sort of need to be, because performance problems are just a style of bug. i.e. can you plant a breakpoint in the DLL and look at variables and the call stack? You need to be able to do that. Then random pausing is just breaking into it at a random time, not a known place. — Mike Dunlavey, Sep 02 '15 at 16:29
@MikeDunlavey: Yes, I wrote the DLL, yes it is a debug build, yes I can set breakpoints and inspect everything. Seems I can't do this when pausing while debugging the app (.exe), either. (BTW are we supposed to take this discussion out of comments? I keep getting warnings from the stackoverflow software about moving this to chat.) — Kathleen, Sep 02 '15 at 17:58
@Kathleen: Sure we can go to chat. I just don't know how :) With your DLL, stick a `while(true);` in it, and then see if you can pause it. — Mike Dunlavey, Sep 02 '15 at 18:12
@MikeDunlavey: I don't know how to go to chat either. The while(true) addition to the DLL code didn't change the experience when hitting pause. I can navigate to that line of code in the IDE, and invoking run-to-cursor puts me into a mode where I can debug, but unfortunately nothing else (e.g., Step) will do that. I'm ready to drop this if you are. — Kathleen, Sep 02 '15 at 18:32
@Kathleen: just created a chat room if you're interested: "Chat w Kathleen & Mike Dunlavey about performance" — Mike Dunlavey, Sep 02 '15 at 18:39
@MikeDunlavey: I found the chat room, but it appears my stackoverflow reputation lacks sufficient creds to participate (requires 20, I have 7). Thanks for trying, Mike. I need to walk away from this for a while. — Kathleen, Sep 02 '15 at 19:19
