Why does this C++ function takes 4 time as much as a c function

Question

I am considering using c++ for a performance critical application. I thought both C and C++ will have comparable running times. However i see that the c++ function takes >4 times to run that the comparable C snippet.

When i did a disassemble i saw the end(), ++, != were all implemented as function calls. Is it possible to make them (at least some of them) inline?

Here is the C++ code:

typedef struct pfx_s {
    unsigned int start;
    unsigned int end;
    unsigned int count;
} pfx_t;

typedef std::list<pfx_t *> pfx_list_t;

int
eval_one_pkt (pfx_list_t *cfg, unsigned int ip_addr)
{
    const_list_iter_t iter;

    for (iter = cfg->begin(); iter != cfg->end(); iter++) {
        if (((*iter)->start <= ip_addr) &&
            ((*iter)->end >= ip_addr)) {
            (*iter)->count++;
            return 1;
        }
    }
    return 0;
}

And this is the equivalent C code:

int
eval_one_pkt (cfg_t *cfg, unsigned int ip_addr)
{
    pfx_t *pfx;

    TAILQ_FOREACH (pfx, &cfg->pfx_head, next) {
        if ((pfx->start <= ip_addr) &&
            (pfx->end >= ip_addr)) {
            pfx->count++;
            return 1;
        }
    }
    return 0;
}

Did you compile with optimizations on? It should inline them when you turn optimizations on. — Jesse Good, Jun 22 '13 at 21:48
C++ does that to allow for operator overloading. A good optimizer should be able to tell if you're not in fact overloading them and take out the calls. — Lee Daniel Crocker, Jun 22 '13 at 21:50
The stats that you report seem to imply that you timed debug version of the code, probably with debug version of the library. Such measurements make little or no sense at all. — AnT stands with Russia, Jun 22 '13 at 21:52
The only two areas I see that _might_ benefit from manual optimization is using a local variable to hold the end iterator instead of using `cfg->end()` and changing `iter++` to `++iter`. — Captain Obvlious, Jun 22 '13 at 21:52
actually that is correct. with optimized compiler -O3 on it is close to C (~12% slower than C). Should have thought of it — helloworld, Jun 22 '13 at 21:56
It is more than 12%. With C optimized version it took 21 secs whereas the C++ optimized version took 36 secs. The C structure uses the sys/queue.h macros. typedef struct pfx_s { unsigned int start; unsigned int end; unsigned int count; TAILQ_ENTRY(pfx_s) next; } pfx_t; — helloworld, Jun 22 '13 at 22:06
_"Is it possible to make them (at least some of them) inline?"_ Yes, but if you don't compile with optimisations enabled then they won't be inlined. **DON'T BENCHMARK UNOPTIMIZED CODE.** C++ lets you write at a high level, so if you don't optimise then there is extra overhead, but the beauty of the language is that overhead can often be optimised away entirely. If your application is performance critical you really need to optimise. Really. — Jonathan Wakely, Jun 22 '13 at 22:33
I just so hate these rigged comparisons claiming "equivalent" that are so obviously not. And even more hate performance-related entries with unoptimized builds -- too bad folks on meta considered it a minority problem — Balog Pal, Jun 22 '13 at 23:02
_"I am considering using c++ for a performance critical application."_ But you didn't optimise it, so you're either not qualified to write performance-critical code or must be trolling to wind up the [tag:c++] folks. Not constructive. — Jonathan Wakely, Jun 22 '13 at 23:21

score 7 · Accepted Answer · answered Jun 22 '13 at 21:55

7

It might be worth noting that the data structures you used are not entirely equivalent. Your C list is implemented as a list of immediate elements. Your C++ list is implemented as a list of pointers to actual elements. Why did you make your C++ list a list of pointers?

This alone will not, of course, cause four-fold difference in performance. However, it could affect the code's performance do to its worse memory locality.

I would guess that you timed debug version of your code, maybe even compiled with debug version of the library.

answered Jun 22 '13 at 21:55

AnT stands with Russia

312,472
42
525
765

3

"When i did a disassemble i saw the end(), ++, != were all implemented as function calls." - I think that counts even more. – Jun 22 '13 at 22:01
1

yes. with 1) compiler optimization turned on and 2) making the C list a pointer to the structure the performance is pretty close 34.5 for C vs 35.8 for C++ – helloworld Jun 22 '13 at 22:20
Actually, I wouldn't be surprised to see that the cache unfriendliness of a list of pointers caused four-fold differences in performance. Furthermore, given that pfx_s is only 3 integers long, it'd be also interesting to see what happens (for good or bad) with a substantially bigger struct; say, as big as a cache line, or more. – hmijail Aug 12 '15 at 13:30

Jerry Coffin · Answer 2 · 2013-06-22T22:17:18.460

Do you have a really good reason to use a list here at all? At first glance, it looks like a std::vector will be a better choice. You probably also don't want a container of pointers, just a container of objects.

You can also do the job quite a bit more neatly a standard algorithm:

typedef std::vector<pfx_t> pfx_list_t;

int
eval_one_pkt(pfx_list_t const &cfg, unsigned int ip_addr) {
    auto pos = std::find_if(cfg.begin(), cfg.end(),
        [ip_addr](pfx_t const &p) {
            return ip_addr >= p.begin && ip_addr <= p.end;
        });

    if (pos != cfg.end()) {
       ++(pos->count);
       return 1;
    }
    return 0;
}

If I were doing it, however, I'd probably turn that into generic algorithm instead:

template <class InIter>
int
eval_one_pkt(InIter b, InIter e, unsigned int ip_addr) {
    auto pos = std::find_if(b, e,
        [ip_addr](pfx_t const &p) {
            return ip_addr >= p.begin && ip_addr <= p.end;
        });

    if (pos != cfg.end()) {
       ++(pos->count);
       return 1;
    }
    return 0;
}

Though unrelated to C vs. C++, for a possible slight further optimization on the range check you might want to try something like this:

return ((unsigned)(ip_addr-p.begin) <= (p.end-p.begin));

With a modern compiler with optimization enabled, I'd expect the template to be expanded inline entirely at the point of use, so there probably wouldn't be any function calls involved at all.

Thanks for the answer. list is a better choice because pfx_ts can come and go. I used object pointer because object can be in multiple lists — helloworld, Jun 23 '13 at 00:17
@helloworld: [list still probably isn't a better choice](http://stackoverflow.com/a/9765090/179910). Fair enough about using a pointer. — Jerry Coffin, Jun 23 '13 at 00:32
thanks for the pointer on list vs vectors.yes i agree with your reasoning — helloworld, Jun 24 '13 at 18:42

score 4 · Answer 3 · answered Jun 22 '13 at 22:23

I copied your code and ran timings of 10,000 failed (thus complete) searches of 10,000 element lists:

Without optimization:

TAILQ_FOREACH 0.717s
std::list<pfx_t *> 2.397s
std::list<pfx_t> 1.98s

(Note that I put a next into pfx_t for TAILQ and used the same redundant structure with std::list)

You can see that lists of pointers is worse than lists of objects. Now with optimization:

TAILQ_FOREACH 0.467s
std::list<pfx_t *> 0.553s
std::list<pfx_t> 0.345s

So as everyone pointed out, optimization is the dominant term in a tight inner loop using collection types. Even the slowest variation is faster than the fastest unoptimized version. Perhaps more surprising is that the winner changes -- this is likely due to the compiler better recognizing optimization opportunities in the std code than in an OS-provided macro.

Thanks for your answer. lists of object doesn't quite work in cases where the object needs to be in more than one list. — helloworld, Jun 22 '13 at 23:49

Why does this C++ function takes 4 time as much as a c function

3 Answers3