
TL;DR: I forgot to enable compiler optimizations. With optimizations enabled, the performance is (nearly) identical.


Original post

When reading an integer from binary data, I noticed that memcpy is slower than a casting solution.

Version 1: reinterpret_cast, smelly due to potential alignment problems, but also faster (?):

int get_int_v1(const char * data) { return *reinterpret_cast<const int*>(data); }

Version 2: memcpy, correct and a little slower:

int get_int_v2(const char * data) { int result; memcpy(&result, data, sizeof(result)); return result; }

I have a benchmark on Ideone.

For future reference, the code is:

#include <cstdlib>
#include <cstdio>
#include <cstring>
#include <ctime>
#include <iostream>
#include <vector>
#include <sys/time.h>

double get_current_time()
{
    timeval tv;
    gettimeofday(&tv, NULL);
    return double (tv.tv_sec) + 0.000001 * tv.tv_usec;
}

int get_int_v1(const char * data) { return *reinterpret_cast<const int*>(data); }
int get_int_v2(const char * data) { int result; memcpy(&result, data, sizeof(result)); return result; }

const unsigned iterations = 200 * 1000 * 1000;

double test_v1(const char * c, unsigned & prevent_optimization)
{
    double start = get_current_time();
    for (unsigned i = 0; i != iterations; ++i)
    {
        prevent_optimization += get_int_v1(c);
    }
    return get_current_time() - start;
}

double test_v2(const char * c, unsigned & prevent_optimization)
{
    double start = get_current_time();
    for (unsigned i = 0; i != iterations; ++i)
    {
        prevent_optimization += get_int_v2(c);
    }
    return get_current_time() - start;
}

int main()
{
    srand(time(0));

    // Initialize data
    std::vector<int> numbers(1000);
    for (std::vector<int>::size_type i = 0; i != numbers.size(); ++i)
    {
        numbers[i] = i;
    }

    // Repeat benchmark 4 times.
    for (unsigned i = 0; i != 4; ++i)
    {
        unsigned p = 0;
        std::vector<int>::size_type index = rand() % numbers.size();
        const char * c = reinterpret_cast<const char *>(&numbers[index]);    
        std::cout << "v1: " << test_v1(c, p) << std::endl;
        std::cout << "v2: " << test_v2(c, p) << std::endl << std::endl;
    }
}

And the results are:

v1: 0.176457
v2: 0.557588

v1: 0.17654
v2: 0.220581

v1: 0.176826
v2: 0.22012

v1: 0.176131
v2: 0.220633

My questions are:

  • Is my benchmark correct?
  • If yes, then why is v2 (with memcpy) slower? Since both versions return a copy of the data, I think there should be no difference in performance.
  • How can I implement a solution that is correct and fast?


Update

I was being silly and forgot to consider that Ideone doesn't enable compiler optimizations. I also tweaked the code a little and came up with the following:

#include <algorithm>
#include <cstdlib>
#include <cstdio>
#include <cstring>
#include <ctime>
#include <iomanip> 
#include <iostream> 
#include <vector>
#include <sys/time.h>

double get_current_time()
{
    timeval tv;
    gettimeofday(&tv, NULL);
    return double (tv.tv_sec) + 0.000001 * tv.tv_usec;
}

struct test_cast
{
    int operator()(const char * data) const 
    {
        return *((int*)data);
    }
};

struct test_memcpy
{
    int operator()(const char * data) const 
    {
        int result;
        memcpy(&result, data, sizeof(result));
        return result;
    }
};

struct test_std_copy
{
    int operator()(const char * data) const 
    {
        int result;
        std::copy(data, data + sizeof(int), reinterpret_cast<char *>(&result));
        return result;
    }
};

enum
{
    iterations = 2000,
    container_size = 2000
};

std::vector<int> get_random_numbers()
{
    std::vector<int> numbers(container_size);
    for (std::vector<int>::size_type i = 0; i != numbers.size(); ++i)
    {
        numbers[i] = rand();
    }
    return numbers;
}

std::vector<int> get_random_indices()
{
    std::vector<int> numbers(container_size);
    for (std::vector<int>::size_type i = 0; i != numbers.size(); ++i)
    {
        numbers[i] = i;
    }
    std::random_shuffle(numbers.begin(), numbers.end());
    return numbers;
}

template<typename Function>
unsigned benchmark(const Function & f, unsigned & counter)
{
    std::vector<int> container = get_random_numbers();
    std::vector<int> indices = get_random_indices();
    double start = get_current_time();
    for (unsigned iter = 0; iter != iterations; ++iter)
    {
        for (unsigned i = 0; i != container.size(); ++i)
        {
            counter += f(reinterpret_cast<const char*>(&container[indices[i]]));
        }
    }
    return unsigned(0.5 + 1000.0 * (get_current_time() - start));
}

int main()
{
    srand(time(0));
    unsigned counter = 0;

    std::cout << "cast:      " << benchmark(test_cast(),     counter) << " ms" << std::endl;
    std::cout << "memcpy:    " << benchmark(test_memcpy(),   counter) << " ms" << std::endl;
    std::cout << "std::copy: " << benchmark(test_std_copy(), counter) << " ms" << std::endl;
    std::cout << "(counter:  " << counter << ")" << std::endl << std::endl;

}

The results are now nearly equal (except for std::copy, which is slower for some reason):

g++ -o test -O0 -Wall -Werror -Wextra -pedantic-errors main.cpp
cast:      56 ms
memcpy:    60 ms
std::copy: 290 ms
(counter:  2854155632)

g++ -o test -O1 -Wall -Werror -Wextra -pedantic-errors main.cpp
cast:      9 ms
memcpy:    14 ms
std::copy: 20 ms
(counter:  3524665968)

g++ -o test -O2 -Wall -Werror -Wextra -pedantic-errors main.cpp
cast:      4 ms
memcpy:    5 ms
std::copy: 20 ms
(counter:  2590914608)

g++ -o test -O3 -Wall -Werror -Wextra -pedantic-errors main.cpp
cast:      4 ms
memcpy:    5 ms
std::copy: 18 ms
(counter:  2590914608)
StackedCrooked

3 Answers


You need to look at the emitted code. Obviously the optimizer "should" be able to turn the memcpy into a single potentially-unaligned int-sized read into the return value, but if you see different times then, on x86, I reckon that means it hasn't.

On my machine, using gcc with -O2 I get 0.09 for all times. With -O3 I get 0 for all times (I haven't checked whether that's because the code runs faster than the timer's granularity, or because the optimizer has removed all your code).

So fairly likely, the answer is just that you haven't used the right compiler flags (or ideone hasn't).

On an architecture where a potentially-unaligned read requires different instructions from an aligned read, the reinterpret_cast could emit an aligned read while the memcpy might have to emit an unaligned read (depending on how the function is called -- in this case the data is in fact aligned, but I don't know under what conditions the compiler can prove that). In that case I would expect that the reinterpret_cast code could be faster than the memcpy, but of course it would be incorrect in the case where someone passes in an unaligned pointer.
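To make the hazard concrete, here is a minimal sketch (the function names and the deliberately misaligned pointer are mine, not from the question's code). On x86 both functions typically compile, with optimizations, to the same single load; on a strict-alignment target only the memcpy version is safe:

#include <cstring>

// Undefined behavior if p is not suitably aligned for int;
// can trap on strict-alignment hardware.
int load_cast(const char * p)
{
    return *reinterpret_cast<const int *>(p);
}

// Well-defined for any p pointing at sizeof(int) valid bytes;
// optimizing compilers typically fold this into a single load.
int load_memcpy(const char * p)
{
    int r;
    std::memcpy(&r, p, sizeof(r));
    return r;
}

int main()
{
    int storage[2] = { 0, 0 };
    // Deliberately misaligned for int:
    const char * p = reinterpret_cast<const char *>(storage) + 1;
    // load_cast(p) would be undefined behavior here; load_memcpy(p) is fine.
    return load_memcpy(p);
}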

Steve Jessop
  • I also get 0 times when using -O3. I also suspect that the compiler optimized *everything* away. I'll update my program soon, but things are currently a little frantic at work :) – StackedCrooked Oct 29 '12 at 09:47
  • @StackedCrooked: yes, your `prevent_optimization` is a valiant effort but once the function is inlined, the optimizer still has a chance to notice that it's unused and elide it. You could print its value at the end of the test. – Steve Jessop Oct 29 '12 at 09:48

Casting is a compile-time operation while memcpy() is a run-time operation. That's why casting has no impact on the running time.
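To illustrate the distinction (the function name is mine): the cast itself emits no instructions; only the dereference it enables turns into a run-time load.

int cast_then_load(const char * data)
{
    const int * p = reinterpret_cast<const int *>(data); // compile-time reinterpretation, no code emitted
    return *p;                                           // the actual memory read happens here, at run time
}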

SomeWittyUsername
  • Yeah, pretty much. Maybe a bit better explained: it is a compile-time-**only** operation (some compile-time things *do* have an effect on the running time, e.g. optimization). – Oct 29 '12 at 07:08
  • In both cases the actual copy is made at run-time. I don't see how the reinterpret_cast could benefit from compile-time optimization. – StackedCrooked Oct 29 '12 at 07:15
  • @StackedCrooked In your first func you copy only a single int value each time (at function return). In your second func you do it at function return and also via `memcpy` (+ overhead of the func call etc.) – SomeWittyUsername Oct 29 '12 at 07:29
  • This sort of answers the title of the question, but have you actually looked at the code of `get_int_v1` and `get_int_v2`? – Steve Jessop Oct 29 '12 at 09:50
  • @SteveJessop if you're talking about dereferencing the result of the cast, I actually thought of it, but since it's an rvalue the compiler almost surely optimizes it and prevents an unnecessary copy. For completeness, however, your remark should be included in the answer. – SomeWittyUsername Oct 29 '12 at 10:12
3

memcpy cannot copy to a register; it does a memory-to-memory copy. The reinterpret_cast in get_int_v1 can change the type of the pointer held in a register, and that doesn't even require a register-to-register copy.
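A sketch of what that means in unoptimized code (the commented lowering describes typical unoptimized x86 output, not a guarantee):

#include <cstring>

int via_cast(const char * data)
{
    // The pointer is reinterpreted in place, often in a register;
    // unoptimized output is roughly a single load from [data].
    return *reinterpret_cast<const int *>(data);
}

int via_memcpy(const char * data)
{
    int result;
    // Without optimization this is a genuine memory-to-memory copy:
    // the bytes at data go to the stack slot for result, and result
    // is then reloaded for the return value (possibly even through a
    // call to the library memcpy).
    std::memcpy(&result, data, sizeof(result));
    return result;
}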

MSalters
  • That's true, although I've seen compilers optimize out the memcpy entirely when it's a simple case – jcoder Oct 29 '12 at 08:37
  • GCC does reliably optimize memcpy fully for small fixed-size copies, especially when the size is `sizeof(int)`; e.g. it will reliably compile `memcpy(&my_int, src, sizeof(my_int));` to just a 4-byte load instruction from memory, on machines where load instructions don't have an alignment requirement. (Or if src is known to be aligned, it will do it for ISAs like MIPS or old ARM as well.) You definitely don't typically get a call to the library function, or a copy to stack memory and a reload from there. – Peter Cordes Feb 20 '20 at 04:01
  • True, compilers are getting much better in that regard. – MSalters Feb 20 '20 at 04:15