
I want to replace memcpy with my own optimized version to do some benchmarks. I don't want to modify each place in the code which calls memcpy (it's a large code base and I want to avoid lots of changes). So what I did is the following:

// in a "common" header file which is included everywhere
#ifdef SHOULD_OPTIMIZE
    #define memcpy my_on_steroids_memcpy
#endif

The above works and replaces memcpy with my own implementation, but it seems crude, forced and not safe at all. Is there any other alternative so that I could replace the library memcpy without modifying the rest of the code? Should I forget about the above, as it does not seem an advisable thing to do, and just modify all the other files (and why)?

celavek
  • On Linux you can use [LD_PRELOAD](http://stackoverflow.com/questions/426230/what-is-the-ld-preload-trick). Out of curiosity, what does your optimized version of `memcpy` do? – Kerrek SB Sep 04 '11 at 19:23
  • You did a better memcpy? Wow! If it's "equivalent" to `memcpy` (same prototype, same behaviour) I don't see any problem in using a `#define`. MS does the same for the debug version of `malloc` and `free` – xanatos Sep 04 '11 at 19:24
  • 2
    @Kerreb But often `memcpy` is inlined. He would need to disable inlining of functions. – xanatos Sep 04 '11 at 19:25
  • 2
    With GCC `-fno-builtin-memcpy` can be used to control that behaviour. – user786653 Sep 04 '11 at 19:31
  • 2
    That might also invalidate the benchmark, at least if you care about real-world results. I would personally let gcc keep inlining `memcpy` instances it thinks it can inline (short, fixed size/alignment, etc.) since there's absolutely no way you can beat those with a custom implementation. Tweaking `memcpy` only benefits large copies. – R.. GitHub STOP HELPING ICE Sep 04 '11 at 19:38
  • 6
    It might also make sense, rather than trying to replace `memcpy` in general, to figure out the top 1-5 most costly `memcpy` invocations in your program and just replace those. You could then apply additional constraints (like known alignment values) that would allow your custom `memcpy` to be even faster, without having to worry about how it affects other parts of the program or libraries, and it would be trivial to switch a few places to calling it manually. – R.. GitHub STOP HELPING ICE Sep 04 '11 at 19:41
  • @Kerrek SB it uses the SIMD coprocessor. – celavek Sep 04 '11 at 21:25
  • @R.. very sensible advice. The processing I'm doing is very much memory bound - lots of copying, moving around and computations applied on pretty large buffers. I'm trying to see if a global replace with a faster memcpy implementation (if it turns out to be faster) would bring a global (even if only 5% or 10%) increase in performance (I'm acting on a hunch and gambling a bit, but I think it's worth the try). – celavek Sep 04 '11 at 21:34
  • I think you could improve performance by *a lot* more if you instead spent your time figuring out where it's actually necessary to copy data and where you might be able to replace copying by references to existing copies of the data. As an extreme example, the performance difference between MPlayer and gstreamer is huge, and the main reason is that MPlayer takes care to directly share buffers wherever possible, while gstreamer's design is full of unnecessary copies and conversions... – R.. GitHub STOP HELPING ICE Sep 05 '11 at 00:33

5 Answers


Some compilers have a way to force-include a header from the command line. For example, g++ and gcc accept an `-include` option.
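
For illustration, a minimal sketch of how that could look (the header name and build flags below are just an example, not something the question prescribes). The macro from the question moves into its own header, which the build system injects into every translation unit, so no source file has to change:

// optimized_memcpy.h - hypothetical header, force-included via e.g.
//   g++ -DSHOULD_OPTIMIZE -include optimized_memcpy.h -c somefile.cpp
#ifndef OPTIMIZED_MEMCPY_H_
#define OPTIMIZED_MEMCPY_H_

#ifdef SHOULD_OPTIMIZE
    // Every later occurrence of "memcpy" in the translation unit - the
    // declaration pulled in from <cstring> as well as all call sites - is
    // renamed to the custom function, which is defined in its own source file.
    #define memcpy my_on_steroids_memcpy
#endif

#endif /* OPTIMIZED_MEMCPY_H_ */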

However, I'd make sure your code at least compiles and runs without the custom header, as it's considered "bad manners" for your code to fail without "mystery" compiler flags.

Also: the standard library implementations of memcpy are usually already optimized with SSE2 and the like. You probably won't be able to do better.

Max E.
  • They do indeed use SSE. However, if the compiler cannot statically determine the alignment, it needs to insert a branch to check if it really is aligned before using the SSE codepath. Alternatively, it could use unaligned SSE, but they tend to be a lot slower. – Mysticial Sep 04 '11 at 19:28
  • Yeah, though I'm not sure whether it's possible to improve on that without doing lots of extra work for each invocation of your custom memcpy, in which case you may as well baby the standard memcpy along instead-- for example, #define malloc(x) memalign(sizeof(size_t), x) or something. – Max E. Sep 04 '11 at 19:35
  • FWIW, I remember reading that there are tricks (in active use) that add a bit of "normal" copying at the start and the end to handle unaligned parts of memory and use SSE for the (much larger) rest. –  Sep 04 '11 at 19:35
  • @Mysticial: The glibc implementation always uses an SSE codepath when the processor supports it and the data is large enough to warrant it. – Dietrich Epp Sep 04 '11 at 19:40
  • @delnan: Correct, but if you already know your data is aligned a certain way, you can skip this "warm-up/cool-down" code. If the compiler doesn't know it's always aligned, it will have to insert clean-up code like this. (or skip SSE altogether) – Mysticial Sep 04 '11 at 19:41
  • "I probably won't be able to do better." I probably will. Anyway, I have to try, I have to learn, even if it's only for the sake of the experiment. – celavek Sep 04 '11 at 21:29
  • @Dietrich Epp what would count as data large enough to warrant the use of any SIMD code path in this context? Would, for example, a buffer exceeding the L1 cache size be considered enough? – celavek Sep 04 '11 at 21:41
  • @celavek: The code doesn't know how big the L1 cache is. Since glibc is open source, you can look at the implementation yourself in `glibc/sysdeps/i386/i686/multiarch/memcpy*.S`. – Dietrich Epp Sep 04 '11 at 21:46
  • If the data isn't aligned, then my guess is the threshold is around 100 - 1000 bytes depending on the system. However, if you know that both your data and your lengths are always aligned to the SIMD vector, then that threshold is zero. (As in, ALWAYS use SSE.) That's how big the overhead of the cleanup code can be if the data isn't aligned. – Mysticial Sep 04 '11 at 21:49
  • @Mysticial: The problem with these guesses is nobody knows if they're right unless you actually do some testing. I suggest you provide us with some hard facts, because I suspect your numbers are far, far off. – Dietrich Epp Sep 04 '11 at 21:52
  • @Dietrich Epp: That's why I gave such a large bound. It's because I don't know how big it is and I can only guess it to an order of magnitude based on prior experience. (Not to mention it's also dependent on the hardware.) – Mysticial Sep 04 '11 at 21:54
  • @Dietrich Epp Of course it doesn't (although sometimes it does, or it's told). I was just asking since you mentioned that about the glibc implementation. – celavek Sep 04 '11 at 22:00
  • Here's one trivial example: Suppose you add a branch to check for alignment. Now suppose that branch gets mispredicted and you take a 10 cycle penalty. In 10 cycles, you could have copied 160 bytes (16 bytes/cycle via SSE) - not counting the alignment check as well as other clean-up code. If this branch isn't inlined, then there's a good chance that the branch predictor may never "lock-on" if the memcpy() is called from other places as well. – Mysticial Sep 04 '11 at 22:02

I'm going to assume that you're running Linux...

The linked gist is an example of how to use LD_PRELOAD to replace existing functions in an application. It intercepts a normal malloc call and then ensures that the returned memory has been zeroed out. It should be fairly obvious how to translate that to memcpy.

https://gist.github.com/701897
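
Translated to memcpy, a preloaded shim could look roughly like the sketch below (the file name, the forwarding body and the build commands are assumptions, not part of the gist): the dynamic linker resolves the application's memcpy to the preloaded definition, and dlsym(RTLD_NEXT, "memcpy") fetches the original glibc version so the wrapper can forward either to it or to a custom implementation.

// preload_memcpy.cpp - hypothetical LD_PRELOAD shim for memcpy.
// Build and run, roughly:
//   g++ -O2 -fPIC -shared -o libpreloadmemcpy.so preload_memcpy.cpp -ldl
//   LD_PRELOAD=./libpreloadmemcpy.so ./your_program
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   // for RTLD_NEXT
#endif

#include <cstddef>
#include <dlfcn.h>

extern "C" void* memcpy(void* dst, const void* src, std::size_t n)
{
    // Look up the "next" memcpy (the real glibc one) once and cache it.
    typedef void* (*memcpy_fn)(void*, const void*, std::size_t);
    static memcpy_fn real_memcpy =
        reinterpret_cast<memcpy_fn>(dlsym(RTLD_NEXT, "memcpy"));

    // Forward to the real implementation; for benchmarking, a custom
    // routine could be called here instead.
    return real_memcpy(dst, src, n);
}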

Bill Lynch

If you're on Linux, memcpy is already very optimized, probably even too much so (I think we once noticed a crash with memcpy reading over a page boundary).

That said, you're perfectly allowed to define a replacement memcpy in your program; it will be called instead of the C library one. You don't have to do anything else.
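
As a minimal sketch of that approach (the naive byte loop is only a placeholder for an actual optimized implementation), a definition like the following in one of the program's source files is resolved in preference to the one in libc:

// my_memcpy.cpp - hypothetical file holding the replacement definition.
// <cstring> is deliberately not included, so the only declaration of memcpy
// in this translation unit is the definition below. Compile this file with
// -fno-builtin so GCC doesn't turn the loop back into a memcpy call.
#include <cstddef>

extern "C" void* memcpy(void* dst, const void* src, std::size_t n)
{
    // Placeholder body: a plain byte copy standing in for the optimized version.
    unsigned char* d = static_cast<unsigned char*>(dst);
    const unsigned char* s = static_cast<const unsigned char*>(src);
    for (std::size_t i = 0; i < n; ++i)
        d[i] = s[i];
    return dst;
}

As noted in the comments on the question, GCC may still expand small, fixed-size memcpy calls inline at the call sites, so the replacement only takes over the out-of-line calls unless the project is built with -fno-builtin-memcpy.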

Per Johansson

I just found another way to replace the memcpy function call. It only works with GCC (I still need to find another way for VC++), but I think it's definitely better than the crude #define way. It uses the __REDIRECT macro (in sys/cdefs.h, included via features.h), which from what I've seen is used extensively in glibc. Below follows an example with a small test:

// modified.h
#pragma once

#ifndef MODIF_H_INCLUDED_
#define MODIF_H_INCLUDED_

#include <cstddef>
#include <features.h>

extern "C"
{
void test_memcpy(void* __restrict to, const void* __restrict from, size_t size);
}

#if defined(__GNUC__)
void __REDIRECT(memcpy, (void* __restrict to, const void* __restrict from, size_t size),
                test_memcpy);
#endif /* __GNUC__ */

#endif /* MODIF_H_INCLUDED_ */

//modified.cpp
#include "modified.h"

#include <iostream>

// Deliberately does not copy anything - it only shows that the redirection works.
extern "C" void test_memcpy(void* __restrict to, const void* __restrict from,
                            size_t size)
{
    std::cout << "Dumb memcpy replacement!\n";
}

//original.h
#pragma once

#ifndef ORIG_H_INCLUDED_
#define ORIG_H_INCLUDED_

void test_with_orig();

#endif /* ORIG_H_INCLUDED_ */

//original.cpp
#include <cstring>
#include <iostream>

void test_with_orig()
{
    int* testDest = new int[10];
    int* testSrc = new int[10];

    for (unsigned int i = 0; i < 10; ++i)
    {
            testSrc[i] = i;
    }

    memcpy(testDest, testSrc, 10 * sizeof(int));

    for (unsigned int i = 0; i < 10; ++i)
    {
            std::cout << std::hex << "\nAfter standard memcpy - " 
            << "Source: " << testSrc[i] << "\tDest: " << testDest[i] << "\n";
    }
}

// and a small test
#include "modified.h"
#include "original.h"

#include <iostream>
#include <cstring>

int main()
{
    int* testDest = new int[10];
    int* testSrc = new int[10];

    for (unsigned int i = 0; i < 10; ++i)
    {
            testSrc[i] = i;
            testDest[i] = 0xDEADBEEF;
    }

    memcpy(testDest, testSrc, 10 * sizeof(int));

    for (unsigned int i = 0; i < 10; ++i)
    {
            std::cout << std::hex << "\nAfter memcpy replacement - " 
            << "Source: " << testSrc[i] << "\tDest: " << testDest[i] << "\n";
    }

    test_with_orig();

    return 0;
}
celavek

If you do not want to change existing code, this seems to be the only available solution; and it should not be too bad, as the compiler will complain if the signature of your own memcpy does not match the default one.

That said, I heavily doubt you will manage to squeeze out considerably better performance than the memcpy that comes with the standard library. But are you really copying so much memory that it becomes an issue at all?

Janick Bernet