Fast memcpy for small unaligned data

Question

I need to read a binary file which is made of many basic types such as int, double, UTF8 strings, etc. For instance, think about one file containing n pairs of (int, double) one after the other, without any alignment with n being in the order of tens of millions. I need to get very fast access to that file. I read the file using fread calls and my own buffer which is about 16 kB long.

A profiler shows that my main bottleneck happens to be copying from the memory buffer to its final destination. The most obvious way to write a a function that copy from the buffer to a double would be:

// x: a pointer to the final destination of the data
// p: a pointer to the buffer used to read the file
//
void f0(double* x, const unsigned char* p) {
  unsigned char* q = reinterpret_cast<unsigned char*>(x);
  for (int i = 0; i < 8; ++i) {
    q[i] = p[i];
  }
}

It I use the following code, I get huge speedup on x86-64

void f1(double* x, const unsigned char* p) {
  double* r = reinterpret_cast<const double*>(p);
  *x = *r;
}

But, as I understand, the program would crash on ARM if p is not 8-byte aligned.

Here are my questions:

Is the second program guaranteed to work on both x86 and x86-64?
How would you write such a function on ARM if you need it as fast as you can?

Here is a small benchmark to test on your machine

#include <chrono>
#include <iostream>

void copy_int_0(int* x, const unsigned char* p) {
  unsigned char* q = reinterpret_cast<unsigned char*>(x);
  for (std::size_t i = 0; i < 4; ++i) {
    q[i] = p[i];
  }
}

void copy_double_0(double* x, const unsigned char* p) {
  unsigned char* q = reinterpret_cast<unsigned char*>(x);
  for (std::size_t i = 0; i < 8; ++i) {
    q[i] = p[i];
  }
}

void copy_int_1(int* x, const unsigned char* p) {
  *x = *reinterpret_cast<const int*>(p);
}

void copy_double_1(double* x, const unsigned char* p) {
  *x = *reinterpret_cast<const double*>(p);
}

int main() {
  const std::size_t n = 10000000;
  const std::size_t nb_times = 200;
  unsigned char* p = new unsigned char[12 * n];
  for (std::size_t i = 0; i < 12 * n; ++i) {
    p[i] = 0;
  }
  int* q0 = new int[n];
  for (std::size_t i = 0; i < n; ++i) {
    q0[i] = 0;
  }
  double* q1 = new double[n];
  for (std::size_t i = 0; i < n; ++i) {
    q1[i] = 0.0;
  }

  const auto begin_0 = std::chrono::high_resolution_clock::now();
  for (std::size_t k = 0; k < nb_times; ++k) {
    for (std::size_t i = 0; i < n; ++i) {
      copy_int_0(q0 + i, p + 12 * i);
      copy_double_0(q1 + i, p + 4 + 12 * i);
    }
  }
  const auto end_0 = std::chrono::high_resolution_clock::now();
  const double time_0 =
      1.0e-9 *
      std::chrono::duration_cast<std::chrono::nanoseconds>(end_0 - begin_0)
          .count();
  std::cout << "Time 0: " << time_0 << " s" << std::endl;

  const auto begin_1 = std::chrono::high_resolution_clock::now();
  for (std::size_t k = 0; k < nb_times; ++k) {
    for (std::size_t i = 0; i < n; ++i) {
      copy_int_1(q0 + i, p + 12 * i);
      copy_double_1(q1 + i, p + 4 + 12 * i);
    }
  }
  const auto end_1 = std::chrono::high_resolution_clock::now();
  const double time_1 =
      1.0e-9 *
      std::chrono::duration_cast<std::chrono::nanoseconds>(end_1 - begin_1)
          .count();
  std::cout << "Time 1: " << time_1 << " s" << std::endl;
  std::cout << "Prevent optimization: " << q0[0] << " " << q1[0] << std::endl;

  delete[] q1;
  delete[] q0;
  delete[] p;

  return 0;
}

The results I get are

clang++ -std=c++11 -O3 -march=native copy.cpp -o copy
./copy
Time 0: 8.49403 s
Time 1: 4.01617 s

g++ -std=c++11 -O3 -march=native copy.cpp -o copy
./copy
Time 0: 8.65762 s
Time 1: 3.89979 s

icpc -std=c++11 -O3 -xHost copy.cpp -o copy
./copy
Time 0: 8.46155 s
Time 1: 0.0278496 s

I did not check the assembly yet but I guess that the Intel compiler is fooling my benchmark here.

almost certainly your toolchain's memcpy will already be optimized in ways you cannot even imagine. For sure it is doing chunks not bytes. Did you actually try memcpy? — pm100, Jan 18 '18 at 00:24
"need it as fast as you can" --> insure `p` is properly aligned before the `f1()` call. Should not be so hard, but then the calling code is not posted. — chux - Reinstate Monica, Jan 18 '18 at 00:27
@chux: I can not insure that `p` is properly aligned as it is a pointer to a buffer that has been loaded from a binary file. Moreover, I have no control over that binary file format as it is a legacy one. — InsideLoop, Jan 18 '18 at 00:29
for example https://github.com/lattera/glibc/blob/master/string/memcpy.c — pm100, Jan 18 '18 at 00:30
You still have options. Code does not need to call `f1(&x, p)`. Yet without seeing the calling code the discussion is theoretical rather than practical. IOWs, the place for real optimizations lies above `f1()`. — chux - Reinstate Monica, Jan 18 '18 at 00:32
On Linux I would simply `mmap` the file, and swallow it whole... — Sam Varshavchik, Jan 18 '18 at 01:00
@user3629249 the posted code is C++, but the basic issue is equally applicable in both C and C++. — plugwash, Jan 30 '20 at 21:42

score 6 · Accepted Answer · answered Jan 18 '18 at 01:11

Is the second program guaranteed to work on both x86 and x86-64?

No.

When you dereference a double* the compiler is free to assume that the memory location actually contains a double, which means that it must be aligned to alignof(double).

A lot of x86 instructions are safe to use for unaligned data, but not all of them. Specifically, there are SIMD instructions which require proper alignment which your compiler is free to use.

This isn't just theoretical; LZ4 used to use something very similar to what you posted (it's C, not C++, so it was a C-style cast not reinterpret_cast, but that doesn't really matter), and everything worked as expected. Then GCC 5 was released, and it auto-vectorized the code in question at -O3 using vmovdqa, which requires proper alignment. The end result is that code which worked fine in GCC ≤ 4.9 started crashing at runtime when compiled with GCC ≥ 5.

In other words, even if your program happens to work today, if you depend on unaligned access (or other undefined behavior), it can easily stop working tomorrow. Don't do it.

How would you write such a function on ARM if you need it as fast as you can?

The answer isn't really ARM-specific. After the LZ4 incident Yann Collet (the author of LZ4) did a lot of research to answer this question. There isn't one option which well generate optimal code with every compiler on every architecture.

Using memcpy() is the safest option. If the size is known at compile time the compiler will generally optimize the memcpy() call away… for larger buffers, you can take advantage of that by calling memcpy() in a loop; you'll generally get a loop of fast instructions without the additional overhead of calling memcpy().

If you're feeling more adventurous you can use a packed union to "cast" instead of reinterpret_cast. This is compiler-specific, but when supported it should be safe, and it may be faster than memcpy().

FWIW, I have some code which attempts to find the optimal way to do this depending on various factors (compiler, compiler version, architecture, etc.). It is a bit conservative about platforms I haven't tested, but it should achieve good results on the vast majority of platforms people actually use.

Thanks so much. I have missed the fact the the compiler could convert `std::memcpy(a, b, n)` to its own code when `n` is known at compile time. I am so used to fight the conversion from a for loop to a real call to `std::memcpy` which is done a lot by the Intel compiler and might slow down the code for small copies. Using `std::memcpy` works beautifully on gcc, clang and icpc. I'll cross my fingers and try Visual Studio 2015 tomorrow. — InsideLoop, Jan 18 '18 at 01:29
I still don't understand your comment on packed union. What would I get from not using `reinterpret_cast` as it is basically an no-operation on the instruction level. — InsideLoop, Jan 18 '18 at 01:34
If you use a packed union the compiler knows that you may still be accessing unaligned data, so it will not generate instructions which assume your data is aligned. `reinterpret_cast`, OTOH, does effectively tell the compiler that it can assume the data is aligned to `alignof(double)`. It also performs better than `memcpy()` on some older compilers (see Yann's blog post on fastcompression.blogspot.fr which I linked to). The code I linked to is basically an abstraction over memcpy and packed unions which tries to choose the fastest implementation (at compile time). — nemequ, Jan 18 '18 at 01:41

Fast memcpy for small unaligned data

1 Answers1