Memory alignment for fast FFT in Python using shared arrays

Question

I am writing an image processing app that has to do several things as close to real time as possible. Data acquisition and processing run in separate processes (mainly for performance reasons). The data itself is quite large (2 MPix 16-bit grayscale images).

I can share arrays between processes as described in this post: How do I pass large numpy arrays between python subprocesses without saving to disk? (I use the shmarray script from the numpy-shared package). I can run the FFT supplied with NumPy on that data without problems, but it is quite slow.

Calling FFTW would probably be much faster, but in order to fully benefit from it, I am supposed to run my operations on arrays that are memory aligned.

The question: Is there a way to create and share NumPy-like arrays between processes that are, at the same time, guaranteed to be memory aligned?

What Python FFTW bindings do you use? These bindings should provide a way to allocate correctly aligned memory blocks. — Sven Marnach, Mar 27 '12 at 19:01
You are right, but the tricky part is that I need to share those arrays between multiple processes. I am quite sure that those functions don't create "shared" arrays. — bubla, Mar 27 '12 at 21:19

Sven Marnach · Accepted Answer · 2012-03-27T23:04:02.113

10

The simplest standard trick to get correctly aligned memory is to allocate a bit more than needed and skip the first few bytes if the alignment is wrong. If I remember correctly, NumPy arrays will always be 8-byte aligned, and FFTW requires 16-byte alignment to perform best. So you would simply allocate 8 bytes more than needed, and skip the first 8 bytes if necessary.

Edit: This is rather easy to implement. The pointer to the data is available as an integer in the ctypes.data attribute of a NumPy array. Using the shifted block can be achieved by slicing, viewing as a different data type and reshaping -- none of these copy the data; they all reuse the same buffer.

To allocate a 16-byte aligned 1000x1000 array of 64-bit floating point numbers, we could use this code:

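import numpy

m = n = 1000
dtype = numpy.dtype(numpy.float64)
nbytes = m * n * dtype.itemsize
# Over-allocate by 16 bytes so that an aligned start always fits in the buffer.
buf = numpy.empty(nbytes + 16, dtype=numpy.uint8)
# Number of bytes to skip to reach the next multiple of 16 (0 if already aligned).
start_index = -buf.ctypes.data % 16
# Slice, reinterpret as float64 and reshape; none of these steps copy the data.
a = buf[start_index:start_index + nbytes].view(dtype).reshape(m, n)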

Now, a is an array with the desired properties, as can be verified by checking that a.ctypes.data % 16 is indeed 0.
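The same steps can be wrapped in a small helper, which also covers the 2D complex case asked about in the comments. This is only a minimal sketch: `aligned_empty` is a hypothetical name, not a NumPy function, and the padding is set to the full `alignment` so that values other than 16 work as well.

```python
import numpy as np

def aligned_empty(shape, dtype=np.float64, alignment=16):
    """Return an uninitialized array whose data pointer is a multiple of `alignment`.

    Hypothetical helper (not part of NumPy): over-allocate a byte buffer,
    skip to the first aligned byte, then reinterpret the bytes without copying.
    """
    dtype = np.dtype(dtype)
    nbytes = int(np.prod(shape)) * dtype.itemsize
    buf = np.empty(nbytes + alignment, dtype=np.uint8)
    start = -buf.ctypes.data % alignment  # bytes to skip, 0 if already aligned
    return buf[start:start + nbytes].view(dtype).reshape(shape)

# Example: a 2D array of complex numbers, aligned to 16 bytes.
c = aligned_empty((1000, 1000), dtype=np.complex128)
assert c.ctypes.data % 16 == 0
```

The returned array keeps the padded byte buffer alive through its `base` attribute, so the buffer does not need to be stored separately.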

edited Mar 27 '12 at 23:04

answered Mar 27 '12 at 19:06

Sven Marnach


Could you provide an example, please? Let's suppose, for instance, that I want to deal with 2D complex numbers arrays. – bubla Mar 27 '12 at 21:23
I see two problems here: 1. How to find out how many bytes to skip? 2. How to carry out the skip and get a rectangular array of desired dimensions as the output? 3. Do you think that contributing this (maybe quite simple) trick to the shmarray module would make sense? ( https://bitbucket.org/cleemesser/numpy-sharedmem/raw/5ca092f8222a/shmarray.py )? – bubla Mar 27 '12 at 21:29
@MarkBorgerding: Python has automatic memory management -- you won't have to bother with such details. – Sven Marnach Mar 27 '12 at 22:54
@bubla: Sorry for not including the code example right from the beginning. The code in the edit to my answer should cover 1 and 2. Regarding 3, I don't think this would be necessary since the above trick can also be applied to an array allocated by `shmarray`. – Sven Marnach Mar 27 '12 at 22:56
Cool, your solution works. The shifted shm-allocated array is indeed accessible from other processes. Many thanks! – bubla Mar 27 '12 at 23:45
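As a rough illustration of applying the trick to a shared block, here is a sketch that uses the standard library's `multiprocessing.shared_memory` (Python 3.8+) as a stand-in for `shmarray`, which is not used here. The names `aligned_view` and `demo_shared_aligned` are hypothetical. The second attachment happens in the same process purely to keep the example short; another process would attach by name in the same way. Reusing the creator's byte offset assumes that every mapping of the block starts at a page-aligned address, which is why the same offset stays valid elsewhere.

```python
import numpy as np
from multiprocessing import shared_memory

def aligned_view(buffer, shape, dtype, alignment=16, offset=None):
    """View `buffer` as an aligned array; returns (array, byte offset used)."""
    dtype = np.dtype(dtype)
    nbytes = int(np.prod(shape)) * dtype.itemsize
    raw = np.frombuffer(buffer, dtype=np.uint8)
    if offset is None:
        offset = -raw.ctypes.data % alignment  # bytes to skip
    view = raw[offset:offset + nbytes].view(dtype).reshape(shape)
    return view, offset

def demo_shared_aligned():
    shape, dtype = (4, 8), np.complex128
    nbytes = int(np.prod(shape)) * np.dtype(dtype).itemsize
    # Reserve 16 spare bytes for the possible shift.
    shm = shared_memory.SharedMemory(create=True, size=nbytes + 16)
    try:
        a, offset = aligned_view(shm.buf, shape, dtype)
        a[:] = 1 + 2j
        # Attach a second time by name, as a consumer process would.
        other = shared_memory.SharedMemory(name=shm.name)
        try:
            b, _ = aligned_view(other.buf, shape, dtype, offset=offset)
            ok = (a.ctypes.data % 16 == 0) and bool((b == 1 + 2j).all())
            del b  # drop the view before closing the mapping
        finally:
            other.close()
        del a
    finally:
        shm.close()
        shm.unlink()
    return ok

if __name__ == "__main__":
    print(demo_shared_aligned())
```

The array views are deleted before `close()` because a shared memory block cannot be closed while NumPy arrays still export its buffer.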

score 3 · Answer 2 · answered Nov 29 '13 at 20:49

Generalizing on Sven's answer, this function will return an aligned copy (if needed) of any numpy array:

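import numpy as np
def aligned(a, alignment=16):
    # Already aligned: return the input unchanged (no copy).
    if (a.ctypes.data % alignment) == 0:
        return a

    # Note: this is Python 2 code, where `/` on integers is integer division.
    # Under Python 3 it yields floats and raises a TypeError; use `//` there.
    extra = alignment / a.itemsize
    buf = np.empty(a.size + extra, dtype=a.dtype)
    # Number of elements to skip to reach the next aligned address.
    ofs = (-buf.ctypes.data % alignment) / a.itemsize
    aa = buf[ofs:ofs+a.size].reshape(a.shape)
    np.copyto(aa, a)
    assert (aa.ctypes.data % alignment) == 0
    return aa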

Dan Houghton · Answer 3 · 2021-01-21T10:49:57.843

0

I ran payne's answer in 2021 and got type errors (Python 3.7, Numpy 1.18.5), so I've adjusted the code:

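import numpy as np

def aligned(a, alignment=16):
    # Already aligned: return the input unchanged (no copy).
    if (a.ctypes.data % alignment) == 0:
        return a
    # The shift below is counted in whole elements, so the item size
    # must divide the alignment.
    assert alignment % a.itemsize == 0
    extra = alignment // a.itemsize
    # Over-allocate by `extra` elements to leave room for the shift.
    buf = np.empty(a.size + extra, dtype=a.dtype)
    # Number of elements to skip to reach the next aligned address.
    ofs = (-buf.ctypes.data % alignment) // a.itemsize
    aa = buf[ofs:ofs + a.size].reshape(a.shape)
    np.copyto(aa, a)
    assert aa.ctypes.data % alignment == 0
    return aa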

I changed it to use integer division to remove the type errors, and added an extra assert for a sanity check.
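To exercise a function like this, it helps to have an input that is known to be misaligned. The sketch below builds one on purpose and pairs it with a byte-based variant of the copy, which shifts in bytes rather than elements and therefore does not need the `alignment % a.itemsize == 0` check. Both `misaligned_array` and `aligned_copy` are hypothetical names introduced for illustration, and `misaligned_array` assumes the item size is not a multiple of the alignment (true for float64 and 16 bytes).

```python
import numpy as np

def misaligned_array(n, dtype=np.float64, alignment=16):
    """Return an n-element array whose data pointer is NOT a multiple of `alignment`."""
    dtype = np.dtype(dtype)
    nbytes = n * dtype.itemsize
    buf = np.empty(nbytes + 2 * alignment, dtype=np.uint8)
    # Go to the next aligned address, then step one item past it.
    start = (-buf.ctypes.data % alignment) + dtype.itemsize
    return buf[start:start + nbytes].view(dtype)

def aligned_copy(a, alignment=16):
    """Byte-based variant: return `a` itself if aligned, else an aligned copy."""
    if a.ctypes.data % alignment == 0:
        return a
    buf = np.empty(a.nbytes + alignment, dtype=np.uint8)
    start = -buf.ctypes.data % alignment  # shift measured in bytes
    out = buf[start:start + a.nbytes].view(a.dtype).reshape(a.shape)
    np.copyto(out, a)
    return out

src = misaligned_array(1024)
src[:] = np.arange(1024)
dst = aligned_copy(src)
assert src.ctypes.data % 16 != 0
assert dst.ctypes.data % 16 == 0
```

Calling `aligned_copy` on an array that is already aligned returns the same object, so the function is cheap to apply unconditionally.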

edited Jan 21 '21 at 10:49

answered Jan 20 '21 at 09:12

Dan Houghton


While this may prove useful, you should provide detailed explanation. Maybe an edit or a comment on a prior answer would be more appropriate? – PJProudhon Jan 20 '21 at 09:46
Unfortunately I am a new member and can neither comment ("you must have 50 reputation to comment") nor edit ("edit queue is full") – Dan Houghton Jan 21 '21 at 10:48
