
When we write the following with NumPy (in Python):

my_array[::4] = 0

is it:

for (int i = 0; i < my_array_size; i += 4) {
    my_array[i] = 0;
}

in C?

Or does NumPy use SIMD under the hood?

If so, in what cases is rewriting Python code in C actually useful? I wouldn't expect anyone (without a lot of specialized knowledge) to be able to write SIMD code manually in C (or C++ for that matter).

I'm asking because I can't make the Sieve of Eratosthenes significantly faster in C than in Python, and I was curious as to why. Here are my implementations (Jérôme Richard suggested using bits instead of bytes for the boolean array, which is a good idea):

#include <stdlib.h>
#include <stdbool.h>
#include <math.h>
#include <stdint.h>

typedef struct
{
    int *primes;
    int size;
} Result;

Result EratosthenesSieveC(int n)
{
    uint8_t *nbs = (uint8_t *)malloc((n + 1) / 8 + 1);
    int i, j, count = 0;
    double limit = sqrt(n);

    // Initialize: clear byte 0 so bits 0 and 1 stay false, then set the bit for every i in [2, n].
    // Bits beyond n in the last byte are never read, so they can stay uninitialized.
    nbs[0] = 0;
    for (i = 2; i <= n; i++)
    {
        nbs[i / 8] |= 1 << (i % 8);
    }

    // Apply the Sieve of Eratosthenes algorithm
    for (i = 2; i <= limit; i++)
    {
        if (nbs[i / 8] & (1 << (i % 8)))
        {
            for (j = i * i; j <= n; j += i)
            {
                if (nbs[j / 8] & (1 << (j % 8)))
                {
                    nbs[j / 8] &= ~(1 << (j % 8));
                    count++; // how many numbers are not prime
                }
            }
        }
    }

    // Allocate memory for the array of primes
    int *primes = (int *)malloc((n + 1 - count - 2) * sizeof(int));

    // Store the prime numbers in the 'primes' array
    int index = 0;
    for (i = 2; i <= n; i++)
    {
        if (nbs[i / 8] & (1 << (i % 8)))
        {
            primes[index++] = i;
        }
    }

    // Free the memory used by the 'nbs' array
    free(nbs);

    Result result;
    result.primes = primes;
    result.size = n - count - 1;

    return result;  // Note: I never really deallocate the memory of result which might be causing memory leaks??
}

Numpy:

def EratosthenesSieveFullNumpy(n):
    nbs = np.ones(n+1, dtype=bool)
    nbs[:2] = 0
    for i in range(2, int(n**0.5)+1):
        if nbs[i]:
            nbs[i*i::i] = 0
    return np.where(nbs)[0]

It's worth noting that this element-by-element implementation is really slow:

def EratosthenesSieveNumpy(n):
    nbs = np.ones(n+1, dtype=bool)
    nbs[:2] = 0
    for i in range(2, int(n**0.5)+1):
        if nbs[i]:
            for j in range(i*i, n+1, i):
                nbs[j] = 0
    return np.where(nbs)[0]

This is why I suspected SIMD was the reason I couldn't make my C code faster.

To call my C code from Python I used:

import ctypes

class Result(ctypes.Structure):
    _fields_ = [('primes', ctypes.POINTER(ctypes.c_int)), ('size', ctypes.c_int)]

lib = ctypes.cdll.LoadLibrary(my_path_of_compiled_shared_library)
lib.EratosthenesSieveC.restype = Result

def EratosthenesSieveC(n, lib):
    result = lib.EratosthenesSieveC(n)
    return result.primes[:result.size]
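
For reference, here is a rough sketch of how the shared library could be built and how the malloc'd buffer could be released from Python, since I never free `result.primes` above. The gcc command and the use of libc's `free` through `ctypes.CDLL(None)` are assumptions (a POSIX toolchain and a library sharing the process allocator), not something I verified:

import ctypes
import numpy as np

class Result(ctypes.Structure):
    _fields_ = [('primes', ctypes.POINTER(ctypes.c_int)), ('size', ctypes.c_int)]

# Assumed build command (adapt to your toolchain):
#   gcc -O3 -shared -fPIC sieve.c -o libsieve.so -lm

lib = ctypes.cdll.LoadLibrary(my_path_of_compiled_shared_library)  # same placeholder as above
lib.EratosthenesSieveC.argtypes = [ctypes.c_int]
lib.EratosthenesSieveC.restype = Result

# On POSIX systems, CDLL(None) exposes the symbols of the running process, including libc's free.
libc = ctypes.CDLL(None)
libc.free.argtypes = [ctypes.c_void_p]
libc.free.restype = None

def EratosthenesSieveC_copy_and_free(n):
    result = lib.EratosthenesSieveC(n)
    # Copy the C buffer into a NumPy array owned by Python...
    primes = np.ctypeslib.as_array(result.primes, shape=(result.size,)).copy()
    # ...then release the malloc'd buffer so it does not leak.
    libc.free(result.primes)
    return primes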

My results are: https://i.imgur.com/1vUJDv7.png

  • Yes, that's exactly what it is doing under the hood. Numpy is compiled cleverly and was written by very experienced devs, so the Numpy code is often faster than naïve C code. For the specific case you are referring to, you will always be limited by RAM speed. It is typical for your CPU to run at around 100 FLOPs per RAM cycle, which means that if you are not doing much calculation you will be limited by your RAM. No matter how clever you are with the CPU there will be no advantage. – Simon Tartakovksy Jul 23 '23 at 20:17
  • AFAIR, we do not optimize the addition so far to a constant, so it is `i += stride` and stride is a variable, not a compile-time constant. We have explicitly supported the case where `stride` is 1 because the compiler can generate faster code. Some parts of the code are manually vectorized using SIMD instructions, but in many cases the compiler does the job very well (with some additional help sometimes). – Jérôme Richard Jul 23 '23 at 20:30
  • Note that strided accesses like this are not optimizable using SIMD instructions on most architectures (especially x86-64). There are gather/scatter instructions, but they are not very efficient so far and I do not expect them to be faster in this case. We use them in some rare cases though (AFAIR when AVX-512 is enabled and only for indirect indexing). – Jérôme Richard Jul 23 '23 at 20:32
  • Rewriting Python code to C is useful when dealing with many small arrays (due to the significant overhead of Numpy calls), for implementing specialized operations that are not yet in Numpy (Numpy only implements generic operations) or a sequence of existing operations computed in a row (CPU cache and registers matter a lot), to use multiple threads, and to implement some non-trivial data structures (e.g. heaps/trees/hash-maps stored as arrays). – Jérôme Richard Jul 23 '23 at 20:39
  • @SimonTartakovksy does SIMD help to overcome RAM speed or not at all? – FluidMechanics Potential Flows Jul 23 '23 at 20:39
  • @JérômeRichard I'm not sure I understand what you mean by "AFAIR, we do not optimize the addition so far to a constant so it is i += stride and stride is a variable, not a compile time constant". I'm sorry I'm not an expert, would you mind explaining like I'm 5? – FluidMechanics Potential Flows Jul 23 '23 at 20:40
  • @JérômeRichard When you say "Rewriting Python code to C is useful when dealing with many small arrays (due to the significant overhead of Numpy calls)", there is also overhead when calling C from Python, right? Although maybe you mean there is some but still less than the Numpy overhead? – FluidMechanics Potential Flows Jul 23 '23 at 20:42
  • Regarding the constant: when you write a loop with `i+=stride` and `stride` is a variable not known at compile time, then the target compiler can fail to generate fast SIMD code. Indeed, SIMD instructions are mainly useful when operating on *contiguous items* (so generally when `i+=1`): if the accesses are contiguous, then you can often write a specific sequence of SIMD instructions doing the work efficiently, but if the accesses are strided, then you need to write different (less efficient) code. The optimal sequence of instructions to use depends on the value of this stride. – Jérôme Richard Jul 23 '23 at 20:52
  • AFAIK nothing makes RAM fetches faster (assuming you are using the cache correctly), short of buying faster hardware. At the end of the day the CPU needs to retrieve all the values from RAM regardless of how it uses them. – Simon Tartakovksy Jul 23 '23 at 20:53
  • Regarding the overhead, yes, there is an overhead when calling C from Python, but the overhead of some Numpy functions can be huge because of the way we have implemented generic operations (so as to ease maintenance). For example, Numpy internally has generic iterators to do broadcasting automatically, but they are slow when iterating on small slices (e.g. computing the means of 3D points stored as an Nx3 array). Another example: we generally do not support operations on arrays having different types: they need to be converted first (see [this post](https://stackoverflow.com/a/74633899/12939557)). – Jérôme Richard Jul 23 '23 at 20:58
  • The overhead can also be significant because of type/value checking, bounds checking, index wraparound, etc. In some cases you do not care about this, because speed matters more than having a safety net or advanced features (unused in your case but still expensive). There is a price to pay though: if something is wrong in your C code (a bug), then the Python interpreter can crash or even be corrupted. – Jérôme Richard Jul 23 '23 at 21:05
  • So in this specific case (from my post), either the bottleneck comes from the RAM (so C or numpy, we don't care, it'll be equally slow), or the C code will be as fast as the numpy code because there is no SIMD, since it's not an instruction performed on contiguous items. There might be a world where the C code is faster, but that's just because on small arrays numpy does more safety checks and overall numpy has a tiny bit more overhead? – FluidMechanics Potential Flows Jul 23 '23 at 21:29
  • Obviously I'm talking about calling this C code from Python or using numpy within Python. I'm not talking about a world where Python is not used at all, in which case we gain performance because the C code is compiled only once. – FluidMechanics Potential Flows Jul 23 '23 at 21:31
  • @SimonTartakovksy: To avoid bottlenecking on RAM bandwidth or latency, cache-block your code so you get hits in some level of cache, preferably L2. If you don't do that, then yes, SIMD may only help a small amount if scalar asm was already close to saturating memory bandwidth, common with double-precision float especially on server CPUs (lower per-core memory bandwidth than desktops, even with other cores idle). ([How does cache blocking actually speed up performance?](https://stackoverflow.com/q/63614160) / https://en.wikipedia.org/wiki/Loop_nest_optimization). – Peter Cordes Jul 23 '23 at 22:10
  • And yes, this means restructuring your algorithms so you do multiple steps on one small part of your array before moving on to other parts of the array. And it's even harder if the operations aren't purely "vertical", e.g. you need the sum of the whole array from the previous step to do something to each element in the next. – Peter Cordes Jul 23 '23 at 22:12
  • In this case, I expect the bottleneck to be the RAM indeed, assuming the array is large enough not to fit in the CPU cache (though this is actually a bit more complex in practice). For small arrays, this is pretty complicated. In fact, SIMD instructions could still be used in this specific use-case because the stride is very small. Most details matter when it comes to performance (this is why guessing is so hard and benchmarks are often biased). What is the exact type of `my_array`? – Jérôme Richard Jul 23 '23 at 23:18
  • Release notes for 1.22 talk about "the ongoing work to provide SIMD support for commonly used functions," – hpaulj Jul 23 '23 at 23:58
  • `my_array` would be a one dimensional `numpy.ndarray` – FluidMechanics Potential Flows Jul 24 '23 at 09:17
  • By *"exact type of `my_array`"*, I mean for example `np.int32`, `np.float32`, `np.float64`, etc. You can get this information with `my_array.dtype`. – Jérôme Richard Jul 24 '23 at 12:46
  • Oh sorry, the type of the elements inside the arrays is `np.int32`. – FluidMechanics Potential Flows Jul 24 '23 at 15:42
  • I was coding the Sieve of Eratosthenes in Python (with numpy) and in C (then calling it from Python), and then I compared the execution times of both. I was wondering why numpy was just as fast as C, even though on other benchmarks of the same type I had performed, C was always "winning" by a small but significant margin. – FluidMechanics Potential Flows Jul 24 '23 at 15:43
  • Ok. Thanks for the context. For a Sieve of Eratosthenes (SoE), the strides are generally quite big on average and the arrays are not small, so C and Numpy will be about equally fast. The operation will clearly be memory-bound. To make this faster, you can encode each number as a boolean (1 byte) as it uses less RAM and the array might even fit in CPU caches. Even better: you can store each number in 1 bit. This is a case where Numpy will not be efficient (bit-manip in Numpy is not fast nor does it provide a lot of features) compared to C, and the final C code should be fast, and faster than Numpy. – Jérôme Richard Jul 24 '23 at 16:26
  • "To make this faster, you can encore each number in booleans (1 byte) as it use less RAM and the array might even fit in CPU caches." I already had done that! "Even better: you can store each numbers in 1 bit. This is a case where Numpy will not be efficient (bit-manip in Numpy is not fast nor provide a lot of features) compared to C and the final C code should ba fast and faster than Numpy." That's a very clever idea, I had forgotten booleans were stored on 8 bits even though it was a bit unecessary. However, that would only affect memory complexity right? Not so much time complexity. – FluidMechanics Potential Flows Jul 25 '23 at 09:35
  • I implemented the changes and C still only "wins" for small array sizes (it was the case before too, consistent with what you said) and by a tiny bit for very large array sizes: https://i.imgur.com/1vUJDv7.png. Surely the memory complexity has now been divided by 8, but I'm still a bit confused. I included the source code of each of my implementations in my post. – FluidMechanics Potential Flows Jul 25 '23 at 09:38
  • The complexity is left unchanged, but it will make the array 8 times smaller, which is a big improvement since the RAM should be the bottleneck (for large arrays). The computation will be more expensive, but it should not be much of an issue since the algorithm should be memory-bound (for large arrays, still). Besides, you can also store only the odd numbers to make it 16 times smaller (a rough NumPy sketch of this odds-only idea is included after these comments). With that, a hundred million numbers can be computed entirely in the L3 cache (which is significantly faster than the RAM). – Jérôme Richard Jul 25 '23 at 11:03
  • @JérômeRichard: In a well-optimized (C or asm) SoE for sizes that aren't too huge, a significant amount of time is spent in the first few strides, crossing out 3s, 5s, 7s, etc., since N/3 is much larger than N/37. But maybe I'm just remembering some tuning for medium-size problems where the array fits in L2 or L3 cache, like I commented about in [Sieve of Eratosthenes in x86 assembly](https://codereview.stackexchange.com/posts/comments/548824) and the two other godbolt links in a later comment. For strides much larger than 512, you're not touching every cache line which is interesting. – Peter Cordes Jul 25 '23 at 15:01
  • Oh, I think what I'm remembering is that if you use a `uint64_t bitmap[]`, you end up load/modify/storing the same `uint64_t` *many* times with small bit-strides, with such a long dependency chain that out-of-order exec has a hard time hiding it. 16-bit chunks were more efficient for crossing out (since `btr bp, cx` does mod 2^16 for free), but 64-bit chunks are better for searching for the next prime. C sucks at accessing the same array with different element widths; asm makes it easy. (I have better versions of the asm in a file locally that I never got around to posting, keeping only odds.) – Peter Cordes Jul 25 '23 at 15:08
  • "a significant amount of time is spent in the first few strides, crossing out 3s, 5s, 7s, etc., since N/3 is much larger than N/37" yeah of course, my point was to compare python and c imported from python so as long as they were consistent with each other i didn't mind that much – FluidMechanics Potential Flows Jul 25 '23 at 15:28
  • Unfortunately asm is way beyond what I know in computer science – FluidMechanics Potential Flows Jul 25 '23 at 15:29
  • @PeterCordes Good point. That being said, one can unroll the loop a bit so the compiler should generate the instructions in a way such that nearby ones are independent (AFAIK Clang does that). Tiling should also help a bit for large arrays (i.e. discarding multiples of several prime numbers at the same time for a given array slice) since cache lines can be reused (especially for the first prime numbers). – Jérôme Richard Jul 25 '23 at 20:15
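
Following up on Jérôme Richard's suggestion above to store only the odd numbers, here is a rough, unverified sketch of what that could look like in NumPy (still one byte per flag; the function name and the details are mine, not something from the discussion):

import numpy as np

def EratosthenesSieveOddsOnly(n):
    # Index k of the sieve represents the odd number 2*k + 1, halving the memory footprint.
    if n < 2:
        return np.array([], dtype=np.int64)
    sieve = np.ones((n + 1) // 2, dtype=bool)
    sieve[0] = False  # 1 is not prime
    for i in range(3, int(n**0.5) + 1, 2):
        if sieve[i // 2]:
            # Cross out odd multiples of i starting at i*i; consecutive odd multiples
            # differ by 2*i in value, hence by i in index.
            sieve[i * i // 2::i] = False
    odds = 2 * np.nonzero(sieve)[0] + 1
    return np.concatenate(([2], odds))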
