
It's my first time posting here and I hope you guys can answer my question.

I'm writing a Python script that finds all of the prime numbers up to some number n using the Sieve of Eratosthenes. To make the program more efficient, I want to sieve the range of numbers in "segments" that fit inside the CPU's L1 cache. I know my own CPU's L1 cache size: it's 576 KB. But I want this program to work on any computer.

Is there any way I can get the CPU's L1 cache size? I specifically want the L1 cache, not L2 or L3.

Anish Shanbhag
  • Even if you could, why do you think this will speed up your program in any meaningful way? – Blender Apr 12 '18 at 01:05
  • Do you really have 576KB of L1 cache? I'm pretty sure Intel chips are always powers of two, and only go up to 32K/32K, except for a few Xeons in a past generation that did 64K/32K before they realized it didn't speed up any real-life loads. – abarnert Apr 12 '18 at 01:11
  • 1
    Assuming that you got the size of L1 cache, what will it tell you about the optimal size of the segment? You are not coding in assembly. It's an interpreted scripting language. It will have its own opinion about what belongs into the L1 cache. Setting the size of the segment to the size of the cache probably wouldn't help. On each system, you would have to try out multiple sizes of caches to understand what works best. But then you can start trying out different sizes directly, without looking up the size of the L1 cache in the first place. – Andrey Tyukin Apr 12 '18 at 01:12
  • I have an AMD Ryzen 5 1600, which has 576 KB of total L1 cache. – Anish Shanbhag Apr 12 '18 at 01:19
  • I saw this question regarding cache locality: https://stackoverflow.com/questions/12065774/why-does-cache-locality-matter-for-array-performance and I also read here: http://warp.povusers.org/programming/sieve_of_eratosthenes.html – Anish Shanbhag Apr 12 '18 at 01:19
  • "Total L1 cache" doesn't mean anything at all. You have 32KB data cache per core. The fact that there's another 160KB sitting around on the other cores you're using doesn't help, and the fact that there's 384KB of instruction cache helps even less. – abarnert Apr 12 '18 at 01:21
  • 1
    At any rate, if you want to optimize a Sieve of Eratosthenes, your first steps should be: (1) Make sure you actually have implemented the sieve, and not an equivalent of the much slower algorithm that got published in a Scheme textbook and has spread over the whole world ever since. (2) Use PyPy instead of CPython, or numpy, or possibly gmpy. (3) Profile the Python code and optimize the high-level Python stuff. (4) Profile it again and micro-optimize little things like global lookups … (42) Profile the interpreter while it's running your code and worry about the CPU cache. – abarnert Apr 12 '18 at 01:26
  • Anyway, to answer your direct question: the `cpuid` instruction and its relatives are the way to get information about the cache. They should even report the relevant 32KB L1 data cache per core rather than the useless 576KB number you're looking for. Python of course has no way to execute arbitrary assembly instructions directly, but there are multiple modules on PyPI that do it. I have no idea how up-to-date or reliable any of them are. – abarnert Apr 12 '18 at 01:45
  • If you want to do it yourself, you can create a C extension library, or a regular C library that you access via `ctypes` or `cffi`. [This answer](https://stackoverflow.com/a/32545972/908494) shows how to write a C function with inline assembly that I think will work on Mac and Linux with clang and gcc; you'll probably need to find another source for how to do it on Windows with MSVC. (Or maybe just look at the source of those packages on PyPI.) [Wikipedia](https://en.wikipedia.org/wiki/CPUID) has a pretty good explanation of the basics of using `cpuid`. On Linux there is also a sysfs shortcut; see the sketch after these comments. – abarnert Apr 12 '18 at 01:47
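For instance, here is a minimal sketch of that Linux-only sysfs route, which sidesteps `cpuid` entirely. The function name is just illustrative; on Windows or macOS you would need one of the `cpuid` approaches mentioned above.

```python
import os

def l1_data_cache_size():
    """Return the per-core L1 data cache size in bytes, or None.

    Linux-only sketch: reads the cache hierarchy the kernel exposes
    under sysfs instead of executing cpuid directly.
    """
    base = "/sys/devices/system/cpu/cpu0/cache"
    try:
        for entry in os.listdir(base):
            if not entry.startswith("index"):
                continue
            path = os.path.join(base, entry)
            with open(os.path.join(path, "level")) as f:
                level = f.read().strip()
            with open(os.path.join(path, "type")) as f:
                kind = f.read().strip()
            if level == "1" and kind == "Data":
                with open(os.path.join(path, "size")) as f:
                    size = f.read().strip()  # e.g. "32K"
                units = {"K": 1024, "M": 1024 ** 2}
                if size and size[-1] in units:
                    return int(size[:-1]) * units[size[-1]]
                return int(size)
    except (OSError, ValueError):
        return None
    return None

print(l1_data_cache_size())  # e.g. 32768 on most x86 cores
```

Note that this reports the per-core L1 data cache (the 32 KB figure from the comments), not the 576 KB aggregate.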

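And following up on Andrey Tyukin's suggestion to just measure: here is a hypothetical benchmarking harness. It assumes a `segmented_sieve(n, segment_size)` function like the one sketched after the answer below; the name and signature are illustrative, not a library API.

```python
import timeit

# Hypothetical harness: segmented_sieve(n, segment_size) is assumed to be
# a function like the sketch after the answer below, not a library call.
for size_kb in (8, 16, 32, 64, 128, 256):
    seg = size_kb * 1024
    t = timeit.timeit(lambda: segmented_sieve(10**7, seg), number=3)
    print(f"segment {size_kb:>3} KB: {t:.2f} s")
```

Whatever segment size wins on a given machine is the answer for that machine, no cache lookup required.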
1 Answer


Python is a garbage-collected language. One consequence of this is that memory is automatically allocated and freed as needed, which leads to memory fragmentation that can break apart transfers to the CPU caches. It's also not possible to control the layout of a data structure directly in memory, so a single transfer on the bus might not contain all the relevant information for a computation, even if it would all fit within the bus width. This hurts any prospect of keeping the L1/L2 caches filled with the relevant data for the next computation.
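As a small illustration of the layout point, using only the standard library: a Python list holds pointers to separately allocated objects, while `array.array` keeps the raw values in one contiguous buffer (exact byte counts vary by platform and Python version):

```python
import array
import sys

n = 1000

# A list stores pointers to individually heap-allocated int objects,
# so the values themselves can be scattered anywhere in memory.
as_list = list(range(n))

# array.array stores the raw values in one contiguous C buffer, which
# is what a cache line can actually pull in as a unit.
as_array = array.array("l", range(n))

print(sys.getsizeof(as_list))   # pointer table only; the ints live elsewhere
print(sys.getsizeof(as_array))  # the contiguous buffer of raw values
```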

Another problem comes from Python's dynamic types and the fact that it is not compiled ahead of time. Many C developers eventually realize that the compiler is usually smarter than they are: it can decide how data is laid out, which instructions the CPU will run and in what order, and how best to optimize them. Python, however, is interpreted and, to make matters worse, has dynamic types, which makes inferring optimization opportunities in an algorithm much harder, since the behavior of code can change at runtime.
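That dynamism is visible in the bytecode itself: even a one-line function compiles to a generic opcode that must inspect its operand types on every call.

```python
import dis

def add(a, b):
    return a + b

# The addition compiles to a single generic opcode (BINARY_ADD, or
# BINARY_OP on newer CPythons) that dispatches on the runtime types of
# a and b, rather than to a fixed machine instruction.
dis.dis(add)
```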

As mentioned in the comments, there are ways to mitigate such problems, foremost being PyPy, which JIT-compiles Python code, or the Cython extension, which allows Python code to be compiled to C.
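For completeness, here is one way the idea from the question might look in plain Python: a sketch of a segmented sieve with the segment size as a tunable parameter rather than something looked up from the hardware. The 32 KB default simply mirrors the per-core L1d figure from the comments; this is an illustration, not a tuned implementation.

```python
import math

def segmented_sieve(n, segment_size=32 * 1024):
    """Return all primes <= n, sieving one segment at a time.

    segment_size is a tunable guess (default 32 KB, matching a typical
    per-core L1 data cache); benchmark it rather than trusting a lookup.
    """
    limit = math.isqrt(n)
    # Ordinary sieve up to sqrt(n) to collect the base primes.
    is_prime = bytearray([1]) * (limit + 1)
    is_prime[0:2] = b"\x00\x00"
    for i in range(2, math.isqrt(limit) + 1):
        if is_prime[i]:
            is_prime[i * i :: i] = bytes(len(range(i * i, limit + 1, i)))
    base_primes = [i for i in range(2, limit + 1) if is_prime[i]]

    primes = list(base_primes)
    # Sieve [low, high) one cache-sized segment at a time.
    for low in range(limit + 1, n + 1, segment_size):
        high = min(low + segment_size, n + 1)
        segment = bytearray([1]) * (high - low)
        for p in base_primes:
            # First multiple of p inside the segment, but never below p*p.
            start = max(p * p, (low + p - 1) // p * p)
            segment[start - low :: p] = bytes(len(range(start, high, p)))
        primes.extend(low + i for i, flag in enumerate(segment) if flag)
    return primes

print(segmented_sieve(100))  # [2, 3, 5, 7, 11, ..., 97]
```

Even so, per the comments above, switching to PyPy or numpy is likely to buy far more than cache-sized segments ever will in CPython.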

l'L'l