
I'm trying to understand whether Python libraries are compiled, because I want to know whether the interpreted code I write will perform the same or worse.

E.g., I saw it mentioned somewhere that numpy and scipy are efficient because they are compiled. I don't think this means bytecode-compiled, so how was this done? Was it compiled to C using something like Cython? Or was it written in a language like C and compiled in a compatible way?

Does this apply to all modules or is it on a case-by-case basis?

darkace

3 Answers


NumPy and several other libraries are partly wrappers for code written in C and other languages such as Fortran, which, when compiled, runs faster than Python. This helps by avoiding the cost of Python-level loops, pointer indirection, and per-element dynamic type checking. This is explained in this question:

Numpy arrays are densely packed arrays of homogeneous type. Python lists, by contrast, are arrays of pointers to objects, even when all of them are of the same type. So, you get the benefits of locality of reference.

Also, many Numpy operations are implemented in C, avoiding the general cost of loops in Python, pointer indirection and per-element dynamic type checking. The speed boost depends on which operations you're performing, but a few orders of magnitude isn't uncommon in number crunching programs.
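As a rough illustration (my own minimal sketch, not from the linked question), you can time a pure-Python reduction against the equivalent whole-array NumPy call:

import timeit

import numpy as np

data = list(range(1_000_000))
arr = np.arange(1_000_000)

# Pure-Python loop: per-element dynamic dispatch on boxed integers
py_time = timeit.timeit(lambda: sum(x * x for x in data), number=10)

# NumPy whole-array operation: a single call into compiled C code
np_time = timeit.timeit(lambda: np.sum(arr * arr), number=10)

print(f"pure Python: {py_time:.2f}s   numpy: {np_time:.2f}s")

On a typical machine the NumPy version comes out roughly an order of magnitude or more faster, for exactly the reasons quoted above.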

Python code that is compiled to bytecode (.pyc files) is a separate topic: Python scripts are byte-compiled to reduce startup time (see this question).
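You can trigger that byte-compilation step yourself with the standard library's py_compile module (a minimal sketch; yourscript.py is a placeholder for any script you have on disk):

import py_compile

# Byte-compile a source file ahead of time; this is the same step the
# interpreter performs implicitly the first time a module is imported
pyc_path = py_compile.compile("yourscript.py")  # placeholder filename
print(pyc_path)  # e.g. __pycache__/yourscript.cpython-39.pyc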

jeevcat
  • Riiight, so it depends on the specific library and how it was built. In theory if I wanted to create my own efficient Python library I could build it in another language and then write a wrapper for it, as you mentioned. Do you have any idea if the 'widely used' python libraries are written like this? How might I check? – darkace Sep 27 '16 at 11:31
  • If you look in the [Install docs for NumPy](http://docs.scipy.org/doc/numpy-1.10.1/user/install.html#building-from-source), you can see that many of the NumPy modules need a C or FORTRAN compiler. AFAIK many scientific and mathematical libraries use lower-level languages, including much of the SciPy stack. – jeevcat Sep 27 '16 at 11:48

Low-Level Compiled Languages and Performance

The answers by @hpaulj and @jeevcat are correct.

But the story of whether Python is compiled is more complex.

First, it is true that well-written C++ code is far faster than well-written Python code, and compiled code generally does allow for faster calculations.

But the reason is not that the code is compiled, per se. It's that these compiled languages are typically also lower-level languages that let you manipulate memory directly, avoid garbage collection, etc. Moreover, to allow for Python's dynamism and simplicity, everything in Python is an object. A Python list, for instance, is an object holding a list of references to other objects "scattered" throughout memory. This is (obviously) less computationally efficient than one memory block with all the list's values next to each other.
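You can see this difference in memory layout from Python itself (a minimal sketch of my own):

import sys

import numpy as np

py_list = [1.0, 2.0, 3.0, 4.0]
np_arr = np.array(py_list)

# The list stores pointers; each float is a separate boxed heap object
print(sys.getsizeof(py_list))                  # size of the pointer container
print(sum(sys.getsizeof(x) for x in py_list))  # plus the boxed floats themselves

# The array stores raw 8-byte doubles packed contiguously: 4 * 8 bytes
print(np_arr.nbytes)  # 32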

And, as the other answers mention, the Python code in these libraries just calls (talks to) this other, more efficient C code.
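You can even catch NumPy doing this (a small sketch; np.add is just one convenient compiled callable):

import inspect

import numpy as np

# np.add is a ufunc: a callable object implemented in C, not a Python function
print(type(np.add))  # <class 'numpy.ufunc'>

# Compiled callables have no Python source for inspect to retrieve
try:
    inspect.getsource(np.add)
except TypeError:
    print("np.add has no Python source; it lives in compiled code")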

Is Python Compiled?

But there is a more interesting question: is Python compiled or not? A few people may unwittingly claim that it is not. This is not strictly true. Any time you import a package or module, it is invisibly compiled to bytecode and saved, if it has not been compiled already. (You will likely not even notice the compilation happening.)

You can see this for yourself: any .pyc file (a file ending in .pyc instead of .py) is a compiled Python file. Try opening a .pyc file in an editor or via cat; you'll see it is a binary file that looks like gibberish.
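If you want something more readable than raw binary, the standard library's dis module disassembles bytecode into a human-readable listing (a minimal sketch; greet is just a throwaway function):

import dis

def greet(name):
    return "hello, " + name

# Print the CPython bytecode instructions the interpreter actually runs
dis.dis(greet)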

Looking at the Invisible Creation of Compiled Python Code

How is compiled Python code created?

Let's say that you have the following folder structure:

❯ tree -L 1
.
├── __pypackages__ # This is a folder, the rest are files
├── addressbook.proto
├── addressbook_pb2.py
├── pdm.lock
├── protobuf-python-3.17.3.tar.gz
├── pyproject.toml
└── readme.txt

(The above structure contains the Python Google Protocol Buffers example, using the modern PDM package manager's layout.)

We can see that the only Python module (file) is addressbook_pb2. So, let's import that file:

❯ python
Python 3.9.7 (default, Oct 13 2021, 06:45:31) 
[Clang 13.0.0 (clang-1300.0.29.3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import addressbook_pb2
>>>   [exit out of Python]
❯ 

I did nothing except quickly import the module addressbook_pb2.py and exit. But just that simple import created an entire compiled-code folder called __pycache__ with the compiled module in it:

❯ tree -L 1
.
├── __pypackages__
├── __pycache__ # This is the folder that was auto-generated
├── addressbook.proto
├── addressbook_pb2.py
├── pdm.lock
├── protobuf-python-3.17.3.tar.gz
├── pyproject.toml
└── readme.txt

Now we'll look to see what is in that __pycache__ folder:

❯ ll __pycache__ # `ll` is a common alias for `ls -al`
total 8
drwxr-xr-x   3 mikewilliamson  staff    96B Oct 30 21:43 .
drwxr-xr-x  34 mikewilliamson  staff   1.1K Oct 30 21:43 ..
-rw-r--r--   1 mikewilliamson  staff   3.2K Oct 30 21:43 addressbook_pb2.cpython-39.pyc
❯ 

Notice that the file addressbook_pb2.cpython-39.pyc is in there. The stem is the module's name (addressbook_pb2), but it also carries the .cpython-39.pyc suffix. This tells us a few things:

  1. It is compiled code: that's what the .pyc extension at the end means.
  2. It was compiled by cpython-39, meaning the CPython "flavor" of Python (the most ubiquitous), version 3.9.
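You don't even have to import a module to predict that filename. The standard library will tell you where CPython caches the bytecode for any source file (a minimal sketch):

import importlib.util

# Where CPython caches the bytecode for a given source file; the tag
# (cpython-39 above) depends on the interpreter flavor and version
print(importlib.util.cache_from_source("addressbook_pb2.py"))
# e.g. __pycache__/addressbook_pb2.cpython-39.pyc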
Mike Williamson

Python can execute both functions written in Python (interpreted) and compiled functions. There are whole API docs about writing code that integrates with Python; Cython is one of the easier tools for doing this.
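As a tiny illustration of Python calling straight into compiled code, the standard library's ctypes can load a shared library and call its functions directly (a minimal sketch; it assumes a POSIX system where find_library("c") resolves to the system C library):

import ctypes
import ctypes.util

# Load the system C library and call its compiled abs() function;
# the library's filename varies by OS, hence the find_library lookup
libc = ctypes.CDLL(ctypes.util.find_library("c"))
print(libc.abs(-7))  # 7, computed in compiled C code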

Libraries can be any combination: pure Python, Python plus interfaces to compiled code, or all compiled. The interpreted files end in .py; the compiled pieces are usually .so or .dll (depending on the operating system). Pure Python code is easy to install: just download it, unzip if needed, and put it in the right directory. Mixed code requires a compilation step (and hence a C compiler, etc.) or downloading a version with prebuilt binaries.
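One way to check for yourself how much of an installed library is compiled (this also answers the "How might I check?" comment above) is to look for extension-module files inside its package directory. A minimal sketch, using numpy as the example:

import pathlib

import numpy

# List a few of the compiled extension modules shipped inside numpy's
# installed package directory (.so on POSIX; look for .pyd on Windows)
pkg = pathlib.Path(numpy.__file__).parent
for ext in sorted(pkg.rglob("*.so"))[:5]:
    print(ext.relative_to(pkg))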

Typically, developers get the code working in Python and then rewrite the speed-sensitive portions in C. Or they find an external library of working C or Fortran code and link to that.

numpy and scipy are mixed. They have lots of Python code and compiled core portions, and they use external libraries. And the C code can be extraordinarily hard to read.

As a numpy user, you should first try to get as much clarity and performance as you can from Python code. Most numpy optimization questions on SO discuss ways of making use of numpy's compiled functionality - all the operations that work on whole arrays. Only when you can't express your operations as efficient whole-array numpy code do you need to resort to a tool like cython or numba.

In general, if you find yourself iterating over elements extensively, you are working at the slow, interpreted level. Either replace the loops with whole-array operations (see the sketch below), or rewrite the loop in cython.
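For instance, here is the shape of that replacement (a minimal sketch of mine; both versions compute the same sum of squares):

import numpy as np

a = np.random.rand(1_000_000)

# Slow: a Python-level loop, one dynamic dispatch per element
total = 0.0
for x in a:
    total += x * x

# Fast: the same reduction as one whole-array operation in compiled code
total = float(np.dot(a, a))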

hpaulj