
Problem:

My program receives a list of (requirements_123.txt, program_123.py) pairs (actually a list of script lines like `pip install a==1 b==2 c==3 && python program_123.py`).

My program needs to run each program in an isolated virtual environment based on the current environment.

Requirements:

  • Current environment is not modified
  • Program environment is based on the current environment
  • No reinstalling of the packages from the current env. It's slow, and it does not really work (package sources might be missing, build tools might be missing). No `pip freeze | pip install`, please.
  • Fast. Copying gigabytes of files from the current environment to a new environment every time is too slow. Symlinking might be OK as a last resort.

Ideal solution: I set some environment variables for each program, pointing to a new virtual environment dir, and then just execute the script and pip does the right thing.

How can I do this?

What do I mean by "overlay": Python already has some "overlays". There are system packages and user packages. User packages "shadow" the system packages, but non-shadowed system packages are still visible to the programs. When pip installs the packages in the user directory, it does not uninstall the system package version. This is the exact behavior I need. I just need a third overlay layer: "system packages", "user packages", "program packages".
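
To make the overlay idea concrete, here is roughly the shape I am hoping for. This is only a sketch, not something I have verified end to end: it repoints the existing "user packages" layer instead of adding a genuinely new one, and it only works when the current environment is a regular Python installation, because the user site-packages layer is disabled inside virtual environments. The overlay path is illustrative.

# Hypothetical sketch of the "program packages" layer:
export PYTHONUSERBASE=/tmp/overlay_123   # where the per-program layer should live
export PIP_USER=1                        # make a plain `pip install` behave like `pip install --user`

pip install a==1 b==2 c==3               # installs into the overlay dir; the current env is untouched
python program_123.py                    # sees the overlay packages first, then the current env's packages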

Related questions (but they do not consider the user dir packages, only the virtual environments):

  • "Cascading" virtual Python environnements
  • Is it possible to create nested virtual environments for python?

P.S.

Regarding the comment "If pip freeze doesn't even work, you have much larger problems lurking":

There are many reasons why the result of pip freeze > requirements.txt does not work in practice:

  • System packages installed using apt.
  • Packages installed from package indexes other than PyPI (PyTorch does that). The conda-package-handling package is not on PyPI at all.
  • Conda packages.
  • Packages built from source some time ago (and your compilers are different now).
  • Installs from git or zip/whl files.
  • Editable installs.

I've just checked a default notebook instance in Google Cloud and almost half of the pip freeze list looks like this:

threadpoolctl @ file:///tmp/tmp79xdzxkt/threadpoolctl-2.1.0-py3-none-any.whl
tifffile @ file:///home/conda/feedstock_root/build_artifacts/tifffile_1597357726309/work

Also, packages like conda-package-handling are not even on PyPI.

Anyway, this is just one of the many reasons why `pip freeze | pip install` does not work in practice.

Ark-kun

Comments:
    ....why? If these are independent programs, just run each in their own venv? That's literally what a virtual environment is for? – Mike 'Pomax' Kamermans Feb 20 '23 at 22:54
    I do not think it is feasible. Maybe try with the *conda* ecosystem, as far as I understood if you have 2 environments with the same library, it still has only 1x the disk space footprint. -- Otherwise, maybe you can do things with symlinks and/or with `.pth` files, but it is a bit complicated. -- I am pretty sure this question has been asked multiple times here before, maybe there are some better ideas in the answers. – sinoroc Feb 21 '23 at 09:51
  • So every time your program runs, you need to install a different set of packages, which may not be compatible with each other? Or do you just install your "master program" once? Why does it have to be fast? Are you creating some sort of custom CI / automated testing system...? – Niko Föhr Feb 21 '23 at 09:57
    https://stackoverflow.com/q/74436125 -- https://stackoverflow.com/q/50953575 -- https://stackoverflow.com/q/61019081 – sinoroc Feb 21 '23 at 09:58
  • @Mike'Pomax'Kamermans I cannot recreate/clone the user's env. `pip freeze | pip install -` does not work in practice. Installed packages are not buildable or not installable. I do not know how they were installed, but they work in the original env. – Ark-kun Feb 22 '23 at 01:28
  • @np8 Imagine that the user env has some big, harder-to-install packages like Tensorflow that I want to avoid reinstalling every time. The requirements.txt might have a few smaller packages like pandas, etc. Here is what the system really is: it's an ML pipelines system where everything naturally runs in isolated containers. But some people do not have Docker installed. People ask to make it possible to run the same without Docker. We cannot just run the program in an empty venv naively, since some packages like TF come from the base image. The base user env "emulates" the base container image. – Ark-kun Feb 22 '23 at 01:36
  • Since Python already has system packages, user packages and sys.path, and pip has --prefix, I'm almost certain it's possible to have a third package directory. Trick pip into installing to a new directory as if it were the user packages directory and then use that dir in sys.path. – Ark-kun Feb 22 '23 at 01:37
  • One solution could be to use a standalone python installation (e.g. [python-build-standalone](https://github.com/indygreg/python-build-standalone)) with common large packages (e.g. Tensorflow) installed there, and use venvs created with the [`--system-site-packages`](https://docs.python.org/3/library/venv.html) flag. In virtual environments, use `-I` or `-U` flag when installing the (pruned) dependencies. – Niko Föhr Feb 22 '23 at 07:39
    Honestly, it sounds like you need to fix that setup, not go "how do I keep working with this broken system". If `pip freeze` doesn't even work, you have much larger problems lurking. – Mike 'Pomax' Kamermans Feb 22 '23 at 15:58
  • ...So the environment you want to link to is in a Docker image, but the user doesn't have Docker installed? How's their Python interpreter supposed to access those packages at all? – CrazyChucky Feb 23 '23 at 13:42
  • @Mike'Pomax'Kamermans "If pip freeze doesn't even work, you have much larger problems lurking": pip freeze works. pip install, on the other hand, is not as reliable even in normal situations. I already covered this in the question: system packages installed using `apt`; packages installed from different package indexes, not PyPI (PyTorch does that); conda packages; packages built from source some time ago (and your compilers are different now); installs from git or zip/whl files; editable installs. There are all sorts of scenarios. – Ark-kun Apr 22 '23 at 00:08
  • @CrazyChucky The high-level user view is: "User: This component uses the 'tensorflow/tensorflow:latest' base container image. I already have TF installed locally. The component should just use the package I have." – Ark-kun Apr 22 '23 at 00:14
  • So you've made tools available to ease the installation and use of consistent environments. Some of your users don't want to use those tools, but also don't want to deal with the resulting complexities themselves. They want to have their cake and eat it too, don't they? – CrazyChucky Apr 22 '23 at 00:43
  • @CrazyChucky Yes. I want people to be able to use my product. As much as I like containers, the strongest feedback I have received is that some users do not want to install Docker or cannot install it (no root access). I target the Notebook users. For example, people who use Colab or some managed Jupyter instances cannot access Docker even when it's installed since the notebook itself runs inside a container and the Docker socket is not enabled. BTW, here is the product I'm building: https://cloud-pipelines.net/sdk/orchestration/interactive/#examples – Ark-kun Apr 22 '23 at 01:59
  • Ohh, I had the mistaken impression that this was like... some sort of in-house tool, and your users were at the same company or org. This makes more sense now. – CrazyChucky Apr 22 '23 at 02:34

1 Answer


You can add a .pth file (a site module feature) to the site-packages directory of your derived virtual environment, containing a line that points to the site-packages path of your base virtual environment.

In shell, you can do it like this:

# Assumes that the base virtual environment exists, activate it.
. base/bin/activate

# Create the derived virtual environment.
python -m venv ./derived

# Make the derived virtual environment import base's packages too.
base_site_packages="$(python -c 'import sysconfig; print(sysconfig.get_paths()["purelib"])')"
derived_site_packages="$(./derived/bin/python -c 'import sysconfig; print(sysconfig.get_paths()["purelib"])')"
echo "$base_site_packages" > "$derived_site_packages"/_base_packages.pth

base_site_packages is usually base/lib/python<VERSION>/site-packages; the code to get it is taken from https://stackoverflow.com/a/46071447/3063 (same for derived_site_packages).
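
As a quick sanity check (optional, purely illustrative), you can print the derived interpreter's sys.path and confirm that both site-packages directories show up:

# Both the derived and the base site-packages should be listed.
./derived/bin/python -c 'import sys; print("\n".join(p for p in sys.path if "site-packages" in p))'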

The packages installed in the base environment will be available in the derived environment. You can verify this by doing pip list in the derived environment.

# Deactivating the base environment is optional,
# meaning that the derived environment can be activated directly too.
deactivate

. ./derived/bin/activate
pip list

To install your custom Python packages and run your script in the custom environment, you don't necessarily need to activate the derived environment. You can call the derived Python environment's pip and python directly and it should just work:

./derived/bin/pip install a==1 b==2 c==3
./derived/bin/python program_123.py
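
If you need to do this for every (requirements, program) pair from the question, the same steps can be wrapped in a small helper. This is only a sketch: the function name, the throwaway temporary directory, and the lack of cleanup are illustrative, and it assumes the base environment's python is the one currently on PATH.

# Sketch: create a throwaway derived environment, overlay it on the current one, install, run.
run_isolated() {
    local requirements="$1" program="$2"
    local derived
    derived="$(mktemp -d)"

    # Derived venv based on the currently active python.
    python -m venv "$derived"

    # Same .pth trick as above: expose the base site-packages to the derived environment.
    local base_sp derived_sp
    base_sp="$(python -c 'import sysconfig; print(sysconfig.get_paths()["purelib"])')"
    derived_sp="$("$derived"/bin/python -c 'import sysconfig; print(sysconfig.get_paths()["purelib"])')"
    echo "$base_sp" > "$derived_sp"/_base_packages.pth

    # Install the per-program requirements into the derived venv only, then run the program.
    "$derived"/bin/pip install $requirements   # word-splitting on purpose: "a==1 b==2 c==3"
    "$derived"/bin/python "$program"
}

run_isolated "a==1 b==2 c==3" program_123.py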
palotasb