
I've noticed that when Python packages are installed with different package managers, they normally end up in /home/user/anaconda3/envs/env_name/ when using conda, and in /home/user/anaconda3/envs/env_name/lib/python3.6/site-packages/ when using pip inside a conda environment.

But conda caches all the recently downloaded packages too.

So, my question is: why doesn't conda install all packages in a central location and, when a package is installed into a specific environment, create a link to that location rather than installing another copy there?

I've noticed that environments grow quite big and that this method would probably be able to save a bit of space.

lahsuk

1 Answer


Conda already does this. However, because it leverages hardlinks, it is easy to overestimate the space really being used, especially if one only looks at the size of a single env at a time.

To illustrate, let's use du to inspect the real disk usage. First, if I measure each environment directory in a separate du run, I get the uncorrected per-env usage:

$ for d in envs/*; do du -sh $d; done
2.4G    envs/pymc36
1.7G    envs/pymc3_27
1.4G    envs/r-keras
1.7G    envs/stan
1.2G    envs/velocyto

which is what it might look like from a GUI.

Instead, if I let du count them together (i.e., correcting for the hardlinks), we get

$ du -sh envs/*
2.4G    envs/pymc36
326M    envs/pymc3_27
820M    envs/r-keras
927M    envs/stan
548M    envs/velocyto

One can see that a significant amount of space is already being saved here.
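The difference between the two du invocations can be reproduced on a toy layout. The following is only a sketch: the directory and file names (`pkgs`, `env_a`, `env_b`, `lib.bin`) are made up for illustration and have nothing to do with how conda names things. It assumes a du that counts a hardlinked file once per invocation, as the output above suggests.

```shell
#!/usr/bin/env bash
# Toy layout: one "package cache" and two "environments" (all names hypothetical).
demo=$(mktemp -d)
mkdir "$demo/pkgs" "$demo/env_a" "$demo/env_b"

# One 5 MiB file in the cache, hardlinked (not copied) into both environments.
head -c 5242880 /dev/urandom > "$demo/pkgs/lib.bin"
ln "$demo/pkgs/lib.bin" "$demo/env_a/lib.bin"
ln "$demo/pkgs/lib.bin" "$demo/env_b/lib.bin"

# Separate du runs: each run counts the file in full.
separate_kb=0
for d in "$demo"/env_a "$demo"/env_b; do
    kb=$(du -sk "$d" | awk '{print $1}')
    separate_kb=$((separate_kb + kb))
done

# A single du run: the file is charged only to the first directory listed.
together_kb=$(du -sk "$demo"/env_a "$demo"/env_b | awk '{sum += $1} END {print sum}')

echo "separate runs: ${separate_kb} KiB, single run: ${together_kb} KiB"
# Remove the scratch directory afterwards with: rm -rf "$demo"
```

The single-run total should come out at roughly half of the separate-run total, because all three paths refer to the same data on disk.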

Most of the hardlinks go back to the pkgs directory, so if we include that as well:

$ du -sh pkgs envs/*
8.2G    pkgs
400M    envs/pymc36
116M    envs/pymc3_27
 92M    envs/r-keras
 62M    envs/stan
162M    envs/velocyto

one can see that, outside of the shared packages, the envs are fairly light. If you're concerned about the size of my pkgs, note that I have never run conda clean on this system, so my pkgs directory is full of tarballs and superseded packages, plus some infrastructure I keep in base (e.g., Jupyter, Git, etc.).
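If you want a rough idea of how much of a given env is shared, one option is to count the files whose link count is greater than one. This is a minimal sketch; `count_links` is a hypothetical helper name, and a link count above one only shows that the file has another name somewhere on the same file system, not that the other name lives in pkgs.

```shell
# Report how many regular files under a directory have more than one hard link.
count_links() {
    dir=$1
    shared=$(find "$dir" -type f -links +1 | wc -l)
    total=$(find "$dir" -type f | wc -l)
    # Arithmetic expansion strips any padding that wc adds on some systems.
    echo "$((shared)) of $((total)) files have more than one hard link"
}
```

For example, `count_links envs/pymc36` would report the figures for the first env listed above.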

merv
  • Can I ask why the size of your envs changes once you include `pkgs`? – Tian Sep 24 '19 at 02:03
  • @Tian sure. That's because `pkgs` is the central repository for package code and much of what is hardlinked goes back to there. Everything goes into there first and gets linked out to the envs whenever possible. – merv Sep 24 '19 at 02:21
  • @merv For me, `for d in envs/*; do du -sh $d; done` and `du -sh envs/*` show the same result (both show 3~5GB for each env). Why is that? Does it mean hardlinks are never used? (I'm using `miniconda`, and the `conda` version is `4.8.3`.) – user3595632 Jan 19 '21 at 09:32
  • @user3595632 Possibly. Is your package cache on a different file system? For example, check `conda config --show pkgs_dirs envs_dirs`. There are also other config settings related to linking behavior that may be worth checking. – merv Jan 19 '21 at 16:00
  • Since different environments are likely to be created at different times, would it be accurate to say that package versions may differ between environments, thus limiting the actual sharing of files? – user2153235 May 19 '23 at 20:06
  • @user2153235 correct - there is no underlying mechanism to reuse existing packages if newer ones are available. One can use `conda create --clone` or create new environments using dumped environment definitions (`conda env export`, then `conda env create`) to exactly reuse another environment's packages. – merv May 19 '23 at 23:17
  • @merv: I don't quite follow that last comment. If package versions differ between environments, I wouldn't expect sharing of package files. If package versions are the same, however, it seems that hard links are used to share files. However, I expect package versions to often differ because environments aren't necessarily created at the same time. This makes it hard to share files. I was just hoping to get confirmation on this. Thanks. – user2153235 May 20 '23 at 17:38
  • Yes, I'm confirming your reasoning. The additional comments are just to say that there are ways to force using the same versions across environments - though they're not great. – merv May 21 '23 at 03:58
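Regarding the question in the comments about the package cache being on a different file system: hard links cannot span file systems, so a quick check is to compare where the two directories are mounted. The sketch below is one possible way to do that with `df -P`; `same_fs` is a hypothetical helper name, and comparing the device and mount-point columns is a heuristic that can misfire on mount points containing spaces.

```shell
# Succeed if both paths appear to live on the same mounted file system.
same_fs() {
    # df -P prints one header line, then one line per path;
    # keep the device (first field) and the mount point (last field).
    a=$(df -P "$1" | awk 'NR==2 {print $1, $NF}')
    b=$(df -P "$2" | awk 'NR==2 {print $1, $NF}')
    [ "$a" = "$b" ]
}
```

You would pass it the directories reported by `conda config --show pkgs_dirs envs_dirs`, for example `same_fs ~/anaconda3/pkgs ~/anaconda3/envs && echo same || echo different` (the paths here are assumed from the question and may differ on your system).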