
I have the following environment.yml file. It is taking 1.5 hours to create this environment. How can I improve (or debug) the creation time?

name: test_syn_spark_3_3_1
channels:
  - defaults
  - conda-forge
dependencies:
  - python=3.10
  - pandas=1.5
  - pip=23.0
  - pyarrow=11.0.0
  - pyspark=3.3.1
  - setuptools=65.0
  - pip:
      - azure-common==1.1.28
      - azure-core==1.26.1
      - azure-datalake-store==0.0.51
      - azure-identity==1.7.0
      - azure-mgmt-core==1.3.2
      - azure-mgmt-resource==21.2.1
      - azure-mgmt-storage==20.1.0
      - azure-storage-blob==12.16.0
      - azure-mgmt-authorization==2.0.0
      - azure-mgmt-keyvault==10.1.0
      - azure-storage-file-datalake==12.11.0
      - check-wheel-contents==0.4.0
      - pyarrowfs-adlgen2==0.2.4
      - wheel-filename==1.4.1
– Aravind Yarram

1 Answer


Switch the channel order and use Mamba. Specifically, pyspark=3.3.1 is only available from Conda Forge, so the conda-forge channel should come first to avoid masking issues when channel_priority: strict is set. Mamba is also faster than Conda, gives clearer error reporting, and its maintainers are very responsive.
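
To verify whether strict channel priority is in effect, you can inspect and set it with conda config:

conda config --show channel_priority          ## check the current setting
conda config --set channel_priority strict    ## optionally enforce strict priority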

test_syn_spark_3_3_1.yaml

name: test_syn_spark_3_3_1
channels:
  - conda-forge
  - defaults
# rest the same...

Create with Mamba (or micromamba):

## install mamba if needed
## conda install -n base -c conda-forge mamba
mamba env create -n test_syn_spark_3_3_1 -f test_syn_spark_3_3_1.yaml

This runs in a few minutes on my machine, most of which is download time.
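
If you prefer Micromamba, which is a standalone binary that doesn't need a base environment, the equivalent should be roughly:

## micromamba reads the same YAML spec file
micromamba create -n test_syn_spark_3_3_1 -f test_syn_spark_3_3_1.yaml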


Other Thoughts

  1. I wouldn't pin pip or setuptools to exact versions unless you are working around a specific bug. At minimum, I'd loosen them to lower bounds.
  2. Conda Forge is fully self-sufficient these days, so I would not only drop defaults but also insulate against any channel mixing from user configuration with the nodefaults directive.
  3. The defaults channel prefers MKL for BLAS on x64, whereas Conda Forge defaults to OpenBLAS, so you may want to declare your preference explicitly (e.g., accelerate on macOS arm64, mkl on Intel); see the sketch after this list.
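
For instance, Conda Forge's blas metapackage exposes the implementation through build-string selectors; pick the line matching your hardware (the comments are my reading of the typical choices):

  - blas=*=mkl         ## Intel x64
  - blas=*=openblas    ## portable default
  - blas=*=accelerate  ## macOS arm64 (M1/M2)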

In summary, this is how I would write the YAML:

name: test_syn_spark_3_3_1
channels:
  - conda-forge
  - nodefaults    # insulate from user config
dependencies:
  ## Python Core
  - python=3.10
  - pip >=23.0
  - setuptools >=65.0

  ## BLAS
  ## adjust for hardware/preference
  - blas=*=mkl

  ## Conda Python pkgs
  - pandas=1.5
  - pyarrow=11.0.0
  - pyspark=3.3.1
  
  ## PyPI pkgs
  - pip:
    - azure-common==1.1.28
    - azure-core==1.26.1
    - azure-datalake-store==0.0.51
    - azure-identity==1.7.0
    - azure-mgmt-core==1.3.2
    - azure-mgmt-resource==21.2.1
    - azure-mgmt-storage==20.1.0
    - azure-storage-blob==12.16.0
    - azure-mgmt-authorization==2.0.0
    - azure-mgmt-keyvault==10.1.0
    - azure-storage-file-datalake==12.11.0
    - check-wheel-contents==0.4.0
    - pyarrowfs-adlgen2==0.2.4
    - wheel-filename==1.4.1
– merv
  • First time I heard about mamba. Is it compatible with conda? – Aravind Yarram May 05 '23 at 20:44
  • It is a compiled reimplementation of Conda (plus some extras). Everything is still Conda environments, just with a faster frontend. For DevOps work, there is also Micromamba, which is a fully standalone tool for creating Conda environments. – merv May 05 '23 at 20:49
  • How do we know which BLAS to use for Apple M2 Pro? – Aravind Yarram May 05 '23 at 20:51
  • I use `accelerate` on my M1 (the API is the same for the M2). There is a little benchmarking [in this thread](https://stackoverflow.com/q/70240506/570918). – merv May 05 '23 at 20:58