I'm trying to run automatic machine learning model benchmarking code from the pycaret Python package, which in turn uses scikit-learn, among others.

However, my conda environment seems to run the scikit-learn dependency installed in the .local folder in my own home directory, rather than from my conda environment's packages.

That is not what I expected, and it causes the pycaret code to crash because the interface of the loaded scikit-learn is not the one pycaret expects.

I've found a way to change that behavior by editing sys.path (see below), but I don't understand why my conda environment apparently prefers .local folder's packages first, instead of the installed conda environment packages.

I have checked the sys.path of a conda environment created at a different organization and there the order is not like the above, so it clearly prefers the conda environment packages first.

This behavior causes runtime errors, and I don't want to edit sys.path in every Jupyter notebook that I start here at this organization. Can someone tell me which configuration sets this behavior, so that I understand it and can avoid editing sys.path every time?

A minimal code example of what I'm running is:

import pandas as pd
from pycaret.regression import *

df_basetable = pd.read_csv('df_basetable.csv')
random_seed = 14
regr_exp1 = setup(
    data=df_basetable[df_basetable["split_level1"]=="full train"], 
    target="my_prediction_target", 
    ignore_features=["customer_id"],
    numeric_features=[col for col in df_basetable.columns if col not in ["my_prediction_target", "customer_id"]],
    test_data=df_basetable[df_basetable["split_level1"]=="validation"],
    fold_strategy = 'kfold',
    fold=5,
    fold_shuffle=True,
    n_jobs=5,
    session_id=random_seed, # for reproducibility
)

That results in the following error:

File ~/.local/lib/python3.9/site-packages/sklearn/base.py:211, in BaseEstimator.get_params(self, deep)
    209 out = dict()
    210 for key in self._get_param_names():
--> 211     value = getattr(self, key)
    212     if deep and hasattr(value, "get_params") and not isinstance(value, type):
    213         deep_items = value.get_params().items()

AttributeError: 'Simple_Imputer' object has no attribute 'fill_value_categorical'

What strikes me is that the interpreter is crashing on File ~/.local/lib/python3.9/site-packages/sklearn/base.py:211: I expected to see the path listed here of the sklearn package installed on the conda environment that I'm using.
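A quick way to confirm which copy of a package would be picked up, without waiting for a traceback, is to ask the import system where it resolves the name. The following is a minimal sketch using only the standard library; the helper name `module_origin` is made up for illustration, and `json` stands in for `sklearn` so the snippet runs anywhere.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# In the notebook one would call module_origin("sklearn"); "json" is used
# here only because it is always available in the standard library.
print(module_origin("json"))
```

If the printed path for `sklearn` starts with `~/.local`, the user-level install is shadowing the one in the conda environment.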

I am running the above python code within a Jupyter Lab notebook with the conda environment activated that I want to use (it's called model_dashboard). Proof of that is:

import sys
print(sys.executable)

That prints '/applis/xyz/.envs/model_dashboard/bin/python'.

The python paths involved however, are the following:

import sys
sys.path

===>

['/applis/abc/notebooks',
 '/applis/xyz/.envs/model_dashboard/lib/python39.zip',
 '/applis/xyz/.envs/model_dashboard/lib/python3.9',
 '/applis/xyz/.envs/model_dashboard/lib/python3.9/lib-dynload',
 '',
 '/home/users/a12345/.local/lib/python3.9/site-packages',
 '/applis/xyz/.envs/model_dashboard/lib/python3.9/site-packages']

That seems weird to me. I would expect at least to see the packages from my conda environment ( '/applis/xyz/.envs/model_dashboard/lib/python3.9/site-packages') appearing before my .local python install's packages ('/home/users/a12345/.local/lib/python3.9/site-packages'), not after. I'm even wondering why the .local packages are in the python path at all, I think I don't really need this.
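For context, the standard library `site` module is what adds the per-user site-packages directory to `sys.path` at interpreter startup: when user site is enabled and the directory exists, it is added ahead of the environment's own site-packages. Setting the `PYTHONNOUSERSITE` environment variable disables it. Whether that variable (or simply the absence of a `~/.local` directory) explains the difference between the two organizations is an assumption, not something confirmed here. A small sketch to inspect the relevant settings from inside a notebook, with the hypothetical helper name `user_site_report`:

```python
import os
import site
import sys


def user_site_report():
    """Collect the settings that decide whether ~/.local ends up on sys.path."""
    user_site = site.getusersitepackages()
    return {
        "enable_user_site": site.ENABLE_USER_SITE,
        "user_site": user_site,
        "user_site_on_path": user_site in sys.path,
        "PYTHONNOUSERSITE": os.environ.get("PYTHONNOUSERSITE"),
    }


for key, value in user_site_report().items():
    print(f"{key}: {value}")
```

Running this in both environments and comparing the output should show whether user site is enabled in one and disabled in the other.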

So I tried restarting the kernel, then putting the .local last instead:

['/applis/abc/notebooks',
 '/applis/xyz/.envs/model_dashboard/lib/python39.zip',
 '/applis/xyz/.envs/model_dashboard/lib/python3.9',
 '/applis/xyz/.envs/model_dashboard/lib/python3.9/lib-dynload',
 '',
 '/applis/xyz/.envs/model_dashboard/lib/python3.9/site-packages',
 '/home/users/a12345/.local/lib/python3.9/site-packages'
]

When printing sys.path, I see that the modification has worked: .local will now provide packages only as the last option.

With the kernel restarted and the above sys.path changes, when I run the pycaret code again, I no longer have the sklearn error.
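The reordering described above can be sketched as follows. This is an illustration, not necessarily the exact code I ran; the function name `demote_user_site` is made up, and the example list simply reuses the paths printed earlier.

```python
def demote_user_site(path, user_site):
    """Return a copy of path with user_site moved to the end."""
    kept = [entry for entry in path if entry != user_site]
    if len(kept) == len(path):
        return kept  # user_site was not on the path; nothing to move
    return kept + [user_site]


# Example with the paths shown in the question.
user_site = "/home/users/a12345/.local/lib/python3.9/site-packages"
original = [
    "/applis/abc/notebooks",
    "/applis/xyz/.envs/model_dashboard/lib/python39.zip",
    "/applis/xyz/.envs/model_dashboard/lib/python3.9",
    "/applis/xyz/.envs/model_dashboard/lib/python3.9/lib-dynload",
    "",
    user_site,
    "/applis/xyz/.envs/model_dashboard/lib/python3.9/site-packages",
]
reordered = demote_user_site(original, user_site)
print(reordered[-1])  # the .local directory is now the last entry
```

In a notebook this would be applied with `sys.path[:] = demote_user_site(sys.path, site.getusersitepackages())` in the first cell, before anything imports sklearn, since modules that are already imported are not reloaded when `sys.path` changes.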

So, good news, but a question remains: what is the .local folder meant for (I suppose it is just the default Python installation), and what configuration causes a system to prefer .local packages over a conda environment's packages in every Jupyter notebook that I start? It must be some configuration that I don't know about, since at a different organization the .local path was not even in `sys.path` when I printed it in a Jupyter notebook there.

I'd like to modify this configuration so I don't need to modify sys.path in every notebook.

Sander Vanden Hautte
  • Am I correct to read this as a "why was this design decision made?" question, as opposed to a "how do I fix this technical problem?" question? (Second-guessing projects' design decisions or going into their history is not really something our format is well suited to, and there's [a history of closing questions of that variety](https://meta.stackexchange.com/questions/170394)) – Charles Duffy Dec 08 '22 at 16:09
  • ...that said, providing a last-level high-priority local override is not at all uncommon practice for layered configuration resolution _in general_, without even looking at specific projects like conda. There's a reason that, say, systemd configuration prioritizes content in `/run` over `/var`, `/var` over `/etc` and `/etc` over `/usr`; the more transient and locally-controlled something is, the more likely it is to reflect user intent of the person doing on-the-ground administration of the specific host at hand. Big-picture general principle, not anything really conda-specific. – Charles Duffy Dec 08 '22 at 16:10
  • Granted, "local" can mean different things: In your case, you'd prefer that "local to the project" was preferred over "local to the host". The right way to make that happen would have been to be in the room when the argument was being fought. As it is, someone else's opinions prevailed -- perhaps because they had experience with situations where they needed a host-specific library override (it happens sometimes, particularly if your library is dealing with something like hardware interfacing), perhaps because the counterargument was never entered into discussion. – Charles Duffy Dec 08 '22 at 16:13
  • ...but if we're just going to be guessing over what someone else was thinking, that's pretty thoroughly in the "opinion-based" bucket. A more constructive approach would be to engage with conda's developers directly to ask if they'd consider changing the behavior going forward. – Charles Duffy Dec 08 '22 at 16:15
  • Hi @CharlesDuffy, my question was based on seeing that for another environment, set up at a different organization, the Python path list is ordered in the way that I expected, whereas for this environment, created at my current organization, the path list is ordered differently and therefore causes errors. So it's more of a "how does the path come into existence in this way" question (what is the configuration, unknown to me, that creates such a path order), so that, once I understand it, I can work around the library problems now and in the future. – Sander Vanden Hautte Dec 08 '22 at 16:19
  • Adding to that: ideally I'd not like to modify the python path in every notebook that I'm creating, so I want to understand what causes this, and how I can work around it. – Sander Vanden Hautte Dec 08 '22 at 16:20
  • I clarified my question for the point above. – Sander Vanden Hautte Dec 08 '22 at 16:43
  • Does the info in this post help? https://stackoverflow.com/questions/62352699/conda-uses-local-packages – nigh_anxiety Dec 08 '22 at 17:01
  • Apologies about taking some days to catch up on comments -- I agree that with the edits, you're now asking an actionable question. Coming to withdraw my close-as-opinion-based vote, though, I see that the question is closed as a duplicate. Do you think the specific linked duplicate covers it adequately? – Charles Duffy Dec 11 '22 at 22:35
  • Thanks for your comments. No problem, it also took me some time to get back to this, after my end-of-year tasks changed a bit. Actually the question that helped most was the one linked to the duplicate banner on top here: https://stackoverflow.com/questions/70958434/unexpected-python-paths-in-conda-environment. Thanks to you and the duplicate link poster for the input! – Sander Vanden Hautte Dec 22 '22 at 09:33

0 Answers