I'm trying to run automatic machine learning model benchmarking code from the pycaret Python package, which in turn uses scikit-learn, among others.
However, my conda environment seems to load the scikit-learn dependency installed in the .local folder in my home directory, rather than the one from my conda environment's packages.
That is not what I expected, and it causes the pycaret code to crash because the interface of the loaded scikit-learn is not the interface it expects.
I've found a way to change that behavior by editing sys.path (see below), but I don't understand why my conda environment apparently prefers the .local folder's packages over the packages installed in the conda environment.
I have checked the sys.path of a conda environment created at a different organization and there the order is not like the above, so it clearly prefers the conda environment packages first.
This behavior causes runtime errors, and I don't want to edit sys.path in every Jupyter notebook that I start at this organization. Can someone tell me which configuration sets this behavior, so that I understand it and can avoid having to edit sys.path every time?
A minimal code example of what I'm running is:
import pandas as pd
from pycaret.regression import *

df_basetable = pd.read_csv('df_basetable.csv')
random_seed = 14

regr_exp1 = setup(
    data=df_basetable[df_basetable["split_level1"] == "full train"],
    target="my_prediction_target",
    ignore_features=["customer_id"],
    # every column except the target and the id is treated as numeric
    numeric_features=[col for col in df_basetable.columns if col not in ["my_prediction_target", "customer_id"]],
    test_data=df_basetable[df_basetable["split_level1"] == "validation"],
    fold_strategy='kfold',
    fold=5,
    fold_shuffle=True,
    n_jobs=5,
    session_id=random_seed,  # for reproducibility
)
That results in the following error:
File ~/.local/lib/python3.9/site-packages/sklearn/base.py:211, in BaseEstimator.get_params(self, deep)
209 out = dict()
210 for key in self._get_param_names():
--> 211 value = getattr(self, key)
212 if deep and hasattr(value, "get_params") and not isinstance(value, type):
213 deep_items = value.get_params().items()
AttributeError: 'Simple_Imputer' object has no attribute 'fill_value_categorical'
What strikes me is that the interpreter is crashing in ~/.local/lib/python3.9/site-packages/sklearn/base.py:211: I expected to see the path of the sklearn package installed in the conda environment that I'm using.
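A quick way to confirm which copy of a package the interpreter would pick up, without waiting for a traceback, is to ask the import system for its location. This is only a minimal sketch; the helper name module_origin is illustrative and not part of any library:

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be loaded from, or None if it is not found."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec is not None else None


# Prints the path of the sklearn that would be imported (None if it is not installed).
print(module_origin("sklearn"))
```

If the printed path starts with ~/.local rather than with the conda environment's prefix, the user-level copy is the one shadowing the environment's copy.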
I am running the above Python code in a JupyterLab notebook with the conda environment that I want to use activated (it's called model_dashboard).
Proof of that is:
import sys
print(sys.executable)
That prints '/applis/xyz/.envs/model_dashboard/bin/python'.
The Python paths involved, however, are the following:
import sys
sys.path
===>
['/applis/abc/notebooks',
'/applis/xyz/.envs/model_dashboard/lib/python39.zip',
'/applis/xyz/.envs/model_dashboard/lib/python3.9',
'/applis/xyz/.envs/model_dashboard/lib/python3.9/lib-dynload',
'',
'/home/users/a12345/.local/lib/python3.9/site-packages',
'/applis/xyz/.envs/model_dashboard/lib/python3.9/site-packages']
That seems weird to me. I would expect at least to see the packages from my conda environment ('/applis/xyz/.envs/model_dashboard/lib/python3.9/site-packages') appearing before the packages in my .local folder ('/home/users/a12345/.local/lib/python3.9/site-packages'), not after. I'm even wondering why the .local packages are on the Python path at all; I don't think I need them.
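One thing that can be inspected here is the standard library's site module, which handles a per-user site-packages directory. Whether that mechanism is what adds the .local entry in my setup is an assumption on my part; the sketch below only reports the relevant settings, and the helper name user_site_info is illustrative:

```python
import os
import site


def user_site_info():
    """Collect the settings related to the per-user site-packages directory."""
    return {
        # True if the user site directory is enabled, False or None if it is disabled
        "enable_user_site": site.ENABLE_USER_SITE,
        # Path of the per-user site-packages directory for this interpreter
        "user_site": site.getusersitepackages(),
        # Environment variable that, when set to a non-empty value, disables the user site directory
        "PYTHONNOUSERSITE": os.environ.get("PYTHONNOUSERSITE"),
    }


for key, value in user_site_info().items():
    print(f"{key}: {value}")
```

Comparing this output between the two organizations might show whether the difference lies in these settings.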
So I tried restarting the kernel and then moving the .local entry to the end of sys.path:
['/applis/abc/notebooks',
'/applis/xyz/.envs/model_dashboard/lib/python39.zip',
'/applis/xyz/.envs/model_dashboard/lib/python3.9',
'/applis/xyz/.envs/model_dashboard/lib/python3.9/lib-dynload',
'',
'/applis/xyz/.envs/model_dashboard/lib/python3.9/site-packages',
'/home/users/a12345/.local/lib/python3.9/site-packages'
]
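The reordering can be done along these lines. This is a sketch rather than the exact code I ran; the helper name move_user_site_last and the "/.local/" marker are illustrative choices:

```python
import sys


def move_user_site_last(paths, marker="/.local/"):
    """Return a new list with every entry containing `marker` moved to the end."""
    kept = [p for p in paths if marker not in p]
    moved = [p for p in paths if marker in p]
    return kept + moved


# Update sys.path in place so that later imports see the new order.
sys.path[:] = move_user_site_last(sys.path)
```

This has to run before the affected packages are imported, since modules that are already loaded are not re-resolved.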
When printing sys.path, I see that the modification has worked: .local will now provide packages only as a last resort.
With the kernel restarted and the above sys.path changes, when I run the pycaret code again, I no longer have the sklearn error.
So, good news, but a question remains: what is the .local folder meant for (I suppose it is just the default Python installation), and what configuration causes a system to prefer .local packages over a conda environment's packages in every Jupyter notebook that I start? It must be some configuration that I don't know about, since at a different organization the .local path was not even in sys.path when I printed it in a Jupyter notebook there.
I'd like to modify this configuration so I don't need to modify sys.path in every notebook.