
I am interested in testing out many of the internal classes and functions defined within sklearn (e.g. maybe adding a print statement to the tree builder so I can see how the tree got built). However, since many of the internals are written in Cython, I want to learn the best practices and workflow for testing these functions in a Jupyter notebook.

For example, I managed to import the Stack class from the tree._utils module. I was even able to construct it, but I was unable to call any of its methods. Any thoughts on what I should do in order to call and test cdef classes and their methods from Python?

%%cython 
from sklearn.tree import _utils
s = _utils.Stack(10)
print(s.top())
# AttributeError: 'sklearn.tree._utils.Stack' object has no attribute 'top'
B.Mr.W.

1 Answer


There are some problems which must be solved before the C-interfaces of the internal classes can be used.

First problem (skip if your sklearn version is >=0.21.x):

Until version 0.21.x, sklearn used implicit relative imports (as in Python 2), so compiling it with Cython's language_level=3 (the default in IPython3) does not work. For versions < 0.21.x, language_level=2 must be set (i.e. %cython -2); or, better yet, scikit-learn should be updated.

Second problem:

We need to include the path to the numpy headers. Let's take a look at a simpler version:

%%cython 
from sklearn.tree._tree cimport Node
print("loaded")

which fails with the unhelpful error "command 'gcc' failed with exit status 1" - but the real reason can be seen in the terminal, where gcc writes its error message (and not in the notebook):

fatal error: numpy/arrayobject.h: No such file or directory compilation terminated.

_tree.pxd uses numpy-API and thus we need to provide the location of numpy-headers.

That means we need to add include_dirs=[numpy.get_include()] to the Extension definition. With the %%cython magic there are two ways to do it. The first is the -I option:

%%cython -I <path from numpy.get_include()>
...
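The include path itself comes from numpy. A quick check from plain Python (assuming numpy is installed) confirms that the header gcc could not find actually lives below that directory:

```python
import os
import numpy

# the directory to pass to "%%cython -I" (or to include_dirs=[...])
inc = numpy.get_include()
print(inc)

# the header from the gcc error message should be present below it
header = os.path.join(inc, "numpy", "arrayobject.h")
print(os.path.exists(header))  # True
```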

or, a somewhat dirtier trick: the %%cython magic adds the numpy include directory automatically whenever it sees the string "numpy" in the cell, so a comment like

%%cython 
# requires numpy headers
... 

is enough.
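Putting this together, the simpler Node example from above becomes the following notebook cell (a sketch; it is only runnable inside IPython/Jupyter with the cython extension loaded, and it still assumes a sklearn version whose pxd-files are installed, see below):

```python
%%cython
# requires numpy headers   <- the word "numpy" triggers adding the include dir
from sklearn.tree._tree cimport Node
print("loaded")
```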

Last but not least:

Note: since 0.22 this is no longer an issue, as the pxd-files are included in the installation (see this).

The pxd-files must be present in the installation for us to be able to cimport them. This is the case for the pxd-files of the sklearn.tree subpackage, as one can see in its local setup.py-file (judging from this PR, this seems to be a more or less random decision without a strategy behind it):

...
config.add_data_files("_criterion.pxd")
config.add_data_files("_splitter.pxd")
config.add_data_files("_tree.pxd")
config.add_data_files("_utils.pxd")
...

but not for some other Cython extensions, in particular not for the sklearn.neighbors subpackage. That is a problem for your example:

%%cython 
# requires numpy headers 
from sklearn.tree._utils cimport Stack
s = Stack(10)
print(s.top())

fails to be cythonized, because _utils.pxd cimports data structures from the neighbors/*.pxd files:

...
from sklearn.neighbors.quad_tree cimport Cell
...

which are not present in the installation.
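Whether a given subpackage ships its pxd-files can be checked directly from Python. A small stdlib-only helper (installed_pxd_files is a hypothetical name, not part of sklearn):

```python
import glob
import importlib
import os

def installed_pxd_files(package_name):
    """Return the names of .pxd files shipped in an installed package's directory."""
    pkg = importlib.import_module(package_name)
    pkg_dir = os.path.dirname(pkg.__file__)
    return sorted(os.path.basename(p)
                  for p in glob.glob(os.path.join(pkg_dir, "*.pxd")))

# e.g. installed_pxd_files("sklearn.tree") should list _tree.pxd, _utils.pxd, ...
# while for the affected versions installed_pxd_files("sklearn.neighbors")
# comes back empty
```

An empty result for a subpackage you try to cimport from is exactly the situation described above.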

The situation is described in more detail in this SO-post; as described there, your options to build are:

  • copy the pxd-files into the installation
  • reinstall from the downloaded source with pip install -e
  • reinstall from the downloaded source after adjusting the corresponding local setup.py-files.
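The first option, copying the pxd-files from a source checkout into the installed package, can be scripted. A minimal sketch (copy_pxd is a hypothetical helper; the example paths must be adapted to your checkout and installation):

```python
import os
import shutil

def copy_pxd(src_dir, dst_dir):
    """Copy every .pxd file from a source-checkout directory into the
    corresponding installed package directory; return the copied names."""
    copied = []
    for name in sorted(os.listdir(src_dir)):
        if name.endswith(".pxd"):
            shutil.copy(os.path.join(src_dir, name), os.path.join(dst_dir, name))
            copied.append(name)
    return copied

# e.g. (paths are illustrative):
# import sklearn.neighbors
# copy_pxd("scikit-learn/sklearn/neighbors",
#          os.path.dirname(sklearn.neighbors.__file__))
```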

Another option is to ask the developers of sklearn to include the pxd-files in the installation, so that not only building but also distributing becomes possible.

ead
  • quite a challenge for me due to my lack of cython and setuptools experience; can you elaborate further on how to resolve the dependencies? I understand that pxd is the definition file and pyx the implementation file, and that _utils.pxd uses the Cell struct from neighbors/quad_tree.pxd. In the three options listed above, when you say installation, do you mean running `python setup.py develop` at the sklearn root directory, or the specific local setup.py of a submodule, say the `sklearn.tree` folder? Shame on me, I have not gotten any of the three options working; did you? – B.Mr.W. Aug 29 '19 at 17:24
  • @B.Mr.W. I'm not sure what your problem is: have you uninstalled your current version, cloned the scikit-learn repository, and followed https://scikit-learn.org/stable/developers/advanced_installation.html#install-bleeding-edge? If there is still a problem, best ask a new question, as comments are not great for troubleshooting. – ead Aug 29 '19 at 21:10
  • one more question before I open up a new thread. I found all the dependencies like `neighbours.Cell` and added them directly to _utils.pxd and _utils.pyx, so in the end there should be no dependency/reference issue. Then I created my own minimal setup.py per your suggestion. It compiles without any problem and I managed to generate a _utils.so file. However, when I use the %%cython magic to run your code, it certainly calls the Stack properly, as it will complain if you don't pass the capacity into the `__cinit__` constructor; however, it cannot find attributes like top. Any thoughts? – B.Mr.W. Sep 04 '19 at 16:32
  • @B.Mr.W. Without a precise [mcve] this is bound to become a series of misunderstandings and loss of everybody’s time, so once again I decline to trouble shoot in comments – ead Sep 04 '19 at 17:47
  • thanks. I was lucky enough to get it working just a few minutes ago: replace `stack = Stack(10)` with `cdef Stack stack = Stack(10)` and your code works. Not quite sure why. – B.Mr.W. Sep 04 '19 at 17:52
  • @B.Mr.W. Thanks. I never know when Cython deduces the type of a variable and when it handles it as a plain object. Using cdef Stack is safer. – ead Sep 04 '19 at 20:20
  • Wouldn't it be interesting to publish an open-source fork of sklearn on GitHub with a complete, intelligently done setup (say, an env file that can be edited so every path fits the local user), so that everyone could start adding a basic print wherever they like, experiment with the Cython side of sklearn, and then genuinely enhance parts of it? This is such a shambles, we cannot lie! Progress is sluggish, due mostly to sklearn's designers, despite being facilitated by folks like you; god bless you – Simon Provost Apr 01 '23 at 16:55