
I want to build a Random Forest Regressor to model count data (Poisson-distributed). The default 'mse' loss function is not suited to this problem. Is there a way to define a custom loss function and pass it to the random forest regressor in Python (sklearn, etc.)?

Is there an existing implementation in any Python package for fitting count data?

vishmay
  • Better to post your question on Stack Exchange: https://stats.stackexchange.com/ – Gaurav Mar 26 '18 at 14:04
  • @GauravS: why? I think this fits Stack Overflow perfectly, as it concerns the specific implementation in libraries and not the concept itself. – Marcus V. Mar 26 '18 at 14:38
  • @MarcusV.: There might be a few folks who have already implemented it. Maybe they haven't contributed it to Python libraries, but they can share the code. – Gaurav Mar 27 '18 at 03:01

3 Answers


In sklearn this is currently not supported. See the discussion in the corresponding issue here, or this one for another class, where the reasons are discussed in a bit more detail (mainly the large computational overhead of calling a Python function).

So, as discussed in those issues, it could be done by forking sklearn, implementing the cost function in Cython, and then adding it to the list of available 'criterion' options.
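For illustration only, here is a rough sketch of what the usage side might look like once such a fork registers a Poisson deviance criterion. The criterion name "poisson" and the synthetic data are assumptions about that hypothetical fork; this does not run against the stock sklearn this answer refers to.

```python
# Hypothetical usage AFTER forking sklearn and registering a Cython Poisson
# deviance criterion under an assumed name ("poisson") -- not stock sklearn.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = rng.poisson(lam=np.exp(X[:, 0]))      # synthetic non-negative counts

reg = RandomForestRegressor(
    n_estimators=200,
    criterion="poisson",                  # assumed name registered by the fork
    random_state=0,
)
reg.fit(X, y)
```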

Marcus V.
  • Thanks for your answer. I am a complete beginner to Cython. Can you please point me to an example implementation or give more details? – vishmay Mar 26 '18 at 15:26
  • Well, the sklearn devs reference some links [here](http://scikit-learn.org/stable/developers/performance.html#performance-tips-for-the-cython-developer) and I have used [this](https://pythonprogramming.net/introduction-and-basics-cython-tutorial/) tutorial as a starter. – Marcus V. Mar 26 '18 at 15:51
  • @Prag1 How did you get on!? :\ – jtlz2 Nov 19 '18 at 09:39
  • The maintainers of sklearn should support custom loss functions, even if there's extra overhead from calling a Python function that slows training down. I care more about being able to experiment flexibly. XGBoost can take a custom objective, and so can PyTorch; it feels archaic that sklearn can't. https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn I would try to implement it myself, but I doubt they'd merge my PR. The orthodoxy is that it's not a good idea, but orthodoxy shouldn't be taken seriously. – Pavel Komarov Jul 12 '21 at 18:18

If the problem is that the counts c_i arise from different exposure times t_i, then one indeed cannot fit the counts directly, but one can still fit the rates r_i = c_i / t_i with the MSE loss function, provided one uses sample weights proportional to the exposures, w_i = t_i.
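A minimal sketch of this workaround with sklearn's RandomForestRegressor (the exposures t and counts c below are synthetic placeholders):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
n = 1000
X = rng.normal(size=(n, 3))
t = rng.uniform(0.5, 5.0, size=n)          # exposure times t_i
c = rng.poisson(lam=t * np.exp(X[:, 0]))   # observed counts c_i

r = c / t                                  # rates r_i = c_i / t_i

rf = RandomForestRegressor(n_estimators=300, random_state=0)
rf.fit(X, r, sample_weight=t)              # weight each sample by its exposure

rate_hat = rf.predict(X)                   # predicted rate; expected count = rate_hat * t
```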

For a true Random Forest Poisson regression, I've seen that R has the rpart library for building a single CART tree, which offers a Poisson regression option. I wish this kind of algorithm had been ported to scikit-learn.


In R, writing a custom objective function is fairly simple.

The randomForestSRC package in R has provision for writing your own custom split rule. The custom split rule, however, has to be written in pure C.

All you have to do is write your own custom split rule, register it, and compile and install the package.

The custom split rule has to be defined in the file called splitCustom.c in the randomForestSRC source code.

You can find more info here.

The file in which you define the split rule is this.