
I would like to use a pretrained xgboost classifier in pyspark, but the nodes on the cluster don't have the xgboost module installed. I can pickle the classifier I have trained and broadcast it, but this isn't enough, as I still need the module to be loaded at each cluster node.

I can't install it on the cluster nodes as I don't have root and there is no shared file system.

How can I distribute the xgboost classifier for use in spark?


I have an egg for xgboost. Could something like http://apache-spark-user-list.1001560.n3.nabble.com/Loading-Python-libraries-into-Spark-td7059.html or https://stackoverflow.com/a/24686708/2179021 work?
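
To make that concrete, here is a rough sketch of what I am hoping might work (the egg path, model file and feature values are just placeholders, and I have not been able to verify that the egg's compiled parts actually load on the executors):

    import pickle
    from pyspark import SparkContext

    sc = SparkContext(appName="xgboost-scoring")

    # Ship the egg so that `import xgboost` resolves on the executors.
    # This is the part I am unsure about for a package with compiled code.
    sc.addPyFile("/path/to/xgboost.egg")

    # Broadcast the pickled, pretrained classifier (a scikit-learn-style
    # XGBClassifier in my case).
    with open("model.pkl", "rb") as f:
        clf = pickle.load(f)
    bc_clf = sc.broadcast(clf)

    def score_partition(rows):
        import xgboost  # must be importable before the broadcast model is unpickled
        model = bc_clf.value
        for features in rows:
            yield model.predict([features])[0]

    features_rdd = sc.parallelize([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
    print(features_rdd.mapPartitions(score_partition).collect())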

Simd
  • Do you have SSH access to the individual machines? Which cluster manager do you use? – zero323 Sep 30 '16 at 20:13
  • @zero323 We use YARN but I don't have ssh access to the machines, sadly. I think what I need to do is find a solution that involves broadcasting the 'egg'. – Simd Sep 30 '16 at 20:43
  • My honest advice is to find the person responsible and press them to either provide you with the required libraries, or with a configurable environment (like Anaconda installations). Having correctly built and configured native dependencies is not only about your comfort, but also about basic performance, and the differences can be quite significant. – zero323 Sep 30 '16 at 21:17
  • @zero323 We do have anaconda installed on each cluster node. Does that potentially help? – Simd Sep 30 '16 at 22:32
  • Well, if you're up for some hacky solutions... (I mean really hacky; I assume you don't mean Anaconda Cluster). So, long story short: create packages as described in the Anaconda docs (if the architecture is heterogeneous I assume you can handle cross-compiling). There are some existing packages as well. When you run the job, just try to import, and if the package is not accessible, install it from the job itself. – zero323 Sep 30 '16 at 23:47
  • The idea is similar to this one http://stackoverflow.com/q/34376323/1560062 – zero323 Sep 30 '16 at 23:47
  • @zero323 Oh right, I just meant the Python package: https://docs.continuum.io/anaconda/ – Simd Oct 01 '16 at 06:52
  • Well, you should still be able to install packages there from a Spark task. Like I said, it is hacky and requires some defensive programming, but it works. – zero323 Oct 01 '16 at 12:38
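
A minimal sketch of the install-from-the-task idea zero323 describes above; the use of pip install --user (rather than a conda channel), the package name and the helper function are assumptions, not something tested on this cluster:

    import importlib
    import site
    import subprocess
    import sys

    def ensure_xgboost():
        """Import xgboost if available; otherwise install it into the
        executor's user site-packages (no root needed) and import again."""
        try:
            return importlib.import_module("xgboost")
        except ImportError:
            # Assumes the executor's Python (e.g. the Anaconda install)
            # can reach PyPI or a local package index.
            subprocess.check_call(
                [sys.executable, "-m", "pip", "install", "--user", "xgboost"]
            )
            # Make the freshly created user site-packages importable in
            # this already-running interpreter.
            site.addsitedir(site.getusersitepackages())
            getattr(importlib, "invalidate_caches", lambda: None)()
            return importlib.import_module("xgboost")

    def score_partition(rows):
        xgb = ensure_xgboost()  # defensive import/install on each executor
        # ... unpickle or look up the broadcast model and predict here ...
        for row in rows:
            yield row

Once a node has done the install, later tasks on that node only pay for the import; with conda-built packages the subprocess call would invoke conda instead of pip.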

1 Answer


There is a really good blog post from Cloudera explaining this matter. All credit goes to them.

But just to answer your question in short: no, it's not possible. Any complex third-party dependency needs to be installed on each node of your cluster and configured properly. For simple modules/dependencies one can create *.egg, *.zip or *.py files and supply them to the cluster with the --py-files flag of spark-submit.
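
To illustrate that flag, here is a minimal driver-side sketch of the same mechanism (file names are placeholders, not anything from your setup):

    from pyspark import SparkConf, SparkContext

    # Driver-side equivalent of:
    #   spark-submit --py-files deps.zip,helpers.py my_job.py
    # The listed files are shipped to every executor and put on the Python
    # path, so `import helpers` works inside tasks. This only covers
    # pure-Python code, not compiled extensions such as xgboost's C++ core.
    conf = SparkConf().setAppName("py-files-example")
    sc = SparkContext(conf=conf, pyFiles=["deps.zip", "helpers.py"])

    # Dependencies can also be added after the context exists:
    sc.addPyFile("more_helpers.py")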

However, xgboost is a numerical package that depends heavily not only on other Python packages, but also on a specific C++ library and compiler, which is low level. If you were to ship compiled code to the cluster, you could run into errors arising from a different hardware architecture. Add to that the fact that clusters are usually heterogeneous in terms of hardware, and shipping a compiled build would be a very bad idea.

bear911
  • Thanks for this. Can you give any more details for the case where the hardware is homogeneous and you have the xgboost egg? – Simd Sep 27 '16 at 20:19
  • Unfortunately, no. I have never used it in such a way. To be completely honest, it is probably not going to work because of the library's complexity. The egg approach might only work for simple packages. On top of that, if you want this to be in production, you need to find another way. If not, then you could probably get access to a cluster and install Python yourself. I would steer away from the egg approach. – bear911 Sep 27 '16 at 20:31