
We use Spark through the PySpark API. I know Numba can make Python code very fast, so it seems like a good fit for our UDFs (user-defined functions), but I'm not sure the Numba decorator still works on the executors, for example with map or mapPartitions.

Can Numba work with PySpark UDFs? Should it still work when the job is sent to the workers (executors)?
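For concreteness, here is a minimal sketch of the kind of usage I have in mind (illustrative only; I have not verified this on a real cluster):

```python
import numpy as np
from numba import njit
from pyspark.sql import SparkSession

@njit
def heavy_computation(arr):
    # Pure-Python/NumPy numeric loop that Numba can compile in nopython mode
    total = 0.0
    for x in arr:
        total += x * x
    return total

def process_partition(rows):
    # Runs on an executor; 'rows' is an iterator over one partition
    values = np.fromiter((r[0] for r in rows), dtype=np.float64)
    yield heavy_computation(values)

spark = SparkSession.builder.getOrCreate()
rdd = spark.sparkContext.parallelize([(float(i),) for i in range(1000)], 4)
result = rdd.mapPartitions(process_partition).collect()
```

The open question is whether the `@njit` wrapper survives being serialized as part of `process_partition`'s closure and shipped to the executors.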

idan ahal
  • Please read the docs: https://numba.readthedocs.io/en/stable/user/5minguide.html#will-numba-work-for-my-code (literally the first page of the user guide), https://numba.readthedocs.io/en/stable/reference/pysupported.html and https://numba.readthedocs.io/en/stable/reference/numpysupported.html. The answer is generally no for external libraries. – Jérôme Richard Jun 13 '22 at 09:46
  • It seems you didn't understand the question; I'll edit it. The functions are implemented in pure Python and NumPy, but I'm not sure that after the functions are sent to the executors with mapPartitions (for example), they will still work and compile. – idan ahal Jun 13 '22 at 11:14
  • OK, so there is no reason for this not to work, as long as PySpark does not do something special with the functions. The Numba functions must be compiled locally (the Numba bytecode is not portable). This post may help: https://numba.discourse.group/t/numba-and-pyspark-users/1318/4 . – Jérôme Richard Jun 13 '22 at 12:29
  • The problem was not so much about clarity as research effort. People regularly ask what is supported in Numba without reading the main pages of the doc. This is not so simple here, so let's forget about it this time. Note, however, that a good question should describe what you tried (see [how-to-ask](https://stackoverflow.com/help/how-to-ask)), and note that the above links were found in only a few minutes with a basic search engine. – Jérôme Richard Jun 13 '22 at 16:21
  • There is surprisingly little information about it: a few blog articles, none of which demonstrate using jitted code on a cluster, some GitHub/forum issues, and two YouTube videos about pygdf (GPU acceleration with Numba on a PySpark cluster). Unfortunately, you would need a PySpark cluster for a tested solution to a concrete question. @JérômeRichard – the linked answer uses the legacy RDD API, not the DataFrame API with pandas-UDF support. Nevertheless interesting. – Michael Szczesny Jun 13 '22 at 18:54
  • Exactly. I did read everything mentioned above and saw those YouTube videos. There is no information about this at all, even less when you need CPU acceleration (not GPU). I'll keep trying and publish the results here. – idan ahal Jun 14 '22 at 09:30
  • I think it is better to consider Numba as an independent tool. Otherwise, yes, you will not find a lot of documentation about using both in the same application. AFAIK, there is no reason to believe Numba will behave differently when used with PySpark in the target case. One just needs to make sure that functions are compiled on each target machine (no remote JIT code transfer), which should be the case by default (see the sketch after these comments). Numba functions are just like any compiled CPython function (they are a wrapper around the JIT-compiled function defined by llvmlite, which might potentially be a problem, though). – Jérôme Richard Jun 14 '22 at 17:25
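Following the last comment, one way to be sure compilation happens locally on every executor is to apply the decorator inside the function shipped to the workers, so only plain Python source is serialized. This is a sketch under that assumption, not a verified recipe:

```python
import numpy as np

def process_partition(rows):
    # Import and JIT-compile inside the executor process, so no compiled
    # code ever has to be transferred from the driver
    from numba import njit

    @njit
    def heavy_computation(arr):
        total = 0.0
        for x in arr:
            total += x * x
        return total

    values = np.fromiter((r[0] for r in rows), dtype=np.float64)
    yield heavy_computation(values)
```

The trade-off is that the inner function is recompiled for every partition it processes; defining the jitted function at module level in a module distributed to the workers (e.g. via spark-submit's --py-files) would amortize compilation to roughly once per Python worker process.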

0 Answers