Nifi: Cannot import pyspark in ExecuteScript processor

Question

I need to use ExecuteScript in NiFi to transpose columns, and I am using pyspark to do that.

But the processor reports: "failed to process due to javax.script.ScriptException: ImportError: No module named pyspark in <script> at line number 1"

I set the Module Directory property of ExecuteScript to the Spark and pyspark paths, like this:

C:\Users\username\Desktop\spark\spark-2.4.3-bin-hadoop2.7\hadoop,
C:\Users\username\Desktop\spark\spark-2.4.3-bin-hadoop2.7\bin\pyspark

But it did not work.

I am afraid this is a very fundamental issue, but I could not figure it out after half a day.

Are you able to run 'normal' python code this way? Are you able to run the script with pyspark manually on all the relevant nodes? — Dennis Jaheruddin, May 27 '19 at 09:38
In the ExecuteScript processor's properties, I set Script Engine to python, Script File to the path of my Python script that imports pyspark, and Module Directory as described above. I followed the examples in this [link](https://community.hortonworks.com/articles/75545/executescript-cookbook-part-2.html) — Micro_Andy, May 28 '19 at 00:58

score 2 · Accepted Answer · answered May 28 '19 at 17:26

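This is likely because the pyspark module is a natively-compiled Python module, and Apache NiFi uses Jython for the python engine in the ExecuteScript processor. Jython cannot load natively-compiled modules, so the import fails regardless of the Module Directory setting. This is a known issue, and the full explanation is here, as well as some work-arounds and details on options.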

The simplest answer is to use ExecuteStreamCommand and pass the necessary flowfile attributes as arguments, and the content as STDIN. The output of the Python script will be returned via STDOUT and captured as the new flowfile content.
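As a rough illustration of that STDIN/STDOUT contract, here is a minimal sketch of a script that ExecuteStreamCommand could invoke. It assumes the flowfile content is CSV and that "column transposition" means swapping rows and columns; the file name `transpose.py` and the function name `transpose_csv` are made up for this example. It uses only the standard library, so it needs a regular Python 3 interpreter on the NiFi host rather than Jython.

```python
# transpose.py (hypothetical name): reads CSV from STDIN, writes the transpose to STDOUT.
import csv
import io
import sys


def transpose_csv(text):
    """Return CSV text with rows and columns swapped.

    Assumes every row has the same number of fields; zip() would
    silently truncate to the shortest row otherwise.
    """
    rows = list(csv.reader(io.StringIO(text)))
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(zip(*rows))
    return out.getvalue()


if __name__ == "__main__":
    # Flowfile attributes passed as command arguments would arrive in sys.argv[1:].
    # The flowfile content arrives on STDIN; whatever is written to STDOUT
    # becomes the content of the outgoing flowfile.
    sys.stdout.write(transpose_csv(sys.stdin.read()))
```

In the processor, Command Path would point at the Python executable and Command Arguments at the script path, followed by any attribute values the script needs. The same structure should work with pandas or pyspark in place of the `csv` module, provided the interpreter being called has those packages installed.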

Thank you. I wrote a Python script using the pandas module and it worked with ExecuteStreamCommand! — Micro_Andy, May 29 '19 at 05:49
