I'm developing a web application that retrieves data from a data lake; the data is stored in HDFS, and I want to use PySpark to perform some analysis. In other words, we have a script in an IPython notebook and we want to use it with Django. I see that PySpark is also available on PyPI, so I installed it with pip and exported the notebook script as a .py file. When I run it as `python myscript.py`, it works fine. Hence, it should also work if I import that script within Django. So, is that the correct method, or will I have to run `spark-submit myscript.py`? I want to use Spark in cluster mode.
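For context, the two launch paths differ in where the driver runs. A hedged sketch of both (the `--master yarn` and `--deploy-mode` values are illustrative assumptions; substitute your cluster manager):

```shell
# Local run: the pip-installed pyspark starts an embedded Spark with
# master defaulting to local[*] -- fine for development, not cluster mode.
python myscript.py

# Cluster run: spark-submit ships the job to the cluster manager.
# The yarn master and cluster deploy mode here are assumptions;
# adjust them for your environment.
spark-submit --master yarn --deploy-mode cluster myscript.py
```

Note that with `--deploy-mode cluster` the driver runs on a cluster node, so a long-lived web process like Django cannot easily hold the SparkSession; for a web app, creating the session in client mode inside the process is the more common pattern.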
Viewed 2,633 times

Faizan Ali
- Did you find a way to run it? I'm stuck with the same problem. – AshrithGande Mar 16 '18 at 07:10
- @AshrithGande Use findspark: https://github.com/minrk/findspark – kinkajou Mar 16 '18 at 09:10
- @AshrithGande https://stackoverflow.com/a/34763240/2214674 – kinkajou Mar 16 '18 at 09:10
- I'm using findspark, but I cannot load my model with `model = RandomForestRegressionModel.load('model/')`. Should I use that spark-submit you mentioned? – Fabio Magarelli Jul 20 '19 at 16:58
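The findspark suggestion above can be sketched as follows. This is a hedched, environment-dependent example, not a verified answer: it assumes a local Spark installation (findspark locates it via `SPARK_HOME` or common install paths), and the app name and master value are illustrative.

```python
# Sketch: letting a plain Python process (e.g. a Django view or startup
# hook) locate Spark without spark-submit. Assumes SPARK_HOME is set or
# Spark is installed in a standard location.
import findspark
findspark.init()  # adds pyspark to sys.path

from pyspark.sql import SparkSession

# "yarn" is an assumed cluster manager; use "local[*]" for development.
spark = (SparkSession.builder
         .appName("django-analysis")   # illustrative app name
         .master("yarn")
         .getOrCreate())
```

With a session created this way, a saved model such as the `RandomForestRegressionModel` in the comment above can be loaded in-process, provided the path is visible to the cluster (e.g. on HDFS), without going through spark-submit.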