I am new to Spark / PySpark and need to integrate it into a pipeline. I have managed to assemble the code that needs to be run in the terminal. Now I would like to execute this code as a script. However, when I run it Python-style with pyspark -c cmds.py, I get
Error: Invalid argument to --conf: cmds.py
I also looked into spark-submit --master local cmds.py, but it returns
File "/path/cmd.py", line 4, in <module>
sparkValues = SQLContext.read.parquet('/a/file/ranks.parquet');
AttributeError: 'property' object has no attribute 'parquet'
What is the easiest solution here?
Here is cmds.py:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
sparkValues = SQLContext.read.parquet('/a/file/ranks.parquet');
pandaValues = sparkValues.toPandas();
pandaValues.to_csv('/a/file/ranks.csv');
There might be a better way to convert the file to CSV, but Python is the easiest for me.
Solved:
This helped me implement the pyspark calls in my Python pipeline, with no need for an external call...
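For anyone hitting the same AttributeError: a minimal sketch of what cmds.py could look like once an actual SQLContext instance is created from a SparkContext (the app name and local master below are my assumptions; the file paths are taken from the question). It can then be run with spark-submit --master local cmds.py.

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# Build a SparkContext and wrap it in an SQLContext instance.
# The original AttributeError came from calling .read on the
# SQLContext class itself instead of on an instance.
conf = SparkConf().setAppName('ranks-to-csv').setMaster('local')  # assumed app name
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# Read the parquet file, convert it to a pandas DataFrame, and write it out as CSV.
sparkValues = sqlContext.read.parquet('/a/file/ranks.parquet')
pandaValues = sparkValues.toPandas()
pandaValues.to_csv('/a/file/ranks.csv')

sc.stop()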