
I am new to Spark / pyspark and need to integrate it into a pipeline. I have managed to assemble the code that needs to be run in the terminal. Now, I would like to execute this code as a script. However, when I run it Python-style with pyspark -c cmds.py I get Error: Invalid argument to --conf: cmds.py. I looked into spark-submit --master local cmds.py but it returns

File "/path/cmd.py", line 4, in <module>
    sparkValues = SQLContext.read.parquet('/a/file/ranks.parquet');
AttributeError: 'property' object has no attribute 'parquet'

What is the easiest solution here? Here's cmds.py

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

sparkValues = SQLContext.read.parquet('/a/file/ranks.parquet');
pandaValues = sparkValues.toPandas();
pandaValues.to_csv('/a/file/ranks.csv');

There might be a better way to convert the file to csv, but Python is the easiest for me.


Solved:

This helped me implement the pyspark calls into my Python pipeline. No need to have an external call...
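For reference, a minimal sketch of what the script could look like once the SQLContext is actually instantiated from a SparkContext (the missing step behind the AttributeError); the app name below is just a placeholder:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# SQLContext.read is a property on instances, not on the class itself,
# so an SQLContext has to be created from a SparkContext first
conf = SparkConf().setAppName("parquet-to-csv")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

sparkValues = sqlContext.read.parquet('/a/file/ranks.parquet')
pandaValues = sparkValues.toPandas()
pandaValues.to_csv('/a/file/ranks.csv')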


1 Answer


I am answering a bit late, but if you are trying this on PySpark 2.0.0, the below might help.

Submit the pyspark code:

spark-submit --master mastername samplecode.py

If you have YARN installed, or if you are using AWS EMR, you don't have to mention the master, as YARN will take care of it.
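Otherwise the master can be passed explicitly, as in the question; for a local run, local[*] uses all available cores:

spark-submit --master local[*] samplecode.py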

The code inside samplecode.py would look something like the below:

# initialize the SparkSession
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession

spark = SparkSession.builder.config(conf=SparkConf()).appName("yourappname").getOrCreate()

# run a query through the session
df = spark.sql("select * from abc")
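Applied to the parquet-to-CSV conversion from the question, the same session could be used along these lines (a sketch reusing the paths from the question):

sparkValues = spark.read.parquet('/a/file/ranks.parquet')
sparkValues.toPandas().to_csv('/a/file/ranks.csv')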