
I am trying to read a text file that lives locally using PySpark, and it is telling me the file does not exist:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sc._conf.setMaster("local[*]")
sc.setLogLevel("DEBUG")
sqlContext = SQLContext(sc)

inpath = 'file:///path/to/file'
input_data = sqlContext.read.text(inpath)

and I get this:

Py4JJavaError: An error occurred while calling o52.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, <hostname>): java.io.FileNotFoundException: File file:/path/to/file does not exist

I understand that you need to change the Spark configuration when you are reading local files while running on a cluster. But this file is sitting on the master node, and it does not need to be distributed across all nodes.

I checked out this question, How to load local file in sc.textFile, instead of HDFS, and tried the suggestion to set sc._conf.setMaster("local[*]"), but that did not help: even after restarting the Spark context and rerunning, it still does not work.

Is there any other setting I can change so that this can work?

makansij

1 Answer


The Spark processes are started when the SparkContext object is created. This means that if you try to set configuration values after you have created it, you are already too late. You should set any configuration values before creating the SparkContext. For example:

from pyspark import SparkConf, SparkContext

conf = SparkConf()
conf = conf.setMaster('local[*]')

sc = SparkContext(conf=conf)  # pass the configuration via the conf keyword argument
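With the context created this way, you can retry the read from the question against it. A minimal sketch, reusing the placeholder path from the question and assuming the file actually exists on the machine running the driver:

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)

# file:// explicitly points Spark at the local filesystem rather than HDFS
input_data = sqlContext.read.text('file:///path/to/file')
input_data.show(5)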

Alternatively, you can set the master either in your spark-defaults.conf file or with the "--master local" command-line parameter when launching Spark with either spark-submit or pyspark.
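For example, something like the following (the exact location of spark-defaults.conf depends on your installation, and my_script.py is just a placeholder name):

# in $SPARK_HOME/conf/spark-defaults.conf
spark.master    local[*]

# or on the command line
spark-submit --master "local[*]" my_script.py
pyspark --master "local[*]"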

Ryan Widmaier