
I have one question: how do I load a local file (not on HDFS, not on S3) with sc.textFile in PySpark? I read this article, copied sales.csv to the master node's local filesystem (not HDFS), and then ran the following:

sc.textFile("file:///sales.csv").count()

but it returns the following error, saying that file:/sales.csv does not exist:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 (TID 10, ip-17x-xx-xx-xxx.ap-northeast-1.compute.internal): java.io.FileNotFoundException: File file:/sales.csv does not exist

I tried file://sales.csv and file:/sales.csv, but both also failed.

I would appreciate any advice on how to load a local file.


Note 1:

  • My environment is Amazon EMR 4.2.0 + Spark 1.5.2.
  • All ports are open.

Note 2:

I confirmed that loading a file from HDFS or S3 works.

Here is the code for loading from HDFS: download the csv, copy it to HDFS in advance, then load it with sc.textFile("/path/at/hdfs"):

import commands

commands.getoutput('wget -q https://raw.githubusercontent.com/phatak-dev/blog/master/code/DataSourceExamples/src/main/resources/sales.csv')
commands.getoutput('hadoop fs -copyFromLocal -f ./sales.csv /user/hadoop/')
sc.textFile("/user/hadoop/sales.csv").count()  # returns 15, the number of lines in the csv file

Here is the code for loading from S3: put the csv file on S3 in advance, then load it with sc.textFile("s3n://path/on/s3") using the "s3n://" prefix:

sc.textFile("s3n://my-test-bucket/sales.csv").count() # also returns "15" 
Taka4Sato

3 Answers


The file read occurs on the executor nodes. In order for your code to work, the file has to be present on all of the nodes.

If the Spark driver program runs on the same machine where the file is located, you could instead read the file on the driver (e.g. with f = open("file").read() in Python) and then call sc.parallelize to convert the file contents into an RDD.
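A minimal PySpark sketch of that suggestion, assuming the file sits at /home/hadoop/sales.csv on the driver node (the path is just an example):

# Read the file on the driver only, then distribute its lines as an RDD.
# Assumes /home/hadoop/sales.csv exists on the driver's local filesystem.
with open("/home/hadoop/sales.csv") as f:
    lines = f.read().splitlines()

rdd = sc.parallelize(lines)
rdd.count()  # number of lines in the file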

facha
  • facha, thank you for the comment. I see why my code failed - the file has to be on all the slave nodes (not only on the cluster's master node)! – Taka4Sato Feb 01 '16 at 17:17

If you're running in clustered mode, you need to copy the file to the same path on a shared filesystem visible to all the nodes; then Spark can read that file. Otherwise you should use HDFS.

I copied the txt file into HDFS and Spark read the file from HDFS.

I copied the txt file onto the shared filesystem of all the nodes, then Spark read that file.

Both worked for me.
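A minimal sketch of the shared-filesystem approach for the question's PySpark setup, assuming sales.csv has already been copied to the same local path (here the hypothetical /data/sales.csv) on the driver and on every worker:

# The file must exist at this exact path on every node, not just the master.
rdd = sc.textFile("file:///data/sales.csv")
rdd.count()  # number of lines in the file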

Raghav
  • Can you post what exactly you did to read the file from HDFS? I am trying rdd = sc.textFile("hdfs://master:54310/cc-news-warc-paths1"), which is of no help. – Ravi Ranjan Jul 31 '17 at 13:35

I had a similar problem to this. facha is correct that the data you are trying to load must be accessible across your cluster (to both the master and the executors).

I believe that in your case the file:/ URI is still trying to load from your Hadoop HDFS, where the file doesn't exist. You can test this with the following command:

hadoop fs -cat yourfile.csv

I solved this problem by writing the file into HDFS and then reading it from HDFS. Here is the code:

// Get a handle to the cluster's default filesystem (HDFS).
val conf = new org.apache.hadoop.conf.Configuration()
val fs = org.apache.hadoop.fs.FileSystem.get(conf)
val filenamePath = new org.apache.hadoop.fs.Path("myfile.json")

// Remove any existing copy of the file.
if (fs.exists(filenamePath)) {
  fs.delete(filenamePath, true)
}

// Write the content to HDFS (`html` holds the JSON text built earlier in my program).
val fin = fs.create(filenamePath)
fin.writeBytes(html)
fin.close()

// Read the file back from HDFS into a DataFrame (`sql` is the SQLContext).
val metOffice = sql.read.json("myfile.json")
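For the PySpark setup in the question, a rough sketch of the same pattern (put the file into HDFS, then read it back), using the hadoop CLI instead of the Java FileSystem API; the paths are examples and sqlContext is the one provided by the pyspark shell:

import commands

# Copy the local JSON file into HDFS, overwriting any previous copy,
# then read it back from HDFS into a DataFrame.
commands.getoutput('hadoop fs -copyFromLocal -f ./myfile.json /user/hadoop/myfile.json')
df = sqlContext.read.json("/user/hadoop/myfile.json")
df.count()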
andrew.butkus