I have a question: how can I load a local file (not on HDFS, not on S3) with sc.textFile in PySpark?
I read this article, copied sales.csv
to the master node's local filesystem (not HDFS), and finally executed the following:
sc.textFile("file:///sales.csv").count()
but it returned the following error, saying file:/sales.csv does not exist:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 (TID 10, ip-17x-xx-xx-xxx.ap-northeast-1.compute.internal): java.io.FileNotFoundException: File file:/sales.csv does not exist
I also tried file://sales.csv
and file:/sales.csv
but both failed as well.
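My guess is that with the file:// scheme the path is resolved on each executor's own local filesystem, where the file does not exist. The only workaround I have found so far is to read the file on the driver with plain Python and hand it to Spark with parallelize. This is a minimal sketch, assuming the file was copied to /home/hadoop/sales.csv on the master (a hypothetical path); it is fine for small files, but it does not answer my sc.textFile question:
# Read the file on the driver with ordinary Python, then parallelize it.
with open('/home/hadoop/sales.csv') as f:
    lines = f.read().splitlines()
sc.parallelize(lines).count()  # returns 15, same as the HDFS/S3 versions below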
It would be very helpful if you could give me advice on how to load a local file with sc.textFile.
Note 1:
- My environment is Amazon EMR (emr-4.2.0) + Spark 1.5.2.
- All ports are open.
Note 2:
I confirmed that loading a file from HDFS or S3 works.
Here is the code for loading from HDFS: download the CSV, copy it to HDFS in advance, then load it with sc.textFile("/path/at/hdfs").
import commands  # Python 2 stdlib module, used to run shell commands
# Download the CSV to the master node, then copy it into HDFS.
commands.getoutput('wget -q https://raw.githubusercontent.com/phatak-dev/blog/master/code/DataSourceExamples/src/main/resources/sales.csv')
commands.getoutput('hadoop fs -copyFromLocal -f ./sales.csv /user/hadoop/')
sc.textFile("/user/hadoop/sales.csv").count()  # returns 15, the number of lines in the CSV file
Here is the code for loading from S3: put the CSV file on S3 in advance, then load it with sc.textFile and the "s3n://" scheme.
sc.textFile("s3n://my-test-bucket/sales.csv").count() # also returns "15"