121

I'm following the great Spark tutorial.

So at 46m:00s I'm trying to load the README.md, but I fail. What I'm doing is this:

$ sudo docker run -i -t -h sandbox sequenceiq/spark:1.1.0 /etc/bootstrap.sh -bash
bash-4.1# cd /usr/local/spark-1.1.0-bin-hadoop2.4
bash-4.1# ls README.md
README.md
bash-4.1# ./bin/spark-shell
scala> val f = sc.textFile("README.md")
14/12/04 12:11:14 INFO storage.MemoryStore: ensureFreeSpace(164073) called with curMem=0, maxMem=278302556
14/12/04 12:11:14 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 160.2 KB, free 265.3 MB)
f: org.apache.spark.rdd.RDD[String] = README.md MappedRDD[1] at textFile at <console>:12
scala> val wc = f.flatMap(l => l.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://sandbox:9000/user/root/README.md
    at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)

How can I load that README.md?

suztomo
Jas

15 Answers

199

Try explicitly specifying sc.textFile("file:///path to the file/"). The error occurs when a Hadoop environment is set.

SparkContext.textFile internally calls org.apache.hadoop.mapred.FileInputFormat.getSplits, which in turn uses org.apache.hadoop.fs.FileSystem.getDefaultUri if the scheme is absent. This method reads the "fs.defaultFS" parameter of the Hadoop conf. If you set the HADOOP_CONF_DIR environment variable, the parameter is usually set to "hdfs://..."; otherwise "file://".
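
For example, from the spark-shell you can check which scheme will be assumed and then force the local filesystem explicitly (a minimal sketch; the README path is the one from the question):

scala> sc.hadoopConfiguration.get("fs.defaultFS")   // e.g. "hdfs://sandbox:9000" when HADOOP_CONF_DIR is set
scala> val f = sc.textFile("file:///usr/local/spark-1.1.0-bin-hadoop2.4/README.md")   // explicit scheme bypasses fs.defaultFS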

suztomo
  • Do you happen to know how to do this with Java? I don't see a method. Finding it very frustrating that there's not an easy way to give a path to load a file from a simple file system. – Brad Ellis Dec 01 '18 at 00:25
  • Answering myself: there is a --file switch that you pass with spark-submit. So the file path can be hard-coded, or however your config is set up for the app, but you also signal that path when you submit so that the executors can see it. – Brad Ellis Dec 01 '18 at 01:20
  • When I specify the path on Windows, why do `file:///C:\\Xiang\\inputfile` and `file:////C:\\Xiang\\inputfile` both work, while `file://C:\\Xiang\\inputfile` does not work in the Java code? How about on Linux? Should the prefix be `file:///` (three slashes) or `file:////` (four slashes)? Does `file:////` also work on Linux? – XYZ Feb 02 '21 at 09:57
  • I checked the source code; it is `static final URI NAME = URI.create("file:///");`, so I suppose it should be hard-coded as `file:///` (three slashes) as the prefix. But I still do not understand why `file:////` (four slashes) also works. – XYZ Feb 02 '21 at 11:12
  • @YuXiang Do you want to add a link to the line of the source code (in GitHub)? – suztomo Feb 02 '21 at 13:09
  • @suztomo, it is here: https://hadoop.apache.org/docs/r2.7.4/api/src-html/org/apache/hadoop/fs/RawLocalFileSystem.html – XYZ Feb 02 '21 at 13:16
  • @suztomo, my error message: `java.lang.IllegalArgumentException: Wrong FS: file://C:\Xiang\cs_hdfs\csByDate\20190822/C:/Xiang/323Bit/bigfoot, expected: file:/// at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645) at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:80) at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:752)` – XYZ Feb 02 '21 at 13:26
  • I’m afraid that I don’t have Windows computer. – suztomo Feb 02 '21 at 22:16
  • How could I check whether a Hadoop environment is set in the Java code? – XYZ Feb 26 '21 at 12:55
26

gonbe's answer is excellent. But I still want to mention that file:/// = ~/../../, not $SPARK_HOME. Hope this saves some time for newbs like me.

zaxliu
  • `file:///` is the root folder of the filesystem as seen by the executing JVM, not two levels above the home folder. The URI format as specified in [RFC 8089](https://tools.ietf.org/html/rfc8089) is `file://hostname/absolute/path`. In the local case the `hostname` (authority) component is empty. – Hristo Iliev Jun 16 '18 at 14:31
22

If the file is located on your Spark master node (e.g., in case of using AWS EMR), then launch spark-shell in local mode first.

$ spark-shell --master=local
scala> val df = spark.read.json("file:///usr/lib/spark/examples/src/main/resources/people.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

scala> df.show()
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

Alternatively, you can first copy the file to HDFS from the local file system and then launch Spark in its default mode (e.g., YARN in case of using AWS EMR) to read the file directly.

$ hdfs dfs -mkdir -p /hdfs/spark/examples
$ hadoop fs -put /usr/lib/spark/examples/src/main/resources/people.json /hdfs/spark/examples
$ hadoop fs -ls /hdfs/spark/examples
Found 1 items
-rw-r--r--   1 hadoop hadoop         73 2017-05-01 00:49 /hdfs/spark/examples/people.json

$ spark-shell
scala> val df = spark.read.json("/hdfs/spark/examples/people.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

scala> df.show()
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+
Joarder Kamal
  • The only answer that tells and shows you how to start in local mode. This one needs more upvotes. – Ted Oct 30 '20 at 00:27
22

While Spark supports loading files from the local filesystem, it requires that the files are available at the same path on all nodes in your cluster.

Some network filesystems, like NFS, AFS, and MapR’s NFS layer, are exposed to the user as a regular filesystem.

If your data is already in one of these systems, then you can use it as an input by just specifying a file:// path; Spark will handle it as long as the filesystem is mounted at the same path on every node.

 rdd = sc.textFile("file:///path/to/file")

If your file isn’t already on all nodes in the cluster, you can load it locally on the driver without going through Spark and then call parallelize to distribute the contents to the workers (see the sketch below).

Take care to put file:// in front and to use "/" or "\" according to your OS.
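
A minimal sketch of that driver-side approach (the path is a placeholder; Source.fromFile reads the file on the driver only, and parallelize then ships the lines to the executors):

import scala.io.Source
val lines = Source.fromFile("/path/to/file").getLines().toList   // plain Scala I/O on the driver
val rdd = sc.parallelize(lines)                                   // distribute the lines across the cluster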

Aklank Jain
  • Is there a way that Spark will automatically copy data from its $SPARK_HOME directory to all computing nodes? Or do you need to do that manually? – Matthias Jan 31 '18 at 09:14
  • Where is the Spark source code handling different file system formats? – Saher Ahwal Mar 07 '18 at 05:01
19

Attention:

Make sure that you run Spark in local mode when you load data from a local path (sc.textFile("file:///path to the file/")), or you will get an error like Caused by: java.io.FileNotFoundException: File file:/data/sparkjob/config2.properties does not exist. This is because executors that run on different workers will not find the file in their local path.
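
A minimal sketch of that setup (the properties path is taken from the error above; local[*] keeps the driver and executors in a single JVM, so they all see the same local filesystem):

$ spark-shell --master local[*]
scala> val props = sc.textFile("file:///data/sparkjob/config2.properties")
scala> props.count()   // forces the read; it fails only if the file really is missing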

Matiji66
  • Can we run spark standalone mode (driver on one node, executors on other nodes) on a local file in the driver? Or should I have the local file present on all the nodes? – sherminator35 Apr 16 '21 at 16:18
13

You just need to specify the path of the file as "file:///directory/file".

example:

val textFile = sc.textFile("file:///usr/local/spark/README.md")
Prasad Khode
Hamdi Charef
10

I have a file called NewsArticle.txt on my Desktop.

In Spark, I typed:

val textFile = sc.textFile("file:///C:/Users/582767/Desktop/NewsArticle.txt")

I needed to change all the \ to / characters in the file path.

To test if it worked, I typed:

textFile.foreach(println)

I'm running Windows 7 and I don't have Hadoop installed.

Gene
6

This happened to me with Spark 2.3, with Hadoop also installed under the common "hadoop" user home directory. Since both Spark and Hadoop were installed under the same common directory, Spark by default considers the scheme as hdfs and starts looking for the input files under hdfs, as specified by fs.defaultFS in Hadoop's core-site.xml. In such cases, we need to explicitly specify the scheme as file:///<absoloute path to file>.
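
For example (a hypothetical path under that shared "hadoop" home directory), the explicit scheme overrides the hdfs default:

scala> val rdd = sc.textFile("file:///home/hadoop/input/myfile.txt")   // read from the local filesystem, not HDFS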

Binita Bharati
5

This has been discussed on the Spark mailing list; please refer to this mail.

You should use hadoop fs -put <localsrc> ... <dst> to copy the file into HDFS:

${HADOOP_COMMON_HOME}/bin/hadoop fs -put /path/to/README.md README.md
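
After the put, a relative path is resolved against your HDFS home directory (/user/root in the question), so the original command should work as-is:

scala> val f = sc.textFile("README.md")   // now resolves to hdfs://sandbox:9000/user/root/README.md
scala> f.count()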
Nan Xiao
1

I tried the following and it worked from my local file system. Basically, Spark can read from local, HDFS and AWS S3 paths.

listrdd=sc.textFile("file:////home/cloudera/Downloads/master-data/retail_db/products")
Pavel Smirnov
BigData-Guru
0

This is the solution for the error I was getting on a Spark cluster hosted in Azure on a Windows cluster:

Load the raw HVAC.csv file and parse it using the function:

data = sc.textFile("wasb:///HdiSamples/SensorSampleData/hvac/HVAC.csv")

We use (wasb:///) to allow Hadoop to access the Azure Blob Storage file, and the three slashes are a relative reference to the running node's container folder.

For example, if the path for your file in File Explorer in the Spark cluster dashboard is:

sflcc1\sflccspark1\HdiSamples\SensorSampleData\hvac

The path breaks down as follows: sflcc1 is the name of the storage account; sflccspark is the cluster node name.

So we refer to the current cluster node name with the relative three slashes.

Hope this helps.

Mostafa
0

If you're trying to read a file from HDFS, try setting the path in SparkConf:

 val conf = new SparkConf().setMaster("local[*]").setAppName("HDFSFileReader")
 conf.set("fs.defaultFS", "hdfs://hostname:9000")
VIJ
  • Please add 4-space/tab indentation to your code so that it gets formatted as code. Best regards – YakovL Sep 19 '17 at 08:34
0

You do not have to use sc.textFile(...) to convert local files into dataframes. One option is to read a local file line by line and then transform it into a Spark Dataset. Here is an example for a Windows machine in Java:

import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;
import static org.apache.spark.sql.types.DataTypes.StringType;
import static org.apache.spark.sql.types.DataTypes.createStructField;

// Schema of the resulting dataframe
StructType schemata = DataTypes.createStructType(
        new StructField[]{
                createStructField("COL1", StringType, false),
                createStructField("COL2", StringType, false),
                ...
        }
);

String separator = ";";
String filePath = "C:\\work\\myProj\\myFile.csv";
SparkContext sparkContext = new SparkContext(new SparkConf().setAppName("MyApp").setMaster("local"));
JavaSparkContext jsc = new JavaSparkContext(sparkContext);
SQLContext sqlContext = SQLContext.getOrCreate(sparkContext);

// Read the local file line by line on the driver
List<String[]> result = new ArrayList<>();
try (BufferedReader br = new BufferedReader(new FileReader(filePath))) {
    String line;
    while ((line = br.readLine()) != null) {
        String[] vals = line.split(separator);
        result.add(vals);
    }
} catch (Exception ex) {
    System.out.println(ex.getMessage());
    throw new RuntimeException(ex);
}

// Distribute the parsed lines and build the dataframe
JavaRDD<String[]> jRdd = jsc.parallelize(result);
JavaRDD<Row> jRowRdd = jRdd.map(RowFactory::create);
Dataset<Row> data = sqlContext.createDataFrame(jRowRdd, schemata);

Now you can use the dataframe data in your code.

Andrushenko Alexander
0

Reading a local file in Apache Spark. This worked for me:

var a = sc.textFile("/home/omkar/Documents/text_input").flatMap(line => line.split(" ")).map(word => (word, 1));
Omkar Gaikwad
-7

try

val f = sc.textFile("./README.md")
Soumya Simanta
  • `scala> val f = sc.textFile("./README.md") 14/12/04 12:54:33 INFO storage.MemoryStore: ensureFreeSpace(81443) called with curMem=164073, maxMem=278302556 14/12/04 12:54:33 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 79.5 KB, free 265.2 MB) f: org.apache.spark.rdd.RDD[String] = ./README.md MappedRDD[5] at textFile at :12 scala> val wc = f.flatMap(l => l.split(" ")).map(word => (word, 1)).reduceByKey(_ + _) org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://sandbox:9000/user/root/README.md at` – Jas Dec 04 '14 at 17:55
  • Can you do a `pwd` on the bash shell `bash-4.1#` – Soumya Simanta Dec 04 '14 at 17:58
  • bash-4.1# pwd /usr/local/spark-1.1.0-bin-hadoop2.4 – Jas Dec 04 '14 at 18:05
  • This works for me on spark without hadoop/hdfs. However, it doesn't seem to be working for the OP, as it gave them an error dump. – Paul Jul 09 '15 at 02:12