
My Spark Master needs to read a file in order. Here is what I am trying to avoid (in pseudocode):

if file-path starts with "hdfs://"
    Read via HDFS API
else
    Read via native FS API

I think the following would do the trick, letting Spark deal with distinguishing between local/HDFS:

JavaSparkContext sc = new JavaSparkContext(new SparkConf());
List<String> lines = sc.textFile(path).collect();

Is it safe to assume that `lines` will be in order, i.e. that `lines.get(0)` is the first line of the file, `lines.get(1)` is the second line, etc.?

If not, any suggestions on how to avoid explicitly checking FS type?
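
In case the order is not guaranteed, one way to make it explicit rather than assumed is `zipWithIndex`, which tags each line with its position in the file. A minimal sketch, reusing `sc` and `path` from the snippet above (the explicit sort is purely defensive):

import java.util.List;
import java.util.stream.Collectors;
import scala.Tuple2;

// zipWithIndex pairs each line with its position in the file, so the
// final order does not depend on how collect() concatenates partitions.
List<Tuple2<String, Long>> indexed = sc.textFile(path).zipWithIndex().collect();
List<String> lines = indexed.stream()
        .sorted((a, b) -> Long.compare(a._2(), b._2()))
        .map(Tuple2::_1)
        .collect(Collectors.toList());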

AntsySysHack
  • Not sure what the best way to do this would be. The `textFile` method does not guarantee the order; however, since you collect the data directly, it may be that the order is always preserved. A safer way would be `wholeTextFiles` (a sketch follows these comments); see here: https://stackoverflow.com/questions/47129950/spark-textfile-vs-wholetextfiles – Shaido Jan 04 '18 at 03:09
  • `lines.get(0)` will always be the first line of the file. `textFile` internally uses Hadoop's `TextInputFormat`, which reads the file line by line; line endings are denoted by `\n` or `\r\n`. The order of lines in the RDD will be the same as the order of lines in your file. – philantrovert Jan 04 '18 at 06:46
  • I see two contradicting replies here, and couldn't understand much about the order guarantee of `textFile` (ignoring `wholeTextFiles` for now). Any hints/info about the order guarantee of `textFile` if the file is partitioned? – drk Jan 24 '20 at 13:59
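
As a follow-up to the `wholeTextFiles` suggestion in the comments, a minimal sketch, again reusing `sc` and `path`; it assumes `path` names a single file, since `wholeTextFiles` returns one (path, content) pair per file and each file's entire content must fit in memory:

import java.util.Arrays;
import java.util.List;

// Each file arrives as a single (path, content) string pair, so line
// order within the file is trivially preserved.
String content = sc.wholeTextFiles(path).values().first();
List<String> lines = Arrays.asList(content.split("\\r?\\n"));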

0 Answers