4

Say I do something like the following:

val rdd = sc.textFile("someFile.txt")
val rddWithLines = rdd.zipWithIndex

Would the indices added by zipWithIndex correspond to the line numbers in the input file (first line being 0, of course)? Or does the order get broken in this case?

pythonic
  • That should work: "For example, if you read a file (sc.textFile) the lines of the RDD will be in the order that they were in the file." from [this answer](http://stackoverflow.com/a/29301258/2661491) – evan.oman Nov 30 '16 at 21:20

2 Answers

8

zipWithIndex is a map-only transformation (it doesn't shuffle) so order will be correct. You can safely use it here.
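As a quick sanity check of those semantics, here is a minimal sketch using plain Scala collections to stand in for the RDD (file name and contents are made up for illustration): since `sc.textFile` reads partitions in file order and `zipWithIndex` assigns indices in partition order, the result matches the collection `zipWithIndex` on the file's lines, with the first line getting index 0.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files

// Write a small throwaway file (hypothetical contents, for illustration only).
val path = Files.createTempFile("someFile", ".txt")
Files.write(path, "first\nsecond\nthird".getBytes(StandardCharsets.UTF_8))

// Read the lines back and pair each with its index, mirroring what
// sc.textFile(...).zipWithIndex produces when partitions are in order.
val lines = scala.io.Source.fromFile(path.toFile).getLines().toList
val withIndex = lines.zipWithIndex
// withIndex is List(("first", 0), ("second", 1), ("third", 2))
```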

-2

SparkContext.textFile can create multiple partitions for each file. If these partitions are in the correct order you should get the correct result. See this answer for more information.

Daniel Shields
  • This answer is not correct. zipWithIndex does exactly what the OP wants, regardless of partitioning. – Tim Dec 01 '16 at 01:04
  • I agree as long as the partitions are in the correct order. – Daniel Shields Dec 01 '16 at 17:13
  • How would they be out of order? Ordered partitions is a guarantee of the HadoopRDD class which `sc.textFile` uses. – Tim Dec 01 '16 at 17:51
  • @TimP Would the ordering be preserved even in case of reading multiple files through sc.textFile? – girip11 Jun 21 '17 at 17:40
  • Loading multiple text files with `sc.textFile` is just like loading each one separately with `sc.textFile` and unioning the results. The partitions from each file will be ordered. – Tim Jun 21 '17 at 20:14
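The multi-file behavior described above can be sketched with plain Scala collections standing in for the per-file RDDs (file contents here are hypothetical): concatenating the files' lines in file order and then indexing yields contiguous indices across the union, which is what `zipWithIndex` on the unioned RDD produces when each file's partitions are ordered.

```scala
// Hypothetical contents of two input files, in the order they are loaded.
val fileA = List("a1", "a2")
val fileB = List("b1")

// Union the lines in file order, then index: indices run contiguously
// across both files, just as with sc.textFile("fileA,fileB").zipWithIndex.
val combined = (fileA ++ fileB).zipWithIndex
// combined is List(("a1", 0), ("a2", 1), ("b1", 2))
```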