4

Say I do something like the following:

val rdd = sc.textFile("someFile.txt")
val rddWithLines = rdd.zipWithIndex

Would the indices added by zipWithIndex correspond to the line numbers in the input file (first line being 0, of course)? Or does the order get broken in this case?

pythonic
  • That should work: "For example, if you read a file (sc.textFile) the lines of the RDD will be in the order that they were in the file." from [this answer](http://stackoverflow.com/a/29301258/2661491) – evan.oman Nov 30 '16 at 21:20

2 Answers

8

zipWithIndex is a map-only transformation (it doesn't shuffle) so order will be correct. You can safely use it here.
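As a quick sanity check of those semantics, here is a minimal sketch using plain Scala collections to stand in for the RDD (file name and contents are made up for illustration): since `sc.textFile` reads partitions in file order and `zipWithIndex` assigns indices in partition order, the result matches the collection `zipWithIndex` on the file's lines, with the first line getting index 0.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files

// Write a small throwaway file (hypothetical contents, for illustration only).
val path = Files.createTempFile("someFile", ".txt")
Files.write(path, "first\nsecond\nthird".getBytes(StandardCharsets.UTF_8))

// Read the lines back and pair each with its index, mirroring what
// sc.textFile(...).zipWithIndex produces when partitions are in order.
val lines = scala.io.Source.fromFile(path.toFile).getLines().toList
val withIndex = lines.zipWithIndex
// withIndex is List(("first", 0), ("second", 1), ("third", 2))
```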

-2

SparkContext.textFile can create multiple partitions for each file. If these partitions are in the correct order you should get the correct result. See this answer for more information.

Daniel Shields
  • This answer is not correct. zipWithIndex does exactly what the OP wants, regardless of partitioning. – Tim Dec 01 '16 at 01:04
  • I agree as long as the partitions are in the correct order. – Daniel Shields Dec 01 '16 at 17:13
  • How would they be out of order? Ordered partitions is a guarantee of the HadoopRDD class which `sc.textFile` uses. – Tim Dec 01 '16 at 17:51
  • @TimP Would the ordering be preserved even in case of reading multiple files through sc.textFile? – girip11 Jun 21 '17 at 17:40
  • Loading multiple text files with `sc.textFile` is just like loading each one separately with `sc.textFile` and unioning the results. The partitions from each file will be ordered. – Tim Jun 21 '17 at 20:14
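The multi-file behavior described above can be sketched with plain Scala collections standing in for the per-file RDDs (file contents here are hypothetical): concatenating the files' lines in file order and then indexing yields contiguous indices across the union, which is what `zipWithIndex` on the unioned RDD produces when each file's partitions are ordered.

```scala
// Hypothetical contents of two input files, in the order they are loaded.
val fileA = List("a1", "a2")
val fileB = List("b1")

// Union the lines in file order, then index: indices run contiguously
// across both files, just as with sc.textFile("fileA,fileB").zipWithIndex.
val combined = (fileA ++ fileB).zipWithIndex
// combined is List(("a1", 0), ("a2", 1), ("b1", 2))
```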