
I am using PySpark 1.3.1 and I need to generate a unique id/number for each row in a DataFrame.

Since window functions are not available in PySpark 1.3.1, I am not able to make use of the row_number function.

How can I bring in a row number without the row_number function and without converting the DataFrame into an RDD?

Mohan
  • Maybe use the underlying RDD and `zipWithIndex()`? (see the sketch after these comments) – Zahiro Mor Apr 11 '16 at 08:11
  • Thank you. Is there any approach to generate row numbers without converting the DataFrame into an RDD? I am processing a very large file and trying to reduce unnecessary steps. – Mohan Apr 11 '16 at 09:27
  • Do you need them absolutely sequential (no gaps), or can there be gaps as long as the ordering is maintained? (e.g. [1,2,3,4,5,6,7] vs. [1,2,3,1001,1002,1003,1004]) – David Griffin Apr 11 '16 at 11:16
  • Gaps are okay, but each number in the sequence has to be unique. In your lists, the second one is also fine. – Mohan Apr 11 '16 at 12:20
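
A minimal sketch of the `zipWithIndex()` approach suggested in the comments, assuming an existing `SparkContext` named `sc`; the example data, column names, and the `row_id` field are hypothetical, not from the question:

```python
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, LongType

sqlContext = SQLContext(sc)  # assumes an existing SparkContext `sc`

# Hypothetical example data; replace with the real DataFrame.
df = sqlContext.createDataFrame(
    [("alice", 30), ("bob", 25), ("carol", 41)], ["name", "age"]
)

# zipWithIndex() pairs every row with a consecutive 0-based index.
# Row objects behave like tuples, so appending the index yields a plain tuple per row.
indexed = df.rdd.zipWithIndex().map(lambda pair: pair[0] + (pair[1],))

# Extend the original schema with the new id column and rebuild the DataFrame.
schema_with_id = StructType(
    df.schema.fields + [StructField("row_id", LongType(), False)]
)
df_with_id = sqlContext.createDataFrame(indexed, schema_with_id)

df_with_id.show()
```

This does touch the underlying RDD, which the question hoped to avoid. For the gap-tolerant ordering discussed in the last two comments, later Spark releases (1.4 and up, as far as I know) expose `monotonicallyIncreasingId()` (later renamed `monotonically_increasing_id()`) in `pyspark.sql.functions`, which stays entirely in the DataFrame API but produces unique, non-consecutive ids.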

0 Answers