
I am using PySpark 1.3.1 and I need to generate a unique id/number for each row in a DataFrame.

Since window functions are not available in PySpark 1.3.1, I am not able to make use of the row_number function.

How can I bring in a row number without the row_number function and without converting the DataFrame into an RDD?

Mohan
  • Maybe use the underlying RDD and `zipWithIndex()`? (see the sketch after these comments) – Zahiro Mor Apr 11 '16 at 08:11
  • Thank you. Is there any approach to generate row numbers without converting the DataFrame into an RDD? I am processing a very large file and trying to reduce unnecessary steps. – Mohan Apr 11 '16 at 09:27
  • Do you need them absolutely sequential (no gaps), or can there be gaps as long as the ordering is maintained? (e.g. [1,2,3,4,5,6,7] vs. [1,2,3,1001,1002,1003,1004]) – David Griffin Apr 11 '16 at 11:16
  • Gaps are okay, but each number in the sequence has to be unique. In your lists, the second one is also fine. – Mohan Apr 11 '16 at 12:20
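
A minimal sketch of the `zipWithIndex()` approach suggested in the comments, assuming an existing `SparkContext` named `sc`; the example data, column names, and the `row_id` field are hypothetical, not from the question:

```python
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, LongType

sqlContext = SQLContext(sc)  # assumes an existing SparkContext `sc`

# Hypothetical example data; replace with the real DataFrame.
df = sqlContext.createDataFrame(
    [("alice", 30), ("bob", 25), ("carol", 41)], ["name", "age"]
)

# zipWithIndex() pairs every row with a consecutive 0-based index.
# Row objects behave like tuples, so appending the index yields a plain tuple per row.
indexed = df.rdd.zipWithIndex().map(lambda pair: pair[0] + (pair[1],))

# Extend the original schema with the new id column and rebuild the DataFrame.
schema_with_id = StructType(
    df.schema.fields + [StructField("row_id", LongType(), False)]
)
df_with_id = sqlContext.createDataFrame(indexed, schema_with_id)

df_with_id.show()
```

This does touch the underlying RDD, which the question hoped to avoid. For the gap-tolerant ordering discussed in the last two comments, later Spark releases (1.4 and up, as far as I know) expose `monotonicallyIncreasingId()` (later renamed `monotonically_increasing_id()`) in `pyspark.sql.functions`, which stays entirely in the DataFrame API but produces unique, non-consecutive ids.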

0 Answers