from pyspark.sql import functions as F
x = df.withColumn("id_col", F.monotonically_increasing_id())

This returns large, seemingly random long integers instead of consecutive integers in sorted order.

Clock Slave
Possible duplicate of [Using monotonically_increasing_id() for assigning row number to pyspark dataframe](https://stackoverflow.com/questions/48209667/using-monotonically-increasing-id-for-assigning-row-number-to-pyspark-datafram) – pault Oct 30 '19 at 14:52

1 Answer

What you are seeing is the expected behavior of the function. From the documentation:

"The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the record number within each partition in the lower 33 bits. The assumption is that the data frame has less than 1 billion partitions, and each partition has less than 8 billion records."

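This is why you see large, seemingly random integers. The IDs are not consecutive: they jump by 2^33 (8,589,934,592) whenever the partition changes, because the partition ID sits in the upper bits. They are still in increasing order and unique.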
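
To make the bit layout concrete, here is a minimal pure-Python sketch that rebuilds and splits IDs the way the documentation describes. It does not call Spark at all; `encode_id` and `decode_id` are hypothetical helper names introduced only for illustration, and the layout is assumed to match the "current implementation" mentioned in the quote.

```python
from typing import Tuple

# Layout from the documentation quote:
# upper 31 bits = partition ID, lower 33 bits = record number in that partition.
RECORD_BITS = 33
RECORD_MASK = (1 << RECORD_BITS) - 1


def encode_id(partition_id: int, record_number: int) -> int:
    """Combine the two parts into one ID (illustrative helper, not a Spark API)."""
    return (partition_id << RECORD_BITS) | record_number


def decode_id(generated_id: int) -> Tuple[int, int]:
    """Split an ID into (partition_id, record_number)."""
    return generated_id >> RECORD_BITS, generated_id & RECORD_MASK


# IDs for the first three rows of partitions 0, 1 and 2.
ids = [encode_id(p, r) for p in range(3) for r in range(3)]
print(ids)
print([decode_id(i) for i in ids])
```

Within a partition the IDs go up by one, and the first row of partition 1 gets 8589934592 (2^33), which is the kind of "random" long value seen in the question. If consecutive numbers are needed, a commonly used approach is `row_number()` over a window ordered by this ID column, at the cost of pulling all rows into a single partition for the ordering.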

Clock Slave