
I want to create an ID column for my PySpark DataFrame. I have a column A with repeated numbers, and I want to take all the distinct values and assign an ID to each one.

I have:

+----+
|   A|
+----+
|1001|
|1002|
|1003|
|1001|
|1003|
|1004|
|1001|
+----+

And I want:

+----+----+
|   A| new|
+----+----+
|1002|   1|
|1001|   2|
|1004|   3|
|1003|   4|
+----+----+

This is my code:

# Libraries
import pyspark 
from pyspark.sql import SQLContext
import pandas as pd
import numpy as np
from pyspark import SparkContext
sc = SparkContext()
sqlContext = SQLContext(sc)

# Create a PySpark dataframe from a pandas dataframe
df = pd.DataFrame()
df["A"] = [1001,1002,1003,1001,1003,1004,1001]
df = sqlContext.createDataFrame(df)

IDs = df.select("A").distinct()

from pyspark.sql.functions import monotonically_increasing_id
IDs = IDs.withColumn("new", monotonically_increasing_id())
IDs.show()

And I get:

+----+-------------+
|   A|          new|
+----+-------------+
|1002| 188978561024|
|1001|1065151889408|
|1004|1511828488192|
|1003|1623497637888|
+----+-------------+

But it should be:

+----+----+
|   A| new|
+----+----+
|1002|   1|
|1001|   2|
|1004|   3|
|1003|   4|
+----+----+

Why am I getting that result?

  • Possible duplicate of [Using monotonically_increasing_id() for assigning row number to pyspark dataframe](https://stackoverflow.com/questions/48209667/using-monotonically-increasing-id-for-assigning-row-number-to-pyspark-datafram) – mkrieger1 Jun 19 '19 at 21:08
  • Apparently it already works correctly, just not as you expected. – mkrieger1 Jun 19 '19 at 21:09
  • The [documentation](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.monotonically_increasing_id) is very clear on this: *The generated ID is guaranteed to be monotonically increasing and unique, but **not consecutive**.* – pault Jun 19 '19 at 21:11
  • Possible duplicate of [Pyspark add sequential and deterministic index to dataframe](https://stackoverflow.com/questions/52318016/pyspark-add-sequential-and-deterministic-index-to-dataframe) – pault Jun 19 '19 at 21:12
  • Possible duplicate of [How do I add an persistent column of row ids to Spark DataFrame?](https://stackoverflow.com/questions/35705038/how-do-i-add-an-persistent-column-of-row-ids-to-spark-dataframe) – Nikhil Suthar Jun 20 '19 at 05:33

1 Answer


monotonically_increasing_id is guaranteed to be monotonically increasing and unique, but not consecutive: the generated 64-bit ID puts the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits, which is why you see those large, widely spaced values. You can use the row_number() window function instead, which will give you the consecutive IDs you want.

>>> from pyspark.sql.window import Window
>>> from pyspark.sql.functions import row_number, lit

# lit(1) keeps everything in one window partition
>>> w = Window.partitionBy(lit(1)).orderBy("A")
>>> df.show()
+----+
|   A|
+----+
|1001|
|1003|
|1001|
|1004|
|1005|
|1003|
|1005|
|1003|
|1006|
|1001|
|1002|
+----+

>>> df1 = df.select("A").distinct().withColumn("ID", row_number().over(w))
>>> df1.show()
+----+---+
|   A| ID|
+----+---+
|1001|  1|
|1002|  2|
|1003|  3|
|1004|  4|
|1005|  5|
|1006|  6|
+----+---+ 
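
Note that partitioning the window by lit(1) pulls all the distinct values into a single task, which is fine here because the set of distinct values is small. If you also want the new ID attached to every row of the original dataframe (not just the distinct values), one option is to join the mapping back on A. A minimal sketch, assuming the df and df1 above (df_with_id is a name introduced here for illustration):

>>> # attach the generated ID to every original row via the distinct-value mapping
>>> df_with_id = df.join(df1, on="A", how="left")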