I want to create an ID column for my PySpark DataFrame. Column A contains repeated numbers, and I want to take each distinct value and assign it an ID.
I have:
+----+
| A|
+----+
|1001|
|1002|
|1003|
|1001|
|1003|
|1004|
|1001|
+----+
And I want:
+----+----+
| A| new|
+----+----+
|1002| 1|
|1001| 2|
|1004| 3|
|1003| 4|
+----+----+
This is my code:
# Libraries
import pyspark
from pyspark.sql import SQLContext
import pandas as pd
import numpy as np
from pyspark import SparkContext
sc = SparkContext()
sqlContext = SQLContext(sc)
# Create PySpark DataFrame
df = pd.DataFrame()
df["A"] = [1001,1002,1003,1001,1003,1004,1001]
df = sqlContext.createDataFrame(df)
# Keep only the distinct values of column A
IDs = df.select("A").distinct()
# Add an ID column using monotonically_increasing_id
from pyspark.sql.functions import monotonically_increasing_id
IDs = IDs.withColumn("new", monotonically_increasing_id())
IDs.show()
And I get:
+----+-------------+
| A| new|
+----+-------------+
|1002| 188978561024|
|1001|1065151889408|
|1004|1511828488192|
|1003|1623497637888|
+----+-------------+
But it should be:
+----+----+
| A| new|
+----+----+
|1002| 1|
|1001| 2|
|1004| 3|
|1003| 4|
+----+----+
Why am I getting that result?
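A note on the numbers I got: as far as I can tell from the Spark documentation, monotonically_increasing_id() only promises unique, increasing 64-bit values, not consecutive ones. The documented layout puts the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits. Below is a minimal plain-Python sketch (no Spark needed) that decodes the four values from my output under that assumption. `decode_id` and `RECORD_BITS` are hypothetical names for illustration, not part of the Spark API.

```python
# Assumed layout, taken from the Spark docs for monotonically_increasing_id:
# upper 31 bits = partition ID, lower 33 bits = record number in that partition.
RECORD_BITS = 33


def decode_id(value):
    """Split a generated ID into (partition_id, record_number)."""
    partition_id = value >> RECORD_BITS
    record_number = value & ((1 << RECORD_BITS) - 1)
    return partition_id, record_number


# The four values shown in the output of IDs.show() above
observed = [188978561024, 1065151889408, 1511828488192, 1623497637888]
decoded = [decode_id(v) for v in observed]

for value, (partition_id, record_number) in zip(observed, decoded):
    print(value, "-> partition", partition_id, "record", record_number)
```

If this reading is right, each distinct value ended up as the first record of a different partition, which would account for the large gaps between the IDs.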