0

For example, if I have a dataframe with a name column, where each name can occur multiple times:

+------+
| name |
+------+
|Alice |
|Bob   |
|Alice |
|Chloe |
|Chloe |
+------+

I want to have a column where each name gets a unique id starting from 0:

+------+----+
| name | id |
+------+----+
|Alice | 0  |
|Bob   | 1  |
|Alice | 0  |
|Chloe | 2  |
|Chloe | 2  |
+------+----+

How do I achieve this using PySpark? One possible way is to create a data frame with a column of distinct names and assign index such as using row_number. But that involves joining back to the original table. So I wonder if there's a direct way to achieve this.

Algorithman
  • 1,309
  • 1
  • 16
  • 39

2 Answers2

1

One easy method is to use StringIndexer - https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.StringIndexer.html

from pyspark.ml.feature import StringIndexer
from pyspark.sql.functions import col
df = spark.createDataFrame([("Alice", 1), ("Bravo", 2), ("Charlie", 3), ("Alice", 4), ("Bravo", 5)], ["name", "id"])

string_indexer = StringIndexer(inputCol='name', outputCol='index').setHandleInvalid("keep")
model = string_indexer.fit(df)
df_index = model.transform(df).select('id', 'name', col('index').cast('int'))
df_index.show()   
+---+-------+-----+
| id|   name|index|
+---+-------+-----+
|  1|  Alice|    0|
|  2|  Bravo|    1|
|  3|Charlie|    2|
|  4|  Alice|    0|
|  5|  Bravo|    1|

+---+-------+-----+
greenie
  • 409
  • 3
  • 6
1

Use window functions. Logic and code below

   new =(df.withColumn('order', row_number().over(Window.partitionBy(lit('1')).orderBy(lit('1'))))#Create increasing id
          .withColumn('id', dense_rank().over(Window.partitionBy().orderBy('name'))-1).orderBy('order')#Use window function; dense_ranl to generate new id
          .drop('order')
         ).show()

+-------+---+
|   name| id|
+-------+---+
|  Alice|  0|
|  Bravo|  1|
|Charlie|  2|
|  Alice|  0|
|  Bravo|  1|
+-------+---+
wwnde
  • 26,119
  • 6
  • 18
  • 32
  • thanks this works. Just curious, will it be possible to assign id without ordering? based on the occurrence of name, e.g. if Bravo shows before Alice then bravo will get id 0 and Alice 1 – Algorithman Sep 14 '22 at 01:56