Spark: assign a number to each word in a collected column

Question

I have collected the data of a DataFrame column in Spark:

temp = df.select('item_code').collect()

Result: 

[Row(item_code=u'I0938'),
 Row(item_code=u'I0009'),
 Row(item_code=u'I0010'),
 Row(item_code=u'I0010'),
 Row(item_code=u'C0723'),
 Row(item_code=u'I1097'),
 Row(item_code=u'C0117'),
 Row(item_code=u'I0009'),
 Row(item_code=u'I0009'),
 Row(item_code=u'I0009'),
 Row(item_code=u'I0010'),
 Row(item_code=u'I0009'),
 Row(item_code=u'C0117'),
 Row(item_code=u'I0009'),
 Row(item_code=u'I0596')]

Now I would like to assign a number to each word; if a word is duplicated, every occurrence should get the same number. I am using Spark (RDD), not Pandas.

Please help me resolve this problem!

score 1 · Answer 1 · answered Oct 02 '17 at 03:47


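You could create a new DataFrame that holds only the distinct values. This assumes temp is the DataFrame returned by df.select("item_code"), before collect() is called.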

val data = temp.distinct()

Now you can assign a unique ID using

import org.apache.spark.sql.functions._ 

val dataWithId = data.withColumn("uniqueID", monotonically_increasing_id())

Now you can join this new dataframe with the original dataframe and select the unique id.

val tempWithId = temp.join(dataWithId, "item_code").select("item_code", "uniqueID")

The code assumes Scala, but something similar should exist for PySpark as well. Just consider this as a pointer.
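The distinct-then-number idea above can also be sketched in plain Python on a list that has already been collected, as temp is in the question. This is only a sketch under assumptions: the namedtuple is a stand-in for pyspark.sql.Row so the snippet runs without Spark, assign_ids is a hypothetical helper name, and the approach is only reasonable when the collected list fits in driver memory.

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row so the sketch runs without Spark (an assumption);
# real Row objects expose the column the same way, as row.item_code.
Row = namedtuple("Row", ["item_code"])


def assign_ids(rows):
    """Give each distinct item_code an int, numbered from 0 in order of first appearance."""
    ids = {}
    numbered = []
    for row in rows:
        # setdefault stores the next free number only when the code is new,
        # so duplicates always receive the number assigned the first time.
        code_id = ids.setdefault(row.item_code, len(ids))
        numbered.append((row.item_code, code_id))
    return numbered, ids


temp = [Row("I0938"), Row("I0009"), Row("I0010"), Row("I0010"), Row("C0723")]
numbered, ids = assign_ids(temp)
print(numbered)
```

The numbers produced this way are small, consecutive Python ints, whereas the IDs from the DataFrame approach are only guaranteed to be unique and increasing, not consecutive.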


user238607


Now, can you help me assign a number of type int (not bigint)? – Phong Nguyen Oct 02 '17 at 07:21
@PhongNguyen: Look at the answer here: https://stackoverflow.com/questions/32284620/how-to-change-a-dataframe-column-from-string-type-to-double-type-in-pyspark. You will have to cast the column to IntegerType. – user238607 Oct 02 '17 at 12:20
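A caveat on casting these IDs to int, sketched as plain arithmetic: the Spark documentation describes the current implementation of monotonically_increasing_id as putting the partition ID in the upper 31 bits and the record number within the partition in the lower 33 bits. The helper monotonic_id below is hypothetical and only mimics that layout; it is not a Spark API.

```python
INT_MAX = 2**31 - 1  # largest value of a 32-bit signed integer


def monotonic_id(partition_id, record_number):
    """Mimic the documented layout: partition ID in the upper 31 bits,
    record number within the partition in the lower 33 bits."""
    return (partition_id << 33) + record_number


first_in_partition_0 = monotonic_id(0, 0)
first_in_partition_1 = monotonic_id(1, 0)

# Partition 0 starts at 0, which fits in an int.
print(first_in_partition_0, first_in_partition_0 <= INT_MAX)
# Partition 1 starts at 2**33, which does not fit in an int.
print(first_in_partition_1, first_in_partition_1 <= INT_MAX)
```

So the cast is only safe while every generated ID stays within the int range, which in this layout means the distinct values sit in a single partition. If small consecutive numbers are needed, a ranking function such as dense_rank over a window, or zipWithIndex on the RDD, may be a better fit.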
