I am reading data from a Hive table and creating a DataFrame in pyspark using:

hive_df = sqlContext.sql("select * from table") 

The DataFrame hive_df has three columns: (cust_id, name, l_name)

In the Hive table, the cust_id field is null for all records, so I want to populate it with values in an incremental manner.

Data in hive table

cust_id,name,l_name
       , abc,   def
       , ghi,   jkl
       , mno,   pqr

Desired Output

cust_id,name,l_name
   1000, abc,   def
   1001, ghi,   jkl
   1002, mno,   pqr
    use [`pyspark.sql.functions.monotonically_increasing_id`](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.monotonically_increasing_id) like this: `from pyspark.sql.functions import monotonically_increasing_id; hive_df = hive_df.withColumn("cust_id", monotonically_increasing_id())` – pault Aug 07 '18 at 14:39
    If the starting value does not matter see this question, ie. the monotonically increasing id: https://stackoverflow.com/questions/46213986/how-could-i-add-a-column-to-a-dataframe-in-pyspark-with-incremental-values – dorvak Aug 07 '18 at 14:39
  • It's working for me, thank you pault and dorvak. Is it possible for the counter to start from some initial value? I want it to start at 1000. – sandeep rathore Aug 07 '18 at 18:41

0 Answers