Hive auto increment UDF doesn't give desired results

Question

I am trying to create a UDF in Hive. This UDF has to auto increment a hive table column called id.

Now the following is the Java code to create the UDF.

package myudf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.udf.UDFType;


@UDFType(deterministic = false, stateful = true)
public class autoincrement extends UDF{

      int lastValue;

    public int evaluate() {

     lastValue++;

        return lastValue;

   }

}

Now I am able create a jar file and add the jar file to hive like below:

add jar /home/cloudera/Desktop/increment.jar;

Then create a temporary function

create temporary function inc as 'myudf.autoincrement';

Create table like below.

Create table abc(id int, name string)

Insert values:

INSERT into TABLE abc SELECT inc() as id, 'Tim';

Do select statement:

select * from abc;

Output:

1  Tim

Insert values:

INSERT into TABLE abc SELECT inc() as id, 'John';

Do select statement:

select * from abc

Output:

1  Tim
1  John

But what I was expecting was when I insert values for the 2nd time.

My expected output was :

1  Tim
2  John

How to get the expected output. What should I change in the Java code to get the desired result?

And Can I use the same function in Spark as well

In spark when I do

sqlContext.sql("show functions")

It shows the list of all functions available in Hive

But when I do

sqlContext.sql("INSERT into TABLE abc SELECT inc() as id, 'Jim'")

I got the below error

pyspark.sql.utils.AnalysisException: u'undefined function inc; line 1 pos 29'

How to create the same UDF in pyspark and get the desired output

What happens when the insert statements are executed at the same time?

This isn't how UDF's work, a UDF can only know about the values its passed, you can't have a global state. — maxymoo, May 25 '17 at 03:43
if you just need unique id you could use the solution in https://stackoverflow.com/questions/33102727/primary-keys-with-apache-spark, otherwise you're going to have to do a sort — maxymoo, May 25 '17 at 03:44

score 1 · Accepted Answer · answered May 25 '17 at 13:14

Follow the below steps

change your insert to INSERT into TABLE abc SELECT max(id)+1 as id, 'Tim' from abc;
or
Modify the UDF to take int column as input and return input+1
modify your insert to INSERT into TABLE abc SELECT inc(max(id)) as id, 'Tim' from abc;

You have to try the correctness of the SQL in hive as I have checked and it works in MYSQL.

Hive auto increment UDF doesn't give desired results

1 Answers1