0

I am trying to create a UDF in Hive. This UDF has to auto increment a hive table column called id.

Now the following is the Java code to create the UDF.

package myudf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.udf.UDFType;


@UDFType(deterministic = false, stateful = true)
public class autoincrement extends UDF{

      int lastValue;

    public int evaluate() {

     lastValue++;

        return lastValue;

   }

} 

Now I am able create a jar file and add the jar file to hive like below:

add jar /home/cloudera/Desktop/increment.jar;

Then create a temporary function

create temporary function inc as 'myudf.autoincrement';

Create table like below.

Create table abc(id int, name string)

Insert values:

INSERT into TABLE abc SELECT inc() as id, 'Tim';

Do select statement:

select * from abc;

Output:

1  Tim

Insert values:

INSERT into TABLE abc SELECT inc() as id, 'John';

Do select statement:

select * from abc

Output:

1  Tim
1  John

But what I was expecting was when I insert values for the 2nd time.

My expected output was :

1  Tim
2  John

How to get the expected output. What should I change in the Java code to get the desired result?

And Can I use the same function in Spark as well

In spark when I do

sqlContext.sql("show functions") 

It shows the list of all functions available in Hive

But when I do

sqlContext.sql("INSERT into TABLE abc SELECT inc() as id, 'Jim'")

I got the below error

pyspark.sql.utils.AnalysisException: u'undefined function inc; line 1 pos 29'

How to create the same UDF in pyspark and get the desired output

What happens when the insert statements are executed at the same time?

User12345
  • 5,180
  • 14
  • 58
  • 105
  • This isn't how UDF's work, a UDF can only know about the values its passed, you can't have a global state. – maxymoo May 25 '17 at 03:43
  • if you just need unique id you could use the solution in https://stackoverflow.com/questions/33102727/primary-keys-with-apache-spark, otherwise you're going to have to do a sort – maxymoo May 25 '17 at 03:44

1 Answers1

1

Follow the below steps

  1. change your insert to INSERT into TABLE abc SELECT max(id)+1 as id, 'Tim' from abc;
    or
  2. Modify the UDF to take int column as input and return input+1
  3. modify your insert to INSERT into TABLE abc SELECT inc(max(id)) as id, 'Tim' from abc;

You have to try the correctness of the SQL in hive as I have checked and it works in MYSQL.

Abraham
  • 423
  • 3
  • 9