
I'm new to the Spark environment. I use Spark SQL in my project. I want to create an auto-increment field in a Spark SQL temporary table. I created a UDF, but it doesn't work properly. I have tried various examples from the internet. This is my Java POJO class:

public class AutoIncrementId {
    int lastValue;

    // Return the next value in the sequence on each call.
    public int evaluate() {
        lastValue++;
        return lastValue;
    }
}
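
For context, a minimal sketch of how such a class would typically be registered as a Spark SQL UDF (assuming Spark 2.3+ for the UDF0 interface; the session, table, and column names are placeholders):

import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF0;
import org.apache.spark.sql.types.DataTypes;

public class RegisterCounterUdf {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("RegisterCounterUdf")
                .getOrCreate();

        // NOTE: for this to ship to executors, AutoIncrementId would also need
        // to implement java.io.Serializable, and even then each task keeps its
        // own copy of lastValue, so the values do not form one global sequence.
        AutoIncrementId counter = new AutoIncrementId();
        spark.udf().register("auto_increment_id",
                (UDF0<Integer>) counter::evaluate, DataTypes.IntegerType);

        spark.sql("SELECT auto_increment_id() AS id, col1 FROM my_temp_table").show();
    }
}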

1 Answer


We can use a Hive stateful UDF for auto-increment values. The code goes like this:

package org.apache.hadoop.hive.contrib.udf;

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.udf.UDFType;
import org.apache.hadoop.io.LongWritable;

/**
 * UDFRowSequence.
 */
@Description(name = "row_sequence",
    value = "_FUNC_() - Returns a generated row sequence number starting from 1")
@UDFType(deterministic = false, stateful = true)
public class UDFRowSequence extends UDF
{
  private LongWritable result = new LongWritable();

  public UDFRowSequence() {
    result.set(0);
  }

  public LongWritable evaluate() {
    result.set(result.get() + 1);
    return result;
  }
}

// End UDFRowSequence.java

Register the UDF:

CREATE TEMPORARY FUNCTION auto_increment_id AS 
   'org.apache.hadoop.hive.contrib.udf.UDFRowSequence'

Usage:

SELECT auto_increment_id() as id, col1, col2 FROM table_name
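
In Spark, the same registration and query can be issued through a SparkSession built with Hive support. A minimal sketch, assuming the compiled UDF jar is on the classpath and that table_name with columns col1 and col2 exists:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class AutoIncrementIdExample {
    public static void main(String[] args) {
        // Hive support is needed so CREATE TEMPORARY FUNCTION is understood.
        SparkSession spark = SparkSession.builder()
                .appName("AutoIncrementIdExample")
                .enableHiveSupport()
                .getOrCreate();

        // Register the stateful Hive UDF under the name auto_increment_id.
        spark.sql("CREATE TEMPORARY FUNCTION auto_increment_id AS "
                + "'org.apache.hadoop.hive.contrib.udf.UDFRowSequence'");

        // Attach a sequence number to every row of the table.
        Dataset<Row> withIds = spark.sql(
                "SELECT auto_increment_id() AS id, col1, col2 FROM table_name");
        withIds.show();
    }
}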

A similar question was answered here: How to implement auto increment in spark SQL.

  • I need something like this, but the question is: will it scale to 200 million rows? I actually want to break a big file containing 200 million rows into smaller files of exactly 10K rows each. I thought I would add an auto-increment number to each row and then read in batches with a predicate like (id > 10,001 and id < 20,000). Will this work at that scale? Please suggest. – Hrishikesh Mishra Mar 03 '18 at 20:06