I'm trying to pass in an array to a Hive UDF via collect_set:

SELECT ..., collect_set(...) FROM ...;

And my Hive UDF wants to take in this array and append the first letter of each array element to an output string:

import org.apache.hadoop.hive.ql.exec.UDF;

public class MyUDF extends UDF {

    public String evaluate(String[] array) {
        String output = "";

        // Check for valid argument
        if (array == null) return output;

        try {
            // Add first character of every array element to output string
            for (int i = 0; i < array.length; i++) {
                output += array[i].charAt(0);

                // If there is another array element after this one, append the delimiter
                if (i + 1 < array.length) output += ",";
            }
        } catch (Exception e) {
            System.out.println(e.getMessage());
            System.exit(1);
        }
        return output;
    }
}

But this is the error I get when I try to run it:

ADD JAR ./list_builder.jar;
CREATE TEMPORARY FUNCTION build_list as 'MyCustomUDF.MyUDF';

SELECT ..., build_list(collect_set(description)) FROM ...;

...
FAILED: SemanticException [Error 10014]: Line 142:21 Wrong arguments 'description': No matching method for class MyCustomUDF.MyUDF with (array<string>). Possible choices: _FUNC_(struct<>)

I've tried changing String[] to ArrayList and List but I'm still hitting the same error.

Note: The output of collect_set is something like ["L-ADD", "P-OAN", "P-OAH"], so I'm expecting output from my UDF like L,P,P.

Any ideas?

Thanks.

Travis Liew
  • I see no problem with your code. What is the exact input you are giving? It looks like you are trying to pass a "description" data structure, which is not acceptable as is. You might have to convert it as necessary. – Raghuveer May 25 '15 at 07:29
  • You might not need a custom UDF for your use case. Have you tried using the substr function? – kostya May 25 '15 at 10:45
  • http://stackoverflow.com/questions/6445339/collect-set-in-hive-keep-duplicates – Kishore May 26 '15 at 05:25

2 Answers

Following @kostya's comment, I used substr:

SELECT ..., collect_set(substr(description,0,1)) FROM ...;

Which meant I didn't need a UDF.

Thanks.

Travis Liew

Try ArrayList<String> instead of String[], because Hive passes the array as array<string>, not as String[]:

public class MyUDF extends UDF {

    public String evaluate(ArrayList<String> array) {
        // ... same loop as in the question, using array.size() and array.get(i) ...
    }
}
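
To make the idea concrete, here is a minimal sketch of the string-building logic with a typed list parameter. It is written as a plain Java class so that it runs without Hive on the classpath. The class name FirstLetters and the method join are hypothetical, and the assumption is that a UDF's evaluate(List<String>) or evaluate(ArrayList<String>) method would simply delegate to it. It skips null or empty elements rather than calling System.exit, which would terminate the whole JVM.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class; not part of Hive.
public class FirstLetters {

    // Joins the first character of each element with commas.
    // Null or empty elements are skipped; a null list yields "".
    public static String join(List<String> items) {
        if (items == null) {
            return "";
        }
        StringBuilder out = new StringBuilder();
        for (String s : items) {
            if (s == null || s.isEmpty()) {
                continue;
            }
            // Add the delimiter only between appended characters
            if (out.length() > 0) {
                out.append(',');
            }
            out.append(s.charAt(0));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Sample values taken from the question's collect_set output
        System.out.println(join(Arrays.asList("L-ADD", "P-OAN", "P-OAH")));
    }
}
```

Run on its own, this prints L,P,P for the sample values from the question. Using a StringBuilder also avoids the repeated string concatenation in the original loop.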