
What is the difference between these two queries:

SELECT my_fun(col_name) FROM my_table;

and

CREATE TABLE new_table AS SELECT my_fun(col_name) FROM my_table;

where my_fun is a Java UDF.

I'm asking because when I create a new table (second query) I receive a Java error.

Failure while running task:java.lang.RuntimeException: java.lang.RuntimeException: Map operator initialization failed
...
Caused by: org.apache.hadoop.hive.ql.exec.UDFArgumentException: Unable to instantiate UDF implementation class com.company_name.examples.ExampleUDF: java.lang.NullPointerException

I found that the source of the error is this line in my Java file:

encoded = Files.readAllBytes(Paths.get(configPath));

But the question is: why does it work when no table is created, and fail when one is?

– mc2
  • How many rows are in your table? If it's really small, the first query might be executed locally, whereas the second one runs only on the Hadoop cluster. Also, does `configPath` point to a local path or an HDFS path? – serge_k Aug 03 '20 at 10:42
  • The create-table-as-select (2nd example) is CTAS. It has a few restrictions on the tables it can be used on, which could be causing issues. Have you read the section on [CTAS in the Hive DDL manual](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTableAsSelect(CTAS))? It lists two restrictions: 1: The target table cannot be an external table. 2: The target table cannot be a list bucketing table. Does either of those apply? – zac Aug 03 '20 at 23:25
  • @serge_k It seems that it is not related to table size, i.e. it fails for both small and big tables. I tested configPath on the local file system and on HDFS, and it fails either way. But it could be a good point, as this path is set via an environment variable; if it is hardcoded, it works. I have to check it. – mc2 Aug 04 '20 at 13:10
  • @zac Nope, it is not applicable in this case. – mc2 Aug 04 '20 at 13:11

2 Answers


The problem might be with the way you read the file. Try passing the file path as the second argument to the UDF, then read it as follows:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;

import org.apache.hadoop.hive.ql.metadata.HiveException;

private BufferedReader getReaderFor(String filePath) throws HiveException {
    try {
        Path fullFilePath = FileSystems.getDefault().getPath(filePath);
        Path fileName = fullFilePath.getFileName();
        // Look for the bare file name first (files shipped via ADD FILE /
        // the distributed cache land in the task's working directory),
        // then fall back to the full path.
        if (Files.exists(fileName)) {
            return Files.newBufferedReader(fileName, Charset.defaultCharset());
        } else if (Files.exists(fullFilePath)) {
            return Files.newBufferedReader(fullFilePath, Charset.defaultCharset());
        } else {
            throw new HiveException("Could not find \"" + fileName + "\" or \"" + fullFilePath + "\" in intersect_file() UDF.");
        }
    } catch (IOException exception) {
        throw new HiveException(exception);
    }
}

private void loadFromFile(String filePath) throws HiveException {
    // set is a field of the UDF class that holds the file's lines
    set = new HashSet<String>();

    try (BufferedReader reader = getReaderFor(filePath)) {
        String line;
        while ((line = reader.readLine()) != null) {
            set.add(line);
        }
    } catch (IOException e) {
        throw new HiveException(e);
    }
}

The full code for a different generic UDF that uses a file reader can be found here.
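For example (a sketch; the second argument and the path are purely illustrative), the UDF's evaluate() would pass that argument's value to loadFromFile(), and the call would look like:

SELECT my_fun(col_name, '/path/to/config.txt') FROM my_table;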

– serge_k

I think several points are unclear, so this answer is based on assumptions.

First of all, it is important to understand that Hive currently optimizes some simple queries. Depending on the size of your data, the query that works for you, SELECT my_fun(col_name) FROM my_table;, is most likely running locally on the client where you execute the job, which is why your UDF can access the config file available there; this "execution mode" is due to the size of your data. CTAS, on the other hand, triggers a job regardless of the input data, and that job runs distributed across the cluster, where each worker fails to access your config file.
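You can verify this by forcing the simple SELECT to run as a distributed job instead of a local fetch task (a quick check, assuming a reasonably recent Hive version; hive.fetch.task.conversion is the property controlling this optimization):

-- Disable the local fetch-task optimization, then rerun the first query
SET hive.fetch.task.conversion=none;
SELECT my_fun(col_name) FROM my_table;

If it now fails with the same NullPointerException, that confirms the first query was only succeeding because it ran locally.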

It looks like you are trying to read your configuration file from the local file system, not from HDFS (Files.readAllBytes(Paths.get(configPath))). This means that your configuration either has to be replicated to all the worker nodes or be added to the distributed cache beforehand (you can use ADD FILE for this; doc here). You can find other questions here about accessing files in the distributed cache from UDFs.
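A minimal sketch of the distributed-cache approach (the file name config.txt is just an example): ship the file with the job, then open it by its bare name, since files added this way are placed in each task's working directory:

-- Ship the config file to every task's working directory
ADD FILE /local/path/to/config.txt;
CREATE TABLE new_table AS SELECT my_fun(col_name) FROM my_table;

Inside the UDF you would then read Paths.get("config.txt") rather than an absolute path.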

One additional problem is that you are passing the location of your config file through an environment variable, which is not propagated to the worker nodes as part of your Hive job. You should pass this setting as a Hive configuration property instead; there is an answer here about accessing the Hive configuration from a UDF, assuming that you are extending GenericUDF.
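As a sketch of that approach (the property name my.udf.config.path is made up for illustration), a GenericUDF can read such a value from the job configuration in configure(), which Hive calls during task initialization on each worker:

import org.apache.hadoop.hive.ql.exec.MapredContext;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class ExampleUDF extends GenericUDF {
    private String configPath;

    @Override
    public void configure(MapredContext context) {
        // Runs on each worker when the task starts; properties set with
        // SET in the Hive session are carried in the job configuration.
        configPath = context.getJobConf().get("my.udf.config.path");
    }

    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
        return PrimitiveObjectInspectorFactory.javaStringObjectInspector;
    }

    @Override
    public Object evaluate(DeferredObject[] arguments) throws HiveException {
        // Real logic would read the file at configPath here
        return configPath;
    }

    @Override
    public String getDisplayString(String[] children) {
        return "example_udf()";
    }
}

In the session you would run SET my.udf.config.path=/path/to/config.txt; before the query. Note that configure() is only invoked when the UDF runs inside an actual MapReduce/Tez task, not in local fetch mode.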

– hlagos