Spark SQL sql("").first().getDouble(0) give me inconsistent results

Question

I have the below query which is supposed to find an average of the column values and return me the result which is a single number.

val avgVal = hiveContext.sql("select round(avg(amount), 4) from users.payment where dt between '2018-05-09' and '2018-05-09'").first().getDouble(0)

I'm facing inconsistent behavior at this statement. This often fails with below error however it gives non-NULL results when executed through Hive."

18/05/10 11:01:12 ERROR ApplicationMaster: User class threw exception: java.lang.NullPointerException: Value at index 0 in null
java.lang.NullPointerException: Value at index 0 in null
    at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:475)
    at org.apache.spark.sql.Row$class.getDouble(Row.scala:243)
    at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:192)

The reason why I use HiveContext instead of SQLContext is that the later doesn't support some of the aggregation functions which I use extensively in my code.

Could you please help me understand why this problem occurs and how to solve?

What is the result of `hiveContext.sql("select round(avg(amount), 4) from users.payment where dt between '2018-05-09' and '2018-05-09'").first()`? — J-Alex, May 10 '18 at 09:51
I get non-Null result when debugged. Infact my job completes successfully when I rerun for the same set data. — pushpavanthar, May 10 '18 at 10:01
And I verified that I've all non-Null numbers in amount column. — pushpavanthar, May 10 '18 at 10:05

Bejond · Answer 1 · 2018-05-10T10:22:04.320

1

You need to divide query and get into two parts:

var result = hiveContext.sql("select round(avg(amount), 4) from users.payment where dt between '2018-05-09' and '2018-05-09'");
var first = result.first();
if (first != null && !first.isNullAt(0)) {
var avgVal = first.getDouble(0);
}

This would avoid NPE. This would also be needed in List and array.

For insert or update query, you even need to surround with try...catch block to catch runtime exception.

edited May 10 '18 at 10:22

answered May 10 '18 at 09:40

Bejond

1,188
1
11
18

my point is why am I getting Null value when there is a proper genuine result for the query. – pushpavanthar May 10 '18 at 09:46
@puru did you try to debug the value of `result` in my code? Since HiveContext is subclass of SQLContext I don't see any difference between `sql(str)`. Actually, they are the same. – Bejond May 10 '18 at 09:53
This is not correct. The `first ` itself may not be `null`, but `null` here is value at the index position. – J-Alex May 10 '18 at 10:04

J-Alex · Answer 2 · 2018-05-10T10:24:08.980

Let's analyze the case and possible reasons when this exception can be thrown.

Row row = hiveContext.sql("select info, name, desc, id from users.payment where dt between '2018-05-09' and '2018-05-09'").first();

If the value of the row above returned something like:

[null, Kevin, cash, 300]

Trying to get getDouble(0) will lead to java.lang.NullPointerException: Value at index 0 in null

You can try the following approach:

Row row = hiveContext.sql("select round(avg(amount), 4) from users.payment where dt between '2018-05-09' and '2018-05-09'").first();

if (!row.isNullAt(0))
   double d = row.getDouble(0);
else
   logger.error("Value at index zero is null");

If you'll check the sources, the library class doing opposite:

private static Object getAnyValAs(Row $this, int i) {
    if($this.isNullAt(i)) {
        throw new NullPointerException((new StringContext(scala.Predef..MODULE$.wrapRefArray((Object[])(new String[]{"Value at index ", " is null"})))).s(scala.Predef..MODULE$.genericWrapArray(new Object[]{BoxesRunTime.boxToInteger(i)})));
    } else {
        return $this.getAs(i);
    }
}

The Row of Dataset should be single double value. See `select rount(avg(amount), 4)` — Bejond, May 10 '18 at 10:06
I understand your concern about handling NullPointerException. I wanted to understand why the above code gives inconsistent results. Sometimes NULL and sometimes valid data when operated on the same set of an unchanged dataset. — pushpavanthar, May 11 '18 at 11:21

Spark SQL sql("").first().getDouble(0) give me inconsistent results

2 Answers2

Linked