
In Scala Spark I have:

import sqlContext.implicits._  // needed for toDF outside the shell

val df = sc.parallelize(0 to 3).toDF("x")
df.registerTempTable("df")

sqlContext.sql("select * from df").show
+---+
|  x|
+---+
|  0|
|  1|
|  2|
|  3|
+---+

and would like to average the non-zero values only. I tried this (it does not work):

sqlContext.sql("select avg(nullif(x,0)) from df").show

What is a simple and efficient way to average the non-zero values?

echo
  • I guess NULLIF is not supported in Spark SQL: "org.apache.spark.sql.AnalysisException: undefined function NULLIF;" – RoyaumeIX May 31 '16 at 05:27
  • Try to check this thread : http://stackoverflow.com/questions/32357164/sparksql-how-to-deal-with-null-values-in-user-defined-function – RoyaumeIX May 31 '16 at 05:31
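The thread linked in the comment above deals with nulls in UDFs, which suggests one workaround: register a UDF that maps 0 to null and average over that. A minimal sketch, assuming a helper registered as nullif0 (a hypothetical name, not a built-in):

// hypothetical UDF: returns None (null) for 0, so avg skips those rows
sqlContext.udf.register("nullif0", (x: Int) => if (x == 0) None else Some(x))

sqlContext.sql("select avg(nullif0(x)) from df").show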

3 Answers


Try:

sqlContext.sql(
  "select avg(case when x = 0 then null else x end), avg(x) from df"
).show
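The same case-when logic can be expressed through the DataFrame API; a minimal sketch, assuming the df from the question (when/otherwise maps 0 to null, and avg simply ignores nulls):

import org.apache.spark.sql.functions.{avg, when}

// 0 becomes null and is skipped by avg; the plain avg is kept for comparison
df.select(
  avg(when(df("x") === 0, null).otherwise(df("x"))),
  avg(df("x"))
).show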
zero323
Zahiro Mor

To select only the non-zero values, add a where clause:

 sqlContext.sql("select avg(x) from df where x >0").show

The response I get is:

+---+
|_c0|
+---+
|2.0|
+---+
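Note that x > 0 only excludes zeros when the data is never negative. A variation on the same query that drops only the zeros (assuming the df from the question):

sqlContext.sql("select avg(x) from df where x <> 0").show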
mohit

You could also try this without an SQL statement:

Java:

df.filter(df.col("x").gt(0).or(df.col("x").lt(0))) // x > 0 or x < 0
  .select(org.apache.spark.sql.functions.avg("x")) // avg(x)
  .show();

Scala:

df.filter(df("x")>0 || df("x")<0)
  .select(avg("x"))
  .show
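The two-sided filter can also be collapsed into a single inequality; a minimal sketch, assuming Spark 1.x where Column defines !== (renamed =!= in Spark 2.0):

import org.apache.spark.sql.functions.avg

// keep every row where x != 0, then average the survivors
df.filter(df("x") !== 0)
  .select(avg("x"))
  .show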
Yuan JI