I have this Spark table:

xydata
y: num 11.00 22.00 33.00 ...
x0: num 1.00 2.00 3.00 ...
x1: num 2.00 3.00 4.00 ...
...
x788: num 2.00 3.00 4.00 ...

and a handle named xy_df that is connected to this table.

I want to invoke the selectExpr function to subtract a column's mean from that column (that is, to center it), something like:

xy_centered <- xy_df %>%  
    spark_dataframe() %>% 
    invoke("selectExpr", list("( y0-mean(y0) ) AS y0mean"))

and I want to apply the same expression to all the other columns.

But when I run it, it gives this error:

Error: org.apache.spark.sql.AnalysisException: expression 'y0' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;

I know this happens because, under standard SQL rules, a column that is used outside an aggregate function (here y0 alongside mean(y0)) has to appear in a GROUP BY clause. How do I add a GROUP BY when calling invoke?

Previously, I managed to achieve this in another way:

  1. Calculate the mean of each column by summarize_all
  2. Collect the mean inside R
  3. Apply this mean using invoke and selectExpr

as explained in this answer, but now I'm trying to reduce the execution time by keeping all operations inside Spark itself, without retrieving anything into R.
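
For reference, the three-step workaround above can be sketched roughly as follows. This is only an illustration, not the code from the linked answer: the helper name center_exprs and the _centered suffix are made up here, and the Spark part is left as comments because it needs a live connection and the xy_df handle.

```r
# Hypothetical helper (not from the linked answer): given a named numeric
# vector of column means already collected into R, build one selectExpr
# string per column of the form "(col - <mean>) AS col_centered".
center_exprs <- function(means) {
  cols <- names(means)
  # %.17g keeps enough digits to round-trip a double as a SQL literal
  unname(sprintf("(%s - %.17g) AS %s_centered", cols, unname(means), cols))
}

# Small local check with made-up means
print(center_exprs(c(y0 = 22, x0 = 2)))

# Usage sketch against Spark (assumes sparklyr and dplyr are loaded and
# xy_df is the handle from the question):
# means <- xy_df %>% summarize_all(mean) %>% collect() %>% unlist()  # steps 1 and 2
# xy_centered <- xy_df %>%
#   spark_dataframe() %>%
#   invoke("selectExpr", as.list(center_exprs(means)))               # step 3
```

The collect() call in step 2 is the round trip to R that I would like to avoid.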

My Spark version is 1.6.0

Benny Suryajaya
  • Could you try this? `xy_centered <- xy_df %>% spark_dataframe() %>% invoke("group_by", list("y0")) %>% invoke("selectExpr", list("( y0-mean(y0) ) AS y0mean"))` – Jaime Caffarel Apr 28 '17 at 07:02
  • @JaimeCaffarel it gave this error: `Error: java.lang.IllegalArgumentException: invalid method group_by for object 157`. Is `group_by` also covered by `invoke`? I don't really understand which methods can be used with `invoke`; are they only `Scala` methods? – Benny Suryajaya Apr 28 '17 at 12:02
