I have this Spark table:
xydata
y: num 11.00 22.00 33.00 ...
x0: num 1.00 2.00 3.00 ...
x1: num 2.00 3.00 4.00 ...
...
x788: num 2.00 3.00 4.00 ...
and a handle named xy_df that is connected to this table.
I want to invoke the selectExpr function to calculate the mean and subtract it from the column, something like:
xy_centered <- xy_df %>%
  spark_dataframe() %>%
  invoke("selectExpr", list("( y0-mean(y0) ) AS y0mean"))
and the same expression should be applied to all the other columns.
But when I run it, it gives this error:
Error: org.apache.spark.sql.AnalysisException: expression 'y0' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;
I know this happens because, under standard SQL rules, I didn't supply a GROUP BY clause for the column that is used alongside the aggregate function (mean). How do I add the GROUP BY to the invoke method?
Previously, I managed to accomplish this another way:
- Calculate the mean of each column with summarize_all
- Collect the means inside R
- Apply these means using invoke and selectExpr
as explained in this answer, but now I'm trying to speed up the execution a bit by keeping all operations inside Spark itself, without retrieving anything into R.
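The three-step workaround above can be sketched roughly as follows. This is a minimal illustration rather than the exact code that was run: the helper names center_exprs and center_with_collected_means and the _centered suffix are hypothetical, and the second function assumes sparklyr and dplyr are installed and that xy_df is a live tbl_spark handle.

```r
# Step 3 helper: build one "(col - mean) AS col_centered" expression per column.
# `col_means` is a named numeric vector (names = column names).
center_exprs <- function(col_means) {
  unname(sprintf(
    "(%s - %s) AS %s_centered",
    names(col_means),
    sprintf("%.17g", col_means),  # keep full double precision in the SQL literal
    names(col_means)
  ))
}

# Steps 1-3 together. Needs sparklyr, dplyr and a live Spark connection;
# `xy_df` is the tbl_spark handle from the question.
center_with_collected_means <- function(xy_df) {
  # 1. Compute every column mean in Spark; 2. collect the one-row result into R.
  means_tbl <- dplyr::collect(dplyr::summarise_all(xy_df, mean))
  col_means <- unlist(means_tbl)

  # 3. Push the means back to Spark as literals inside selectExpr.
  centered <- sparklyr::invoke(
    sparklyr::spark_dataframe(xy_df),
    "selectExpr",
    as.list(center_exprs(col_means))
  )
  sparklyr::sdf_register(centered)  # wrap the Java DataFrame as a tbl again
}
```

For example, center_exprs(c(y = 22, x0 = 2.5)) produces "(y - 22) AS y_centered" and "(x0 - 2.5) AS x0_centered". The round trip through collect in step 2 is the part I would like to avoid.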
My Spark version is 1.6.0