I have this Spark table:

xydata
y: num 11.00 22.00 33.00 ...
x0: num 1.00 2.00 3.00 ...
x1: num 2.00 3.00 4.00 ...
...
x788: num 2.00 3.00 4.00 ...

And this dataframe in R environment:

penalty
p: num 1.23 2.34 3.45 ...

The number of rows in the table and in the dataframe is the same.

I want to subtract the p values in penalty from the y values in xydata, i.e. something like y = y - p.

Is there any way to do this? I know I can use mutate to update y, but mutate only works within a single table.

I'm thinking about combining both tables into a new Spark table:

xydata_new
y: num 11.00 22.00 33.00 ...
x0: num 1.00 2.00 3.00 ...
x1: num 2.00 3.00 4.00 ...
...
x788: num 2.00 3.00 4.00 ...
p: num 1.23 2.34 3.45 ...

so that I can use mutate(y = y - p), but again I cannot find a good way to combine the two tables. I tried dplyr::combine in my other question, but the result was not satisfactory.
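For concreteness, here is a rough sketch of how xydata_new could be built without collecting xydata: copy the single-column penalty dataframe into Spark, attach a sequential row id to both tables, and join on that id. This is untested and assumes a sparklyr connection named sc; penalty_sdf, xydata_id, penalty_id and row_id are names introduced here for illustration. As the comments below point out, it pairs rows by their current order, which Spark does not guarantee in general.

```r
library(sparklyr)
library(dplyr)

# Copy the small penalty dataframe into Spark (one column, same row count).
penalty_sdf <- copy_to(sc, penalty, name = "penalty_spark", overwrite = TRUE)

# Give both tables an explicit row id so they can be combined column-wise.
# NOTE: this assumes the current row order of both tables is the intended pairing.
xydata_id  <- sdf_with_sequential_id(xydata, id = "row_id")
penalty_id <- sdf_with_sequential_id(penalty_sdf, id = "row_id")

# Join on the row id, subtract, and drop the helper column.
xydata_new <- xydata_id %>%
  inner_join(penalty_id, by = "row_id") %>%
  mutate(y = y - p) %>%
  select(-row_id)
```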

The data is large: it can already reach 40 GB and may grow further, so collect-ing the tables into the R environment to manipulate them there (cbind, then export back as a Spark table with tbl) is not an option.

  • Spark does not guarantee any row order in its DataFrames at all, so I am not sure that some kind of cbind between a SparkDataFrame and a local R dataframe would yield the results you expect. Is row order important for you? – Janna Maas Apr 27 '17 at 10:31
  • Unfortunately row order is important because each row represents one sample, so they have to be paired. Can you please explain why the row order is not guaranteed? – Benny Suryajaya Apr 27 '17 at 12:27
  • It's because Spark DataFrames are partitioned. It depends on how you've created both your DataFrame and your penalty whether you can assume the row order is preserved (e.g., if you've done a `sort` then your data will remain in that order). [here](http://stackoverflow.com/questions/29268210/mind-blown-rdd-zip-method/29281548#29281548) is a hopefully helpful discussion on that topic. – Janna Maas Apr 27 '17 at 13:07
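To make the point in that last comment concrete: the safest moment to attach a row id is right after each table is created, before any transformation that could repartition or reorder it. A minimal sketch of that pattern (the spark_read_csv call and its path are placeholders, not taken from the question):

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Attach the row id immediately after creating the table, before any
# transformation that could change partitioning or row order.
# The path below is a placeholder.
xydata <- spark_read_csv(sc, name = "xydata", path = "path/to/xydata.csv") %>%
  sdf_with_sequential_id(id = "row_id")
```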

0 Answers