I have this Spark table:
xydata
y: num 11.00 22.00 33.00 ...
x0: num 1.00 2.00 3.00 ...
x1: num 2.00 3.00 4.00 ...
...
x788: num 2.00 3.00 4.00 ...
And this data frame in the R environment:
penalty
p: num 1.23 2.34 3.45 ...
The table and the data frame have the same number of rows.
I want to subtract the `p` values in `penalty` from the `y` values in `xydata`, something like `y = y - p`.
Is there any way to do this? I know I can use `mutate` to update `y`, but `mutate` can only reference columns in the same table.
I'm thinking of combining both tables into a new Spark table:
xydata_new
y: num 11.00 22.00 33.00 ...
x0: num 1.00 2.00 3.00 ...
x1: num 2.00 3.00 4.00 ...
...
x788: num 2.00 3.00 4.00 ...
p: num 1.23 2.34 3.45 ...
so that I can use `mutate(y = y - p)`, but again I cannot find a good way to combine the two tables. I tried `dplyr::combine` in my other question, but the result was not satisfactory.
The data is big: it can reach 40 GB and maybe even more in the future, so `collect`-ing all tables into the R environment to manipulate them within R (`cbind`, then export back as a Spark table with `tbl`) is not an option.