I have these 2 Spark tables:
simx
x0: num 1.00 2.00 3.00 ...
x1: num 2.00 3.00 4.00 ...
...
x788: num 2.00 3.00 4.00 ...
and
simy
y0: num 1.00 2.00 3.00 ...
In both tables, each column has the same number of values. Tables simx and simy are saved into the handles simX_tbl and simY_tbl respectively. The actual data is quite big and may reach 40 GB.
I want to calculate the correlation coefficient of each column in simx with the corresponding column in simy (something like cor(x0, y0, 'pearson')).
I searched everywhere and I don't think there is any ready-to-use cor function, so I'm thinking about using the correlation formula itself (as mentioned here). Based on a good explanation in my previous question, using mutate_all or mutate_each is not very efficient and gives a C stack error for larger data sizes, so I'm considering using invoke instead to call functions from Spark directly.
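For reference, this is the expression I'm thinking of building; pearson_expr is just a hypothetical helper I sketched, and it assumes the x and y columns end up in the same table:

# sketch only: build the Pearson formula as a single Spark SQL aggregate
# expression for one pair of columns (assumes both columns are in one table)
pearson_expr <- function(x, y) {
  n <- "count(1)"
  paste0(
    "(", n, " * sum(", x, " * ", y, ") - sum(", x, ") * sum(", y, ")) / ",
    "(sqrt(", n, " * sum(", x, " * ", x, ") - sum(", x, ") * sum(", x, ")) * ",
    "sqrt(", n, " * sum(", y, " * ", y, ") - sum(", y, ") * sum(", y, ")))"
  )
}

# one expression per simx column against y0, to be passed to selectExpr later
corr_exprs <- lapply(colnames(simX_tbl), pearson_expr, y = "y0")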
So far, I have managed to get this far:
# build one "sum(column)" expression per column of simx
exprs <- as.list(paste0("sum(", colnames(simX_tbl), ")"))

corr_result <- simX_tbl %>%
  spark_dataframe() %>%                            # drop to the underlying Java DataFrame
  invoke("selectExpr", exprs) %>%                  # run all the sum expressions at once
  invoke("toDF", as.list(colnames(simX_tbl))) %>%  # restore the original column names
  sdf_register("corr_result")                      # register the result as a Spark table
to calculate the sum of each column in simx. But then I realized that I also need the simy table, and I don't know how to make the two tables interact (e.g. accessing simy while manipulating simx).
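The only idea I have so far for getting simy into the same place is to give each handle a generated row id and join on it. This is just a sketch, assuming sdf_with_unique_id is available in my sparklyr version, and I'm aware the generated ids only line up if both tables keep the same row order:

library(sparklyr)
library(dplyr)

# sketch only: attach a row id to both handles and join them into one table,
# so the x and y columns can be used inside the same selectExpr
xy_tbl <- sdf_with_unique_id(simX_tbl, id = "row_id") %>%
  inner_join(sdf_with_unique_id(simY_tbl, id = "row_id"), by = "row_id") %>%
  select(-row_id) %>%
  sdf_register("xy_tbl")

If something like this works, I could run the selectExpr (or the pearson_expr sketch above) against xy_tbl instead of simX_tbl.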
Is there a better way to calculate the correlation? Or maybe just a way to interact with another Spark table?
My Spark version is 1.6.0.
EDIT:
I tried to use the combine function from dplyr:
xy_df <- simX_tbl %>%
  as.data.frame() %>%
  combine(as.data.frame(simY_tbl)) %>%
  # convert both tables to data frames, then combine them.
  # combine() returns a list, so convert back to a data frame
  as.data.frame()

xydata <- copy_to(sc, xy_df, "xydata")  # copy the data frame into a Spark table
But I'm not sure if this is a good solution because:
- I need to load the data into a data frame inside R, which I consider impractical for data this big.
- When I try to head the handle xydata, the column names become a concatenation of all the values:

xydata %>% head
Source: query [6 x 790]
Database: spark connection master=yarn-client app=sparklyr local=FALSE

c_1_67027262134984_2_44919662134984_1_85728542134984_1_49317262134984_
1 1.670273
2 2.449197
3 1.857285
4 1.493173
5 1.576857
6 -5.672155
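One thing I might try instead of combine is a plain cbind, which I think should keep the original column names, though it still means collecting both tables into R first:

# sketch only: cbind keeps the original column names, but still pulls
# both tables into local R memory before copying back to Spark
xy_df <- cbind(as.data.frame(simX_tbl), as.data.frame(simY_tbl))
xydata <- copy_to(sc, xy_df, "xydata")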