I have these 2 Spark tables:
simx
x0: num 1.00 2.00 3.00 ...
x1: num 2.00 3.00 4.00 ...
...
x788: num 2.00 3.00 4.00 ...
and
simy
y0: num 1.00 2.00 3.00 ...
In both tables, each column has the same number of values. Tables simx and simy are saved into the handles simX_tbl and simY_tbl respectively. The actual data is quite big and may reach 40 GB.
I want to calculate the correlation coefficient of each column in simx with the corresponding column in simy (something like cor(x0, y0, 'pearson')).
I searched everywhere and I don't think there is any ready-to-use cor function, so I'm thinking about using the correlation formula itself (as mentioned here). Based on a good explanation in my previous question, using mutate_all or mutate_each is not very efficient and gives a C stack error for larger data sizes, so I'm considering using invoke instead to call functions from Spark directly.
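For reference, this is the expression I'm thinking of building; pearson_expr is just a hypothetical helper I sketched, and it assumes the x and y columns end up in the same table:

# sketch only: build the Pearson formula as a single Spark SQL aggregate
# expression for one pair of columns (assumes both columns are in one table)
pearson_expr <- function(x, y) {
  n <- "count(1)"
  paste0(
    "(", n, " * sum(", x, " * ", y, ") - sum(", x, ") * sum(", y, ")) / ",
    "(sqrt(", n, " * sum(", x, " * ", x, ") - sum(", x, ") * sum(", x, ")) * ",
    "sqrt(", n, " * sum(", y, " * ", y, ") - sum(", y, ") * sum(", y, ")))"
  )
}

# one expression per simx column against y0, to be passed to selectExpr later
corr_exprs <- lapply(colnames(simX_tbl), pearson_expr, y = "y0")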
So far, I have managed to get this far:
# build one "sum(column)" expression per column of simx
exprs <- as.list(paste0("sum(", colnames(simX_tbl), ")"))

corr_result <- simX_tbl %>%
  spark_dataframe() %>%                            # drop to the underlying Java DataFrame
  invoke("selectExpr", exprs) %>%                  # run all the sum expressions at once
  invoke("toDF", as.list(colnames(simX_tbl))) %>%  # restore the original column names
  sdf_register("corr_result")                      # register the result as a Spark table
to calculate the sum of each column in simx. But then I realized that I also need the simy table, and I don't know how to make the two tables interact (e.g. accessing simy while manipulating simx).
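The only idea I have so far for getting simy into the same place is to give each handle a generated row id and join on it. This is just a sketch, assuming sdf_with_unique_id is available in my sparklyr version, and I'm aware the generated ids only line up if both tables keep the same row order:

library(sparklyr)
library(dplyr)

# sketch only: attach a row id to both handles and join them into one table,
# so the x and y columns can be used inside the same selectExpr
xy_tbl <- sdf_with_unique_id(simX_tbl, id = "row_id") %>%
  inner_join(sdf_with_unique_id(simY_tbl, id = "row_id"), by = "row_id") %>%
  select(-row_id) %>%
  sdf_register("xy_tbl")

If something like this works, I could run the selectExpr (or the pearson_expr sketch above) against xy_tbl instead of simX_tbl.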
Is there a better way to calculate the correlation? Or maybe just a way to interact with another Spark table?
My Spark version is 1.6.0.
EDIT:
I tried to use the combine function from dplyr:
xy_df <- simX_tbl %>%
  as.data.frame() %>%
  combine(as.data.frame(simY_tbl)) %>%
  # convert both tables to data frames, then combine them.
  # combine() returns a list, so convert back to a data frame
  as.data.frame()

xydata <- copy_to(sc, xy_df, "xydata")  # copy the data frame into a Spark table
But I'm not sure if this is a good solution because:
- I need to load the data into a data frame inside R, which I consider impractical for data this big.
- When I try to head the handle xydata, the column names become a concatenation of all the values:

xydata %>% head
Source: query [6 x 790]
Database: spark connection master=yarn-client app=sparklyr local=FALSE

c_1_67027262134984_2_44919662134984_1_85728542134984_1_49317262134984_
1 1.670273
2 2.449197
3 1.857285
4 1.493173
5 1.576857
6 -5.672155
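One thing I might try instead of combine is a plain cbind, which I think should keep the original column names, though it still means collecting both tables into R first:

# sketch only: cbind keeps the original column names, but still pulls
# both tables into local R memory before copying back to Spark
xy_df <- cbind(as.data.frame(simX_tbl), as.data.frame(simY_tbl))
xydata <- copy_to(sc, xy_df, "xydata")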