I loaded a gene expression matrix (33000 rows x 180 samples) into sparklyr, and I want to process it by probe (row) rather than by sample (column).
library(sparklyr)
sc <- spark_connect(master = "local")
library(dplyr)
ge <- read.delim("plier-gcbg-sketch.summary.txt",
                 sep = "\t", comment.char = "#")
ge_tbl <- copy_to(sc, ge)
My original idea was to apply:
mean_by_gene <- ge %>%
  select(-probeset_id) %>%
  rowwise() %>%
  do(data.frame(M = mean(as.numeric(.))))
But it seems that rowwise() is not available in sparklyr. So, after some googling, I applied:
ge_tbl2 %>%
  select(-probeset_id) %>%
  spark_apply(function(df) {
    data.frame(apply(df, 1, function(x) mean(as.numeric(x))))
  })
This works, in that it returns a data.frame with the mean of each row, but I have some doubts about it.
Questions:
- When I time both versions with system.time, the sparklyr version is much slower than the standard R one (sparklyr: 0.747 s, 0.702 s, 0.731 s; standard R: 0.009 s, 0.008 s, 0.008 s). Why? For this test I used only 10 probes (rows).
- When I try to process the full matrix (33000 rows x 180 columns) instead of a small number of columns (10~100), Spark crashes. I assume that I am not processing the table properly, so how can I do this in a way that takes advantage of Spark's capabilities?
This is a fragment of the error I obtain:
[...]
18/06/13 13:44:21 INFO sparklyr: RScript (4377) found 33297 rows
18/06/13 13:46:08 INFO sparklyr: RScript (4377) retrieved 33297 rows
18/06/13 13:46:19 INFO sparklyr: RScript (4377) computing closure
18/06/13 13:46:51 ERROR sparklyr: RScript (4377) terminated unexpectedly: invalid subscript type 'list'
18/06/13 13:46:51 ERROR sparklyr: Worker (4377) failed to complete R process
[...]
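
To make the intended result concrete, here is a minimal base-R sketch of the per-row mean on a tiny local data frame. The name ge_toy and all of its values are made up for illustration (only the probeset_id column name comes from my real table), and nothing here touches Spark. It just shows that the apply() call used inside the spark_apply() closure gives the same numbers as the vectorised rowMeans().

```r
# Hypothetical toy data standing in for the real expression matrix:
# 4 probes (rows) x 3 samples (columns); all values are made up.
ge_toy <- data.frame(
  probeset_id = c("p1", "p2", "p3", "p4"),
  s1 = c(1, 2, 3, 4),
  s2 = c(2, 4, 6, 8),
  s3 = c(3, 6, 9, 12),
  stringsAsFactors = FALSE
)

# Keep only the numeric sample columns (drop the probe identifier).
sample_cols <- setdiff(names(ge_toy), "probeset_id")
expr <- ge_toy[, sample_cols]

# Row-wise mean via apply(), the same call used in the spark_apply() closure.
mean_apply <- unname(apply(expr, 1, function(x) mean(as.numeric(x))))

# Same computation with the vectorised rowMeans().
mean_rowmeans <- unname(rowMeans(expr))

# One mean per probe, keyed by probe identifier.
mean_by_gene <- data.frame(
  probeset_id = ge_toy$probeset_id,
  M = mean_rowmeans,
  stringsAsFactors = FALSE
)
print(mean_by_gene)
```

This is the output shape I am after from the Spark version: one row per probe with a single column M holding the mean across samples.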