
I loaded a gene expression matrix (33000 rows x 180 samples) into sparklyr and I want to process the probes (rows) rather than the samples (columns).

library(sparklyr)
library(dplyr)

# local Spark connection
sc <- spark_connect(master = "local")

# read the expression matrix into R, then copy it into Spark
ge <- read.delim("plier-gcbg-sketch.summary.txt",
    sep = "\t", comment.char = "#")
ge_tbl <- copy_to(sc, ge)

My original idea was to apply:

mean_by_gene <- ge %>%
    select(-probeset_id) %>%
    rowwise() %>%
    do(data.frame(M = mean(as.numeric(.))))

But it seems that rowwise() is not available in sparklyr, so after some googling I applied:

ge_tbl %>%
    select(-probeset_id) %>%
    spark_apply(function(df) {
        data.frame(M = apply(df, 1, function(x) mean(as.numeric(x))))
    })

This works, in the sense that it produces a data frame with the mean for each row, but I have some doubts about it.

Questions:

  1. When I time both versions with system.time, the sparklyr version is much slower than the standard R one (sparklyr: 0.747 s, 0.702 s, 0.731 s; standard R: 0.009 s, 0.008 s, 0.008 s). Why? For this test I only used 10 probes (rows); a rough sketch of the timing setup is shown after this list.
  2. When I try to compute the full matrix (33000 rows x 180 columns) instead of a small number of columns (10~100), Spark crashes. I assume that I am not processing the table properly, so how can I do this in a way that takes advantage of Spark's capabilities?
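
For context, the timing comparison looked roughly like this. It is a sketch rather than the exact script: the 10-row subset, the collect() call and the M column name are illustrative.

ge_small     <- ge[1:10, ]                         # local 10-probe subset
ge_small_tbl <- copy_to(sc, ge_small, overwrite = TRUE)

# standard R: row means computed locally
system.time({
    local_means <- apply(ge_small[, names(ge_small) != "probeset_id"], 1,
                         function(x) mean(as.numeric(x)))
})

# sparklyr: the same computation through spark_apply()
system.time({
    spark_means <- ge_small_tbl %>%
        select(-probeset_id) %>%
        spark_apply(function(df) {
            data.frame(M = apply(df, 1, function(x) mean(as.numeric(x))))
        }) %>%
        collect()
})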

This is a fragment of the error I obtain:

[...]
18/06/13 13:44:21 INFO sparklyr: RScript (4377) found 33297 rows
18/06/13 13:46:08 INFO sparklyr: RScript (4377) retrieved 33297 rows
18/06/13 13:46:19 INFO sparklyr: RScript (4377) computing closure
18/06/13 13:46:51 ERROR sparklyr: RScript (4377) terminated unexpectedly: invalid subscript type 'list'
18/06/13 13:46:51 ERROR sparklyr: Worker (4377) failed to complete R process
[...]
carlesh
  • The size of your data doesn't justify using Spark, and `local` mode + `copy_to` + `spark_apply` would give you essentially nothing even if it did. My honest advice is to drop the idea - there is nothing to gain here. If you really want to go this way despite that, start with a proper cluster and Spark reader methods (`spark_read_csv`). – zero323 Jun 13 '18 at 14:42
  • And to give you some insight about speed - [Why is Apache-Spark - Python so slow locally as compared to pandas?](https://stackoverflow.com/q/48815341/6910411). Just substitute "Pandas" with "R" and every statement there will hold. – zero323 Jun 13 '18 at 14:47
  • @user6910411 I will update the text, but this is, in a way, a proof of concept as part of a larger pipeline. We want to repeat this process many times with larger data. I will take your advice and use Spark reader methods instead of dplyr methods (a sketch follows these comments). I am currently working on a small test cluster with one master node and 2 worker nodes. – carlesh Jun 13 '18 at 14:59
  • Realistically, how large can your data get? 70000 probes and a few thousand samples? Still nothing that could really benefit from Spark, especially with `sparklyr`. And many standard processing steps will just not scale well in a distributed system like this. – zero323 Jun 13 '18 at 15:00
  • Okay, so you suggest to stop working on this even with some dozens of tables with 10M probes for 5K samples (mixed data, not only gene expression), right? In any case, I would still like to know the answers. – carlesh Jun 13 '18 at 15:03
  • Well, maybe the big picture justifies Spark ([Adam](https://github.com/bigdatagenomics/adam) and [Hail](https://github.com/hail-is/hail) show that you can build useful omics tools on top of Spark), but looking at this question the answer is the same as for the linked Pandas question - wrong tool for the job. One way or another, if you want to build tools on top of Spark, I strongly recommend using the native (Scala) API directly - starting on top of a guest solution will just get in your way. – zero323 Jun 13 '18 at 15:06
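
Following the suggestion to use Spark reader methods, this is a minimal sketch of what loading the file directly into Spark with spark_read_csv() (instead of read.delim() + copy_to()) could look like. The delimiter and header arguments mirror the read.delim() call above; the comment option for skipping the '#' header lines is an assumption about the underlying CSV reader.

ge_tbl <- spark_read_csv(
    sc,
    name      = "ge",
    path      = "plier-gcbg-sketch.summary.txt",
    delimiter = "\t",
    header    = TRUE,
    options   = list(comment = "#")   # assumed option for skipping '#' lines
)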

0 Answers