I use R on Zeppelin at work to develop machine learning models. I extract the data from Hive tables using `%sparkr`, `sql(Constring, 'select * from table')`,
and by default this generates a Spark data frame with 94 million records.
However, I cannot perform all of my R data munging tasks on this Spark data frame, so I try to convert it to an R data frame using `collect()` / `as.data.frame()`,
but I run into driver-node memory / time-out issues.
I was wondering whether the Stack Overflow community is aware of any other way to convert a Spark data frame to an R data frame while avoiding these time-out issues?
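Roughly what I'm running, for reference (`Constring` is our Hive context handle and the table name is a placeholder):

```r
library(SparkR)

# %sparkr paragraph: pull the Hive table into a Spark DataFrame (~94M rows)
sdf <- sql(Constring, "select * from table")

# Either of these is where the memory / time-out failures show up:
rdf <- collect(sdf)        # pulls all 94M rows to the driver
rdf <- as.data.frame(sdf)  # effectively the same collect under the hood
```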

- That's by definition what is required to have an R dataframe. – Alper t. Turker Aug 09 '18 at 00:36
- What if the conversion times out? – Rudr Aug 09 '18 at 00:38
- Then you shouldn't use these. You can rather check the `gapply` or `dapply` methods, which operate on chunks of data in a distributed fashion (see the sketch after these comments). – Alper t. Turker Aug 09 '18 at 00:42
- How about using the sparklyr package, any inputs? – Rudr Aug 09 '18 at 13:41
- Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/177752/discussion-between-rudr-and-user8371915). – Rudr Aug 09 '18 at 17:42
- Sparklyr is not much different - it just adds another layer of indirection by pretending Spark is a database. – Alper t. Turker Aug 09 '18 at 17:45
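A minimal sketch of the `dapply` approach suggested in the comments above (the column names, output schema, and per-partition function are illustrative assumptions, not from the thread):

```r
library(SparkR)

sdf <- sql(Constring, "select id, amount from table")  # placeholder query

# dapply requires the output schema to be declared up front
out_schema <- structType(
  structField("id", "integer"),
  structField("amount_scaled", "double")
)

# The function receives one partition at a time as a plain R data.frame,
# so ordinary R code runs on manageable chunks instead of 94M rows at once
result <- dapply(sdf, function(part) {
  part$amount_scaled <- scale(part$amount)[, 1]
  part[, c("id", "amount_scaled")]
}, out_schema)

head(result)  # inspect a few rows without collecting everything
```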
1 Answer
Did you try caching your Spark dataframe first? If you cache your data first, it may help speed up the collect, since the data is already in RAM... that could get rid of the timeout problem. At the same time, this will only increase your RAM requirements. I too have seen those timeout issues when trying to serialize or deserialize certain data types, or just large amounts of data, between R and Spark. Serialization and deserialization of large data sets is far from a "bulletproof" operation with R and Spark. Moreover, 94M records may simply be too much for your driver node to handle in the first place, especially if there is a lot of dimensionality to your dataset.
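A quick sketch of what I mean by caching first (plain SparkR calls; adjust to however your session/context is set up):

```r
library(SparkR)

sdf <- sql(Constring, "select * from table")  # placeholder query

cache(sdf)   # mark the DataFrame for in-memory storage on the executors
count(sdf)   # an action forces the cache to actually materialize

rdf <- collect(sdf)  # collect now reads cached blocks instead of re-scanning Hive
```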
One workaround I've used, though I'm not proud of it, is to have Spark write the dataframe out as CSV and then have R read those CSV files back in on the next line of the script. Oddly enough, in a few of the cases where I did this, the write-a-file-then-read-the-file method actually ended up being faster than a simple `collect()` operation. A lot faster.
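A sketch of the write-out side (the output path is a placeholder, and the exact options depend on your Spark version):

```r
library(SparkR)

# Spark writes a *directory* of part-*.csv files, not a single CSV
write.df(sdf, path = "hdfs:///tmp/my_export",
         source = "csv", mode = "overwrite", header = "true")
```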
Word of advice: make sure to watch out for partitioning when writing out CSV files with Spark. You'll get a bunch of CSV part files and will have to do some sort of `tmp <- lapply(list_of_csv_files_from_spark, function(x) read.csv(x))` operation to read in each CSV file individually, and then maybe a `df <- do.call("rbind", tmp)`. It would probably be best to use `fread` in place of `read.csv` to read in the CSVs as well (see the sketch below).
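Something along these lines for the read-back side (the path is a placeholder and assumes the part files are visible to the R process, e.g. a mounted or copied-down directory):

```r
library(data.table)

# Spark's CSV output is a directory of part-*.csv files
csv_files <- list.files("/tmp/my_export", pattern = "\\.csv$", full.names = TRUE)

# fread + rbindlist is typically much faster than read.csv + do.call("rbind", ...)
tmp <- lapply(csv_files, fread)
df  <- rbindlist(tmp)
```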
Perhaps the better question is, what other data munging tasks are you unable to do in Spark that you need R for?
Good luck. I hope this was helpful. -nate

- Well, in my case I was trying to use dplyr methods on a Spark data frame and Zeppelin was not allowing it. So I thought converting the Spark df to an R df might resolve the issue; however, my data is huge and the conversion was not an option (at least per the above discussion). It seems I found a way to work around it using the SparkR package, which has most of the methods in the dplyr package (see the sketch below)! https://spark.apache.org/docs/2.3.0/api/R/select.html – Rudr Aug 09 '18 at 20:39
- Glad you were able to figure it out. dplyr and SparkR essentially do the same stuff... where that stuff is really just an SQL operation. SparkR just allows you to do that in a distributed, parallelized manner... which is still pretty cool, IMO. – nate Aug 10 '18 at 17:08
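For anyone landing here later, a small sketch of the kind of SparkR verbs referred to in the comment above (the column names are made up):

```r
library(SparkR)

sdf <- sql(Constring, "select * from table")  # placeholder query

# dplyr-like verbs, but each one runs as a distributed Spark job
big  <- filter(sdf, sdf$amount > 100)
slim <- select(big, "id", "amount")
aggd <- agg(groupBy(slim, "id"), total_amount = sum(slim$amount))

head(collect(aggd))  # only the small aggregated result comes back to R
```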