Does anyone have an overview of the advantages/disadvantages of SparkR vs sparklyr? Google does not yield any satisfactory results, and both seem fairly similar. Trying both out, SparkR appears a lot more cumbersome, whereas sparklyr is…
I know there are plenty of questions on SO about out of memory errors on Spark, but I haven't found a solution to mine.
I have a simple workflow (sketched below):
read in ORC files from Amazon S3
filter down to a small subset of rows
select a small subset of…
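A minimal sketch of that workflow, assuming an existing connection sc; the S3 path, predicate, and column names are hypothetical:

library(sparklyr)
library(dplyr)
# memory = FALSE avoids caching the full table into executor memory
events <- spark_read_orc(sc, name = "events", path = "s3a://my-bucket/events/", memory = FALSE)
small <- events %>%
  filter(event_type == "click") %>%
  select(user_id, event_time)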
Typically, when one wants to use sparklyr with a custom function (i.e., non-translated functions), they place it within spark_apply(). However, I've only encountered examples where a single local data frame is either copy_to() or spark_read_csv() to…
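For context, a minimal sketch of that pattern, assuming an existing connection sc; the computation inside the closure is hypothetical:

library(sparklyr)
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)
# spark_apply() ships the R closure to each partition and runs it there
result <- spark_apply(mtcars_tbl, function(df) {
  df$kpl <- df$mpg * 0.425  # plain, non-translated R code
  df
})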
Introduction
R code is written using the sparklyr package to create a database schema. [Reproducible code and database are given]
Existing Result
root
|-- contributors : string
|-- created_at : string
|-- entities (struct)
| |-- hashtags (array) :…
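For reference, a sketch of how such a schema can be inspected from sparklyr, assuming an existing connection sc and that the data is registered under the (hypothetical) name "tweets":

tweets_tbl <- dplyr::tbl(sc, "tweets")
sdf_schema(tweets_tbl)   # a list of column name/type pairs
spark_dataframe(tweets_tbl) %>%
  invoke("printSchema")  # prints the root |-- tree shown above (to the JVM's stdout)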
Looking to convert some R code to sparklyr, specifically functions such as lmtest::coeftest() and sandwich::sandwich(). Trying to get started with sparklyr extensions, but I'm pretty new to the Spark API and having issues :(
Running Spark 2.1.1 and sparklyr…
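Not the robust-standard-error math itself, but a minimal sketch of the extension mechanism such a port builds on: invoke_static() calls a static JVM method, and invoke() calls a method on an existing Java object reference.

library(sparklyr)
sc <- spark_connect(master = "local")
invoke_static(sc, "java.lang.Math", "sqrt", 4)            # returns 2
sdf_len(sc, 5) %>% spark_dataframe() %>% invoke("count")  # returns 5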
I'm new to sparklyr (but familiar with Spark and PySpark), and I've got a really basic question. I'm trying to filter a column based on a partial match. In dplyr, I'd write my operation like so:
businesses %>%
filter(grepl('test', biz_name)) %>%
…
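For the Spark side, a sketch of one form known to pass through to Spark SQL, assuming businesses is a tbl_spark: dbplyr sends unknown infix operators through untranslated, so %rlike% becomes Spark's RLIKE operator (recent sparklyr versions may also translate grepl() directly).

businesses %>%
  filter(biz_name %rlike% "test")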
This is my code; I run it in Databricks.
library(sparklyr)
library(dplyr)
library(arrow)
sc <- spark_connect(method = "databricks")
tbl_change_db(sc, "prod")
trip_ids <- spark_read_table(sc, "signals", memory = FALSE) %>%
  slice_sample(n = 10) %>%…
I have 500 million rows in a spark dataframe. I'm interested in using sample_n from dplyr because it will allow me to explicitly specify the sample size I want. If I were to use sparklyr::sdf_sample(), I would first have to calculate the…
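A sketch of deriving a fraction for sdf_sample() from a desired sample size, with hypothetical names; note that sdf_sample() samples by fraction, so the returned row count is approximate rather than exact:

n_desired <- 10000
fraction <- n_desired / sdf_nrow(big_tbl)
sampled <- sdf_sample(big_tbl, fraction = fraction, replacement = FALSE, seed = 42)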
As far as I understand, those two packages provide similar but mostly different wrapper functions for Apache Spark. sparklyr is newer and still needs to grow in functionality. I therefore think that one currently needs to use both…
In the following example I've loaded a parquet file that contains a nested record of map objects in the meta field. sparklyr seems to do a nice job of dealing with these. However, tidyr::unnest does not translate to SQL (or HQL - understandably -…
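A sketch of one common workaround, with hypothetical names: sparklyr supports Spark SQL's explode() inside mutate() (a lateral view), and the sparklyr.nested package provides helpers such as sdf_unnest() for struct columns.

flat_tbl <- nested_tbl %>%
  mutate(meta_item = explode(meta))
# or: sparklyr.nested::sdf_unnest(nested_tbl, "meta")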
I want to estimate rolling value-at-risk for a dataset of about 22.5 million observations, thus I want to use sparklyr for fast computation. Here is what I did (using a sample…
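A sketch of how a rolling window can be pushed down to Spark via dbplyr's window helpers; the table, columns, and 250-row trailing frame are hypothetical, and the rolling sd() is a stand-in for whatever quantile-based VaR statistic is actually needed:

library(dbplyr)
returns_tbl %>%
  group_by(asset) %>%
  window_order(trade_date) %>%
  window_frame(from = -249, to = 0) %>%    # trailing 250-row window
  mutate(roll_sd = sd(ret, na.rm = TRUE)) %>%
  ungroup()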
I would like to remove a single data table from the Spark context ('sc'). I know a single cached table can be un-cached, but this isn't the same as removing an object from the sc -- as far as I can…
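A sketch of one way to go beyond un-caching, assuming the table was registered under the (hypothetical) name "my_table": ask the Spark session's catalog to drop the temp view via invoke().

tbl_uncache(sc, "my_table")           # frees the cached data, but the view remains
sc %>%
  spark_session() %>%
  invoke("catalog") %>%
  invoke("dropTempView", "my_table")  # removes the registered view itself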
In base R, it is easy to extract the names of columns (variables) from a data frame:
> testdf <- data.frame(a1 = rnorm(1e5), a2 = rnorm(1e5), a3 = rnorm(1e5), a4 = rnorm(1e5), a5 = rnorm(1e5), a6 = rnorm(1e5))
> names(testdf)
[1] "a1" "a2" "a3"…
Say I have 40 continuous (DoubleType) variables that I've bucketed into quartiles using ft_quantile_discretizer. Identifying the quartiles on all of the variables is super fast, as the function supports execution of multiple variables at once.…
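For reference, a sketch of that multi-column call, assuming a tbl_spark wide_tbl with (hypothetical) columns v1 through v40:

in_cols  <- paste0("v", 1:40)
out_cols <- paste0("v", 1:40, "_q")
bucketed <- wide_tbl %>%
  ft_quantile_discretizer(input_cols = in_cols, output_cols = out_cols, num_buckets = 4)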
I'm trying to convert a Spark DataFrame (org.apache.spark.sql.DataFrame) to a sparklyr table (tbl_spark). I tried sdf_register, but it failed with the following error.
Here, df is the Spark DataFrame.
sdf_register(df, name = "my_tbl")
The error is:
Error:…
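A sketch of one workaround, assuming df is a spark_jobj reference to the Java-side DataFrame and sc is the sparklyr connection: register it as a temp view through invoke(), then pick it up with dplyr::tbl().

invoke(df, "createOrReplaceTempView", "my_tbl")
my_tbl <- dplyr::tbl(sc, "my_tbl")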