Questions tagged [sparklyr]

sparklyr is an alternative R interface for Apache Spark

784 questions
56 votes · 7 answers

SparkR vs sparklyr

Does someone have an overview with respect to advantages/disadvantages of SparkR vs sparklyr? Google does not yield any satisfactory results and both seem fairly similar. Trying both out, SparkR appears a lot more cumbersome, whereas sparklyr is…
koVex • 641
21 votes · 2 answers

Out of memory error when collecting data out of Spark cluster

I know there are plenty of questions on SO about out of memory errors on Spark but I haven't found a solution to mine. I have a simple workflow: read in ORC files from Amazon S3 filter down to a small subset of rows select a small subset of…
jay • 517
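A remedy often suggested for this kind of failure is to give the driver JVM more headroom before connecting, since collect() funnels all results through the driver. A minimal sketch, assuming a local connection and illustrative memory sizes:

```r
library(sparklyr)

config <- spark_config()
# Heap for the driver JVM (passed as --driver-memory):
config$`sparklyr.shell.driver-memory` <- "8G"
# Spark's cap on the total size of results collected to the driver:
config$spark.driver.maxResultSize <- "4G"

sc <- spark_connect(master = "local", config = config)
```

Filtering and selecting on the Spark side before calling collect(), as the question already does, remains the first line of defense; the settings above only matter for what survives that reduction.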
17 votes · 0 answers

Creating Spark Objects from JPEG and using spark_apply() on a non-translated function

Typically when one wants to use sparklyr on a custom function (i.e. non-translated functions) they place them within spark_apply(). However, I've only encountered examples where a single local data frame is either copy_to() or spark_read_csv() to…
14 votes · 1 answer

How to flatten the data of different data types by using Sparklyr package?

Introduction R code is written by using Sparklyr package to create database schema. [Reproducible code and database is given] Existing Result root |-- contributors : string |-- created_at : string |-- entities (struct) | |-- hashtags (array) :…
Shree • 203
14 votes · 1 answer

Matrix Math With Sparklyr

Looking to convert some R code to Sparklyr, functions such as lmtest::coeftest() and sandwich::sandwich(). Trying to get started with Sparklyr extensions but pretty new to the Spark API and having issues :( Running Spark 2.1.1 and sparklyr…
Zafar • 1,897
11 votes · 1 answer

How to filter on partial match using sparklyr

I'm new to sparklyr (but familiar with spark and pyspark), and I've got a really basic question. I'm trying to filter a column based on a partial match. In dplyr, i'd write my operation as so: businesses %>% filter(grepl('test', biz_name)) %>% …
rookie error • 165
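In sparklyr the familiar dplyr idiom usually carries over, because sparklyr's SQL translation maps grepl() to Spark's RLIKE; a sketch with a hypothetical `businesses` table:

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
businesses <- copy_to(sc, data.frame(biz_name = c("test cafe", "corner diner")))

# grepl() should translate to a regex match (RLIKE) on the Spark side:
businesses %>% filter(grepl("test", biz_name))

# A SQL LIKE with wildcards is an explicit alternative:
businesses %>% filter(biz_name %like% "%test%")
```

Running `show_query()` on either pipeline is a quick way to confirm what SQL the translation actually produced on a given sparklyr version.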
10 votes · 3 answers

R and sparklyr: Why is a simple query so slow?

This is my code. I run it in databricks. library(sparklyr) library(dplyr) library(arrow) sc <- spark_connect(method = "databricks") tbl_change_db(sc, "prod") trip_ids <- spark_read_table(sc, "signals",memory=F) %>% slice_sample(10) %>%…
Funkwecker • 766
10 votes · 1 answer

Is sample_n really a random sample when used with sparklyr?

I have 500 million rows in a spark dataframe. I'm interested in using sample_n from dplyr because it will allow me to explicitly specify the sample size I want. If I were to use sparklyr::sdf_sample(), I would first have to calculate the…
kputschko • 766
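The calculation the question alludes to is short: sdf_sample() takes a fraction rather than a row count, so an approximate fixed-size sample needs the table size first. A sketch with an illustrative table and target size:

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
big_tbl <- sdf_copy_to(sc, data.frame(x = 1:1000), overwrite = TRUE)

n <- 100                              # hypothetical desired sample size
frac <- n / sdf_nrow(big_tbl)         # convert count to a fraction
sampled <- sdf_sample(big_tbl, fraction = frac,
                      replacement = FALSE, seed = 42)
```

Note that Spark's sampling is probabilistic, so the result has roughly, not exactly, n rows; an exact n requires an extra `head(n)` (or similar) on a slightly over-sampled fraction.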
10 votes · 1 answer

Using SparkR and Sparklyr simultaneously

As far as I understood, those two packages provide similar but mostly different wrapper functions for Apache Spark. Sparklyr is newer and still needs to grow in the scope of functionality. I therefore think that one currently needs to use both…
CodingButStillAlive • 733
10 votes · 4 answers

Is there a way to deal with nested data with sparklyr?

In the following example I've loaded a parquet file that contains a nested record of map objects in the meta field. sparklyr seems to do a nice job of dealing with these. However tidyr::unnest does not translate to SQL (or HQL - understandably -…
Matt Pollock • 1,063
9 votes · 2 answers

rollapply for large data using sparklyr

I want to estimate rolling value-at-risk for a dataset of about 22.5 million observations, thus I want to use sparklyr for fast computation. Here is what I did (using a sample…
Jairaj Gupta • 347
9 votes · 2 answers

SparklyR removing a Table from Spark Context

Would like to remove a single data table from the Spark Context ('sc'). I know a single cached table can be un-cached, but this isn't the same as removing an object from the sc -- as far as I can…
eyeOfTheStorm • 351
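The usual recipe in answers to this question distinguishes the three things involved: the cached data, the table registered in the Spark catalog, and the R-side handle. A sketch, assuming a table registered as "iris_spark":

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
iris_tbl <- copy_to(sc, iris, "iris_spark")

tbl_uncache(sc, "iris_spark")          # free the cached data
DBI::dbRemoveTable(sc, "iris_spark")   # drop the table from the catalog
rm(iris_tbl)                           # discard the R-side reference
```

Whether dropping the catalog entry goes through the DBI interface as above or through a dbplyr helper varies by sparklyr/dplyr version, so treat the middle line as one plausible spelling rather than the only one.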
9 votes · 1 answer

Simple command for extracting column names in sparklyr (R+spark)

In base r, it is easy to extract the names of columns (variables) from a data frame > testdf <- data.frame(a1 = rnorm(1e5), a2 = rnorm(1e5), a3 = rnorm(1e5), a4 = rnorm(1e5), a5 = rnorm(1e5), a6 = rnorm(1e5)) > names(testdf) [1] "a1" "a2" "a3"…
Prasanna • 148
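The base-R habit transfers more directly than the question fears: colnames() works on a tbl_spark because sparklyr exposes the column metadata without touching the data. A sketch:

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
testdf_tbl <- copy_to(sc, data.frame(a1 = rnorm(10), a2 = rnorm(10)))

colnames(testdf_tbl)    # column names straight from the tbl_spark
tbl_vars(testdf_tbl)    # dplyr's view of the same variable list
```

Both calls read only the schema, so they stay cheap even on very wide or very long Spark tables.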
8 votes · 1 answer

Slowdown with repeated calls to spark dataframe in memory

Say I have 40 continuous (DoubleType) variables that I've bucketed into quartiles using ft_quantile_discretizer. Identifying the quartiles on all of the variables is super fast, as the function supports execution of multiple variables at once.…
hgb1234 • 83
8 votes · 1 answer

Convert spark dataframe to sparklyR table "tbl_spark"

I'm trying to convert spark dataframe org.apache.spark.sql.DataFrame to a sparklyr table tbl_spark. I tried with sdf_register, but it failed with following error. In here, df is spark dataframe. sdf_register(df, name = "my_tbl") error is, Error:…
sen • 198
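When sdf_register() balks at a raw Java-side DataFrame handle, a workaround that sidesteps it is to register a temp view on the JVM side and read it back as an ordinary tbl_spark. A sketch, where `df` stands in for the spark_jobj from the question (here fabricated via invoke() for self-containment):

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Stand-in for the question's spark_jobj DataFrame:
df <- spark_session(sc) %>% invoke("sql", "SELECT 1 AS x")

# Register a temp view on the Java side, then re-read it as a tbl_spark:
invoke(df, "createOrReplaceTempView", "my_tbl")
my_tbl <- tbl(sc, "my_tbl")
```

The resulting `my_tbl` is a regular dplyr-backed table, so the rest of a sparklyr pipeline can proceed as usual.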