Questions tagged [sparklyr]

sparklyr is an alternative R interface for Apache Spark

784 questions
56 votes · 7 answers

SparkR vs sparklyr

Does someone have an overview with respect to advantages/disadvantages of SparkR vs sparklyr? Google does not yield any satisfactory results and both seem fairly similar. Trying both out, SparkR appears a lot more cumbersome, whereas sparklyr is…
koVex • 641
21 votes · 2 answers

Out of memory error when collecting data out of Spark cluster

I know there are plenty of questions on SO about out of memory errors on Spark but I haven't found a solution to mine. I have a simple workflow: read in ORC files from Amazon S3 filter down to a small subset of rows select a small subset of…
jay • 517
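A remedy often suggested for this kind of failure is to give the driver JVM more headroom before connecting, since collect() funnels all results through the driver. A minimal sketch, assuming a local connection and illustrative memory sizes:

```r
library(sparklyr)

config <- spark_config()
# Heap for the driver JVM (passed as --driver-memory):
config$`sparklyr.shell.driver-memory` <- "8G"
# Spark's cap on the total size of results collected to the driver:
config$spark.driver.maxResultSize <- "4G"

sc <- spark_connect(master = "local", config = config)
```

Filtering and selecting on the Spark side before calling collect(), as the question already does, remains the first line of defense; the settings above only matter for what survives that reduction.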
17 votes · 0 answers

Creating Spark Objects from JPEG and using spark_apply() on a non-translated function

Typically when one wants to use sparklyr on a custom function (i.e. non-translated functions) they place them within spark_apply(). However, I've only encountered examples where a single local data frame is either copy_to() or spark_read_csv() to…
14 votes · 1 answer

How to flatten the data of different data types by using Sparklyr package?

Introduction R code is written by using Sparklyr package to create database schema. [Reproducible code and database is given] Existing Result root |-- contributors : string |-- created_at : string |-- entities (struct) | |-- hashtags (array) :…
Shree • 203
14 votes · 1 answer

Matrix Math With Sparklyr

Looking to convert some R code to Sparklyr, functions such as lmtest::coeftest() and sandwich::sandwich(). Trying to get started with Sparklyr extensions but pretty new to the Spark API and having issues :( Running Spark 2.1.1 and sparklyr…
Zafar • 1,897
11 votes · 1 answer

How to filter on partial match using sparklyr

I'm new to sparklyr (but familiar with spark and pyspark), and I've got a really basic question. I'm trying to filter a column based on a partial match. In dplyr, i'd write my operation as so: businesses %>% filter(grepl('test', biz_name)) %>% …
rookie error • 165
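In sparklyr the familiar dplyr idiom usually carries over, because sparklyr's SQL translation maps grepl() to Spark's RLIKE; a sketch with a hypothetical `businesses` table:

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
businesses <- copy_to(sc, data.frame(biz_name = c("test cafe", "corner diner")))

# grepl() should translate to a regex match (RLIKE) on the Spark side:
businesses %>% filter(grepl("test", biz_name))

# A SQL LIKE with wildcards is an explicit alternative:
businesses %>% filter(biz_name %like% "%test%")
```

Running `show_query()` on either pipeline is a quick way to confirm what SQL the translation actually produced on a given sparklyr version.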
10 votes · 3 answers

R and sparklyr: Why is a simple query so slow?

This is my code. I run it in databricks. library(sparklyr) library(dplyr) library(arrow) sc <- spark_connect(method = "databricks") tbl_change_db(sc, "prod") trip_ids <- spark_read_table(sc, "signals",memory=F) %>% slice_sample(10) %>%…
Funkwecker • 766
10 votes · 1 answer

Is sample_n really a random sample when used with sparklyr?

I have 500 million rows in a spark dataframe. I'm interested in using sample_n from dplyr because it will allow me to explicitly specify the sample size I want. If I were to use sparklyr::sdf_sample(), I would first have to calculate the…
kputschko • 766
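The calculation the question alludes to is short: sdf_sample() takes a fraction rather than a row count, so an approximate fixed-size sample needs the table size first. A sketch with an illustrative table and target size:

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
big_tbl <- sdf_copy_to(sc, data.frame(x = 1:1000), overwrite = TRUE)

n <- 100                              # hypothetical desired sample size
frac <- n / sdf_nrow(big_tbl)         # convert count to a fraction
sampled <- sdf_sample(big_tbl, fraction = frac,
                      replacement = FALSE, seed = 42)
```

Note that Spark's sampling is probabilistic, so the result has roughly, not exactly, n rows; an exact n requires an extra `head(n)` (or similar) on a slightly over-sampled fraction.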
10 votes · 1 answer

Using SparkR and Sparklyr simultaneously

As far as I understood, those two packages provide similar but mostly different wrapper functions for Apache Spark. Sparklyr is newer and still needs to grow in the scope of functionality. I therefore think that one currently needs to use both…
CodingButStillAlive • 733
10 votes · 4 answers

Is there a way to deal with nested data with sparklyr?

In the following example I've loaded a parquet file that contains a nested record of map objects in the meta field. sparklyr seems to do a nice job of dealing with these. However tidyr::unnest does not translate to SQL (or HQL - understandably -…
Matt Pollock • 1,063
9 votes · 2 answers

rollapply for large data using sparklyr

I want to estimate rolling value-at-risk for a dataset of about 22.5 million observations, thus I want to use sparklyr for fast computation. Here is what I did (using a sample…
Jairaj Gupta • 347
9 votes · 2 answers

SparklyR removing a Table from Spark Context

Would like to remove a single data table from the Spark Context ('sc'). I know a single cached table can be un-cached, but this isn't the same as removing an object from the sc -- as far as I can…
eyeOfTheStorm • 351
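The usual recipe in answers to this question distinguishes the three things involved: the cached data, the table registered in the Spark catalog, and the R-side handle. A sketch, assuming a table registered as "iris_spark":

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
iris_tbl <- copy_to(sc, iris, "iris_spark")

tbl_uncache(sc, "iris_spark")          # free the cached data
DBI::dbRemoveTable(sc, "iris_spark")   # drop the table from the catalog
rm(iris_tbl)                           # discard the R-side reference
```

Whether dropping the catalog entry goes through the DBI interface as above or through a dbplyr helper varies by sparklyr/dplyr version, so treat the middle line as one plausible spelling rather than the only one.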
9 votes · 1 answer

Simple command for extracting column names in sparklyr (R+spark)

In base r, it is easy to extract the names of columns (variables) from a data frame > testdf <- data.frame(a1 = rnorm(1e5), a2 = rnorm(1e5), a3 = rnorm(1e5), a4 = rnorm(1e5), a5 = rnorm(1e5), a6 = rnorm(1e5)) > names(testdf) [1] "a1" "a2" "a3"…
Prasanna • 148
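The base-R habit transfers more directly than the question fears: colnames() works on a tbl_spark because sparklyr exposes the column metadata without touching the data. A sketch:

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
testdf_tbl <- copy_to(sc, data.frame(a1 = rnorm(10), a2 = rnorm(10)))

colnames(testdf_tbl)    # column names straight from the tbl_spark
tbl_vars(testdf_tbl)    # dplyr's view of the same variable list
```

Both calls read only the schema, so they stay cheap even on very wide or very long Spark tables.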
8 votes · 1 answer

Slowdown with repeated calls to spark dataframe in memory

Say I have 40 continuous (DoubleType) variables that I've bucketed into quartiles using ft_quantile_discretizer. Identifying the quartiles on all of the variables is super fast, as the function supports execution of multiple variables at once.…
hgb1234 • 83
8 votes · 1 answer

Convert spark dataframe to sparklyR table "tbl_spark"

I'm trying to convert spark dataframe org.apache.spark.sql.DataFrame to a sparklyr table tbl_spark. I tried with sdf_register, but it failed with following error. In here, df is spark dataframe. sdf_register(df, name = "my_tbl") error is, Error:…
sen • 198
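When sdf_register() balks at a raw Java-side DataFrame handle, a workaround that sidesteps it is to register a temp view on the JVM side and read it back as an ordinary tbl_spark. A sketch, where `df` stands in for the spark_jobj from the question (here fabricated via invoke() for self-containment):

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Stand-in for the question's spark_jobj DataFrame:
df <- spark_session(sc) %>% invoke("sql", "SELECT 1 AS x")

# Register a temp view on the Java side, then re-read it as a tbl_spark:
invoke(df, "createOrReplaceTempView", "my_tbl")
my_tbl <- tbl(sc, "my_tbl")
```

The resulting `my_tbl` is a regular dplyr-backed table, so the rest of a sparklyr pipeline can proceed as usual.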