Questions tagged [sparkr]

SparkR is an R package that provides a light-weight frontend to use Apache Spark from R.


SparkR exposes the Spark API through the RDD class and allows users to interactively run jobs from the R shell on a cluster.

SparkR exposes the RDD API of Spark as distributed lists in R.



796 questions
61 votes, 10 answers

How do I read a Parquet in R and convert it to an R DataFrame?

I'd like to process Apache Parquet files (in my case, generated in Spark) in the R programming language. Is an R reader available? Or is work being done on one? If not, what would be the most expedient way to get there? Note: There are Java and C++…
metasim
56 votes, 7 answers

SparkR vs sparklyr

Does someone have an overview with respect to advantages/disadvantages of SparkR vs sparklyr? Google does not yield any satisfactory results and both seem fairly similar. Trying both out, SparkR appears a lot more cumbersome, whereas sparklyr is…
koVex
52 votes, 4 answers

Installing of SparkR

I have the latest version of R, 3.2.1. Now I want to install SparkR in R. After I execute: > install.packages("SparkR") I get back: Installing package into ‘/home/user/R/x86_64-pc-linux-gnu-library/3.2’ (as ‘lib’ is unspecified) Warning in…
Guforu
35 votes, 3 answers

Difference between createOrReplaceTempView and registerTempTable

I am new to Spark and was trying out a few commands in Spark SQL using Python when I came across these two commands: createOrReplaceTempView() and registerTempTable(). What is the difference between the two commands? They seem to have the same set of…
Amogh Huilgol
14 votes, 6 answers

Summing multiple columns in Spark

How can I sum multiple columns in Spark? For example, in SparkR the following code works to get the sum of one column, but if I try to get the sum of both columns in df, I get an error. # Create SparkDataFrame df <- createDataFrame(faithful) # Use…
Gaurav Bansal
14 votes, 7 answers

Unable to launch SparkR in RStudio

After a long and difficult installation of SparkR, I am running into new problems launching it. My settings: R 3.2.0, RStudio 0.98.1103, Rtools 3.3, Spark 1.4.0, Java version 8, SparkR 1.4.0, Windows 7 SP 1 64-bit. Now I try to use…
Patrick C.
13 votes, 2 answers

Add column to DataFrame in sparkR

I would like to add a column filled with the character N to a DataFrame in SparkR. With non-SparkR code I would do it like this: df$new_column <- "N" But with SparkR, I get the following error: Error: class(value) == "Column" || is.null(value) is…
François M.
12 votes, 1 answer

Using SparkR JVM to call methods from a Scala jar file

I wanted to be able to package DataFrames in a Scala jar file and access them in R. The end goal is to create a way to access specific and often-used database tables in Python, R, and Scala without writing a different library for each. To do this,…
mfliu
10 votes, 1 answer

Using SparkR and Sparklyr simultaneously

As far as I understood, those two packages provide similar but mostly different wrapper functions for Apache Spark. Sparklyr is newer and still needs to grow in the scope of functionality. I therefore think that one currently needs to use both…
CodingButStillAlive
10 votes, 4 answers

Duplicate columns in Spark Dataframe

I have a 10 GB CSV file with duplicate columns in a Hadoop cluster. I am trying to analyse it in SparkR, so I use the spark-csv package to parse it as a DataFrame: df <- read.df( sqlContext, FILE_PATH, source = "com.databricks.spark.csv", header =…
Bamqf
9 votes, 2 answers

How to handle null entries in SparkR

I have a SparkSQL DataFrame. Some entries in this data are empty but they don't behave like NULL or NA. How could I remove them? Any ideas? In R I can easily remove them, but in SparkR it says that there is a problem with the S4 system/methods.…
Ole Petersen
8 votes, 2 answers

How to call Sagemaker training model endpoint API in C#

I have implemented machine learning algorithms through SageMaker. I have installed the SDK for .NET and tried executing the code below. Uri sagemakerEndPointURI = new…
Diboliya
8 votes, 2 answers

Why is collect in SparkR so slow?

I have a 500K-row Spark DataFrame that lives in a Parquet file. I'm using Spark 2.0.0 and the SparkR package inside Spark (RStudio and R 3.3.1), all running on a local machine with 4 cores and 8 GB of RAM. To facilitate construction of a dataset I…
Wil Van Cleve
8 votes, 1 answer

zeppelin with sparkr is not displaying dataframe as table

The Zeppelin R interpreter documentation states: "If you return a data.frame, Zeppelin will attempt to display it using Zeppelin's built-in visualizations." This can be seen in the documentation example: However, when I attempt to run the same R…
Chris Snow
7 votes, 3 answers

Convert date to end of month in Spark

I have a Spark DataFrame as shown below: #Create DataFrame df <- data.frame(name = c("Thomas", "William", "Bill", "John"), dates = c('2017-01-05', '2017-02-23', '2017-03-16', '2017-04-08')) df <- createDataFrame(df) #Make sure df$dates…
Gaurav Bansal