Questions tagged [apache-spark-dataset]

Spark Dataset is a strongly typed collection of objects mapped to a relational schema. It supports optimizations similar to those of Spark DataFrames while providing a type-safe programming interface.

950 questions
340 votes, 14 answers

Difference between DataFrame, Dataset, and RDD in Spark

I'm just wondering what the difference is between an RDD and a DataFrame (in Spark 2.0.0, DataFrame is a mere type alias for Dataset[Row]) in Apache Spark. Can you convert one to the other?
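
The conversions asked about are direct. A minimal sketch (the SparkSession name spark and the Person case class are illustrative assumptions):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("demo").master("local[*]").getOrCreate()
    import spark.implicits._

    case class Person(name: String, age: Int)

    val rdd  = spark.sparkContext.parallelize(Seq(Person("Ann", 30)))
    val df   = rdd.toDF()        // RDD -> DataFrame (an alias for Dataset[Row])
    val ds   = df.as[Person]     // DataFrame -> typed Dataset
    val back = ds.rdd            // Dataset -> RDD
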
165 votes, 9 answers

How to store custom objects in Dataset?

According to Introducing Spark Datasets: As we look forward to Spark 2.0, we plan some exciting improvements to Datasets, specifically: ... Custom encoders – while we currently autogenerate encoders for a wide variety of types, we’d like to…
zero323
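
A commonly cited workaround for types without a built-in encoder is a kryo-based binary encoder. A minimal sketch (MyObj is a hypothetical class; spark is an existing SparkSession):

    import org.apache.spark.sql.{Encoder, Encoders}

    class MyObj(val x: Int)   // arbitrary class with no derived encoder

    // Binary kryo encoder: it works, but the stored column is an opaque blob
    implicit val myObjEncoder: Encoder[MyObj] = Encoders.kryo[MyObj]

    val ds = spark.createDataset(Seq(new MyObj(1), new MyObj(2)))
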
66 votes, 3 answers

Why is "Unable to find encoder for type stored in a Dataset" when creating a dataset of custom case class?

Spark 2.0 (final) with Scala 2.11.8. The following super simple code yields the compilation error Error:(17, 45) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported…
clay
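
The two usual causes of this error are a missing import of the session implicits and a case class defined inside the enclosing method. A sketch of the working shape (names are illustrative):

    // Define the case class at top level, not inside a method,
    // so the compiler can materialize a TypeTag for it.
    case class Point(x: Int, y: Int)

    object Demo {
      def main(args: Array[String]): Unit = {
        val spark = org.apache.spark.sql.SparkSession.builder()
          .master("local[*]").getOrCreate()
        import spark.implicits._   // brings Encoders for case classes into scope

        val ds = Seq(Point(1, 2), Point(3, 4)).toDS()
        ds.show()
      }
    }
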
49 votes, 0 answers

Difference between DataSet API and DataFrame API

Can anyone help me understand the difference between the Dataset API and the DataFrame API with an example? Why was there a need to introduce the Dataset API in Spark?
Shashi
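
One way to see the difference: a DataFrame is Dataset[Row] and is checked at runtime, while a typed Dataset is checked at compile time. An illustrative sketch (assuming spark.implicits._ is imported):

    case class Sale(item: String, amount: Double)

    val df = Seq(Sale("a", 1.0)).toDF()   // Dataset[Row]
    val ds = Seq(Sale("a", 1.0)).toDS()   // Dataset[Sale]

    // df.select("amout")   // typo compiles; fails at runtime (AnalysisException)
    // ds.map(_.amout)      // typo does not compile
    ds.map(_.amount * 2)    // field access checked by the compiler
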
43 votes, 4 answers

Encoder error while trying to map dataframe row to updated row

When I'm trying to do the same thing in my code as mentioned below:

    dataframe.map(row => {
      val row1 = row.getAs[String](1)
      val make = if (row1.toLowerCase == "tesla") "S" else row1
      Row(row(0), make, row(2))
    })

I have taken the above reference…
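
The usual fix is to map to a type that already has an implicit encoder (a case class or tuple) rather than to Row. A sketch under assumed column names:

    case class Car(id: String, make: String, model: String)
    import spark.implicits._

    val updated = dataframe.as[Car].map { c =>
      val make = if (c.make.toLowerCase == "tesla") "S" else c.make
      c.copy(make = make)   // yields Dataset[Car]; no Row encoder needed
    }
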
40 votes, 5 answers

Difference between SparkContext, JavaSparkContext, SQLContext, and SparkSession?

What is the difference between SparkContext, JavaSparkContext, SQLContext and SparkSession? Is there any method to convert or create a Context using a SparkSession? Can I completely replace all the Contexts using one single entry SparkSession? Are…
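
In Spark 2.x the SparkSession is the single entry point and exposes the older contexts. A minimal sketch:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("demo")
      .master("local[*]")
      .getOrCreate()

    val sc  = spark.sparkContext   // the underlying SparkContext
    val sql = spark.sqlContext     // legacy SQLContext, retained for compatibility
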
37 votes, 1 answer

DataFrame / Dataset groupBy behaviour/optimization

Suppose we have a DataFrame df consisting of the following columns: Name, Surname, Size, Width, Length, Weight. Now we want to perform a couple of operations, for example we want to create a couple of DataFrames containing data about Size and…
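
The pattern in question looks roughly like this (column names taken from the question; caching is one way, under these assumptions, to avoid recomputing df for each aggregate):

    import org.apache.spark.sql.functions._

    df.cache()   // otherwise each action may recompute and reshuffle df

    val sizeStats  = df.groupBy("Name", "Surname").agg(avg("Size"), max("Size"))
    val widthStats = df.groupBy("Name", "Surname").agg(avg("Width"))
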
35 votes, 2 answers

Encoder for Row Type Spark Datasets

I would like to write an encoder for a Row type in a Dataset, for a map operation that I am doing. Essentially, I do not understand how to write encoders. Below is an example of a map operation: In the example below, instead of returning…
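
In Spark 2.x, an Encoder[Row] can be built from a schema with RowEncoder. A sketch with an assumed two-column schema:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.catalyst.encoders.RowEncoder
    import org.apache.spark.sql.types._

    val schema = StructType(Seq(
      StructField("id", StringType),
      StructField("value", IntegerType)))

    implicit val rowEncoder = RowEncoder(schema)   // Encoder[Row] for this schema

    val mapped = df.map(row => Row(row.getString(0), row.getInt(1) + 1))
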
34 votes, 2 answers

Perform a typed join in Scala with Spark Datasets

I like Spark Datasets as they give me analysis errors and syntax errors at compile time and also allow me to work with getters instead of hard-coded names/numbers. Most computations can be accomplished with Dataset’s high-level APIs. For example,…
Sparky
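
joinWith keeps the compile-time types and yields a Dataset of pairs. A sketch with hypothetical case classes and datasets dsA, dsB:

    import org.apache.spark.sql.Dataset

    case class A(key: Int, a: String)
    case class B(key: Int, b: String)

    val joined: Dataset[(A, B)] =
      dsA.joinWith(dsB, dsA("key") === dsB("key"), "inner")
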
32 votes, 3 answers

spark createOrReplaceTempView vs createGlobalTempView

Spark 2.0's Dataset provides two functions, createOrReplaceTempView and createGlobalTempView. I am not able to understand the basic difference between the two functions. According to the API documents: createOrReplaceTempView: The lifetime of this temporary…
Rahul Sharma
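
The difference in scope can be seen directly. A short sketch (view names are illustrative):

    df.createOrReplaceTempView("people")        // visible in this SparkSession only
    spark.sql("SELECT * FROM people").show()

    df.createGlobalTempView("people_g")         // shared by all sessions of the app
    spark.sql("SELECT * FROM global_temp.people_g").show()
    spark.newSession().sql("SELECT * FROM global_temp.people_g").show()
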
32 votes, 3 answers

Spark 2.0 Dataset vs DataFrame

Starting out with Spark 2.0.1 I have some questions. I read a lot of documentation but so far could not find sufficient answers. What is the difference between df.select("foo") and df.select($"foo")? Do I understand correctly…
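
The short version: both select a column, but $"foo" builds a Column object (via the session implicits), which also enables typed selects. An illustrative sketch:

    import spark.implicits._   // provides the $"..." interpolator

    df.select("foo")               // column addressed by its String name
    df.select($"foo")              // column addressed as a Column object
    ds.select($"foo".as[String])   // typed select: yields Dataset[String]
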
27 votes, 3 answers

Overwrite only some partitions in a partitioned spark Dataset

How can we overwrite a partitioned dataset, but only the partitions we are going to change? For example, recomputing last week's daily job, and only overwriting last week's data. Default Spark behaviour is to overwrite the whole table, even if only…
Madhava Carrillo
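
Since Spark 2.3 this can be done with dynamic partition overwrite. A sketch (the table and dataset names are assumptions):

    // Only the partitions present in newData are replaced
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

    newData.write
      .mode("overwrite")
      .insertInto("my_partitioned_table")
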
27 votes, 4 answers

How to change case of whole column to lowercase?

I want to change the case of a whole column to lowercase in a Spark Dataset. Desired Input:

    +------+--------------------+
    |ItemID|       Category name|
    +------+--------------------+
    |   ABC|BRUSH & BROOM HAN...|
    …
Shreeharsha
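
The built-in lower function handles this. A one-line sketch using the column name from the question:

    import org.apache.spark.sql.functions.{col, lower}

    val result = ds.withColumn("Category name", lower(col("Category name")))
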
25 votes, 3 answers

Spark Dataset API - join

I am trying to use the Spark Dataset API but I am having some issues doing a simple join. Let's say I have two datasets with fields: date | value; then in the case of DataFrames my join would look like: val dfA : DataFrame val dfB :…
mastro
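
One common shape for this join: rename the clashing column and join on the shared key, then re-type if needed (names are illustrative):

    val joined = dsA
      .join(dsB.withColumnRenamed("value", "valueB"), Seq("date"))
      // the result is a DataFrame; .as[...] can restore a typed Dataset

Alternatively, joinWith (shown earlier in this list) preserves the element types end to end.
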
23 votes, 3 answers

How to create a custom Encoder in Spark 2.X Datasets?

Spark Datasets move away from Rows to Encoders for POJOs/primitives. The Catalyst engine uses an ExpressionEncoder to convert columns in a SQL expression. However, there do not appear to be other subclasses of Encoder available to use as a…
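
Besides the derived encoders, the public factory methods on Encoders cover most remaining cases. A sketch (both classes are hypothetical):

    import org.apache.spark.sql.Encoders

    // Java-bean style class: Encoders.bean derives a structured encoder
    class PersonBean extends Serializable {
      @scala.beans.BeanProperty var name: String = _
      @scala.beans.BeanProperty var age: Int = _
    }
    val beanEncoder = Encoders.bean(classOf[PersonBean])

    // Arbitrary class: fall back to binary serialization
    class Opaque(val payload: Array[Byte])
    val kryoEncoder = Encoders.kryo[Opaque]
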