Questions tagged [apache-spark-dataset]

Spark Dataset is a strongly typed collection of objects mapped to a relational schema. It supports optimizations similar to those of Spark DataFrames while providing a type-safe programming interface.

950 questions
340 votes, 14 answers

Difference between DataFrame, Dataset, and RDD in Spark

I'm just wondering what the difference is between an RDD and a DataFrame (in Spark 2.0.0, DataFrame is a mere type alias for Dataset[Row]) in Apache Spark. Can you convert one to the other?
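
The conversions asked about are direct. A minimal sketch (the SparkSession name spark and the Person case class are illustrative assumptions):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("demo").master("local[*]").getOrCreate()
    import spark.implicits._

    case class Person(name: String, age: Int)

    val rdd  = spark.sparkContext.parallelize(Seq(Person("Ann", 30)))
    val df   = rdd.toDF()        // RDD -> DataFrame (an alias for Dataset[Row])
    val ds   = df.as[Person]     // DataFrame -> typed Dataset
    val back = ds.rdd            // Dataset -> RDD
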
165 votes, 9 answers

How to store custom objects in Dataset?

According to Introducing Spark Datasets: As we look forward to Spark 2.0, we plan some exciting improvements to Datasets, specifically: ... Custom encoders – while we currently autogenerate encoders for a wide variety of types, we’d like to…
zero323
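
A commonly cited workaround for types without a built-in encoder is a kryo-based binary encoder. A minimal sketch (MyObj is a hypothetical class; spark is an existing SparkSession):

    import org.apache.spark.sql.{Encoder, Encoders}

    class MyObj(val x: Int)   // arbitrary class with no derived encoder

    // Binary kryo encoder: it works, but the stored column is an opaque blob
    implicit val myObjEncoder: Encoder[MyObj] = Encoders.kryo[MyObj]

    val ds = spark.createDataset(Seq(new MyObj(1), new MyObj(2)))
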
66 votes, 3 answers

Why is "Unable to find encoder for type stored in a Dataset" when creating a dataset of custom case class?

Spark 2.0 (final) with Scala 2.11.8. The following super simple code yields the compilation error Error:(17, 45) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported…
clay
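
The two usual causes of this error are a missing import of the session implicits and a case class defined inside the enclosing method. A sketch of the working shape (names are illustrative):

    // Define the case class at top level, not inside a method,
    // so the compiler can materialize a TypeTag for it.
    case class Point(x: Int, y: Int)

    object Demo {
      def main(args: Array[String]): Unit = {
        val spark = org.apache.spark.sql.SparkSession.builder()
          .master("local[*]").getOrCreate()
        import spark.implicits._   // brings Encoders for case classes into scope

        val ds = Seq(Point(1, 2), Point(3, 4)).toDS()
        ds.show()
      }
    }
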
49 votes, 0 answers

Difference between DataSet API and DataFrame API

Can anyone help me understand the difference between the Dataset API and the DataFrame API with an example? Why was there a need to introduce the Dataset API in Spark?
Shashi
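
One way to see the difference: a DataFrame is Dataset[Row] and is checked at runtime, while a typed Dataset is checked at compile time. An illustrative sketch (assuming spark.implicits._ is imported):

    case class Sale(item: String, amount: Double)

    val df = Seq(Sale("a", 1.0)).toDF()   // Dataset[Row]
    val ds = Seq(Sale("a", 1.0)).toDS()   // Dataset[Sale]

    // df.select("amout")   // typo compiles; fails at runtime (AnalysisException)
    // ds.map(_.amout)      // typo does not compile
    ds.map(_.amount * 2)    // field access checked by the compiler
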
43 votes, 4 answers

Encoder error while trying to map dataframe row to updated row

When I'm trying to do the same thing in my code as mentioned below:

    dataframe.map(row => {
      val row1 = row.getAs[String](1)
      val make = if (row1.toLowerCase == "tesla") "S" else row1
      Row(row(0), make, row(2))
    })

I have taken the above reference…
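
The usual fix is to map to a type that already has an implicit encoder (a case class or tuple) rather than to Row. A sketch under assumed column names:

    case class Car(id: String, make: String, model: String)
    import spark.implicits._

    val updated = dataframe.as[Car].map { c =>
      val make = if (c.make.toLowerCase == "tesla") "S" else c.make
      c.copy(make = make)   // yields Dataset[Car]; no Row encoder needed
    }
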
40 votes, 5 answers

Difference between SparkContext, JavaSparkContext, SQLContext, and SparkSession?

What is the difference between SparkContext, JavaSparkContext, SQLContext and SparkSession? Is there any method to convert or create a Context using a SparkSession? Can I completely replace all the Contexts using one single entry SparkSession? Are…
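
In Spark 2.x the SparkSession is the single entry point and exposes the older contexts. A minimal sketch:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("demo")
      .master("local[*]")
      .getOrCreate()

    val sc  = spark.sparkContext   // the underlying SparkContext
    val sql = spark.sqlContext     // legacy SQLContext, retained for compatibility
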
37 votes, 1 answer

DataFrame / Dataset groupBy behaviour/optimization

Suppose we have a DataFrame df consisting of the following columns: Name, Surname, Size, Width, Length, Weight. Now we want to perform a couple of operations, for example we want to create a couple of DataFrames containing data about Size and…
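
The pattern in question looks roughly like this (column names taken from the question; caching is one way, under these assumptions, to avoid recomputing df for each aggregate):

    import org.apache.spark.sql.functions._

    df.cache()   // otherwise each action may recompute and reshuffle df

    val sizeStats  = df.groupBy("Name", "Surname").agg(avg("Size"), max("Size"))
    val widthStats = df.groupBy("Name", "Surname").agg(avg("Width"))
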
35 votes, 2 answers

Encoder for Row Type Spark Datasets

I would like to write an encoder for a Row type in a Dataset, for a map operation that I am doing. Essentially, I do not understand how to write encoders. Below is an example of a map operation: In the example below, instead of returning…
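
In Spark 2.x, an Encoder[Row] can be built from a schema with RowEncoder. A sketch with an assumed two-column schema:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.catalyst.encoders.RowEncoder
    import org.apache.spark.sql.types._

    val schema = StructType(Seq(
      StructField("id", StringType),
      StructField("value", IntegerType)))

    implicit val rowEncoder = RowEncoder(schema)   // Encoder[Row] for this schema

    val mapped = df.map(row => Row(row.getString(0), row.getInt(1) + 1))
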
34 votes, 2 answers

Perform a typed join in Scala with Spark Datasets

I like Spark Datasets as they give me analysis errors and syntax errors at compile time and also allow me to work with getters instead of hard-coded names/numbers. Most computations can be accomplished with Dataset’s high-level APIs. For example,…
Sparky
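
joinWith keeps the compile-time types and yields a Dataset of pairs. A sketch with hypothetical case classes and datasets dsA, dsB:

    import org.apache.spark.sql.Dataset

    case class A(key: Int, a: String)
    case class B(key: Int, b: String)

    val joined: Dataset[(A, B)] =
      dsA.joinWith(dsB, dsA("key") === dsB("key"), "inner")
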
32 votes, 3 answers

spark createOrReplaceTempView vs createGlobalTempView

Spark 2.0's Dataset provides two functions, createOrReplaceTempView and createGlobalTempView. I am not able to understand the basic difference between the two functions. According to the API documents: createOrReplaceTempView: The lifetime of this temporary…
Rahul Sharma
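
The difference in scope can be seen directly. A short sketch (view names are illustrative):

    df.createOrReplaceTempView("people")        // visible in this SparkSession only
    spark.sql("SELECT * FROM people").show()

    df.createGlobalTempView("people_g")         // shared by all sessions of the app
    spark.sql("SELECT * FROM global_temp.people_g").show()
    spark.newSession().sql("SELECT * FROM global_temp.people_g").show()
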
32 votes, 3 answers

Spark 2.0 Dataset vs DataFrame

Starting out with Spark 2.0.1 I have some questions. I read a lot of documentation but so far could not find sufficient answers. What is the difference between df.select("foo") and df.select($"foo")? Do I understand correctly…
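
The short version: both select a column, but $"foo" builds a Column object (via the session implicits), which also enables typed selects. An illustrative sketch:

    import spark.implicits._   // provides the $"..." interpolator

    df.select("foo")               // column addressed by its String name
    df.select($"foo")              // column addressed as a Column object
    ds.select($"foo".as[String])   // typed select: yields Dataset[String]
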
27 votes, 3 answers

Overwrite only some partitions in a partitioned spark Dataset

How can we overwrite a partitioned dataset, but only the partitions we are going to change? For example, recomputing last week's daily job, and only overwriting last week's data. Default Spark behaviour is to overwrite the whole table, even if only…
Madhava Carrillo
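
Since Spark 2.3 this can be done with dynamic partition overwrite. A sketch (the table and dataset names are assumptions):

    // Only the partitions present in newData are replaced
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

    newData.write
      .mode("overwrite")
      .insertInto("my_partitioned_table")
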
27 votes, 4 answers

How to change case of whole column to lowercase?

I want to change the case of a whole column to lowercase in a Spark Dataset. Desired Input:

    +------+--------------------+
    |ItemID|       Category name|
    +------+--------------------+
    |   ABC|BRUSH & BROOM HAN...|
    …
Shreeharsha
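
The built-in lower function handles this. A one-line sketch using the column name from the question:

    import org.apache.spark.sql.functions.{col, lower}

    val result = ds.withColumn("Category name", lower(col("Category name")))
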
25 votes, 3 answers

Spark Dataset API - join

I am trying to use the Spark Dataset API but I am having some issues doing a simple join. Let's say I have two datasets with fields: date | value; then in the case of DataFrames my join would look like: val dfA : DataFrame val dfB :…
mastro
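
One common shape for this join: rename the clashing column and join on the shared key, then re-type if needed (names are illustrative):

    val joined = dsA
      .join(dsB.withColumnRenamed("value", "valueB"), Seq("date"))
      // the result is a DataFrame; .as[...] can restore a typed Dataset

Alternatively, joinWith (shown earlier in this list) preserves the element types end to end.
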
23 votes, 3 answers

How to create a custom Encoder in Spark 2.X Datasets?

Spark Datasets move away from Rows to Encoders for POJOs/primitives. The Catalyst engine uses an ExpressionEncoder to convert columns in a SQL expression. However, there do not appear to be other subclasses of Encoder available to use as a…
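
Besides the derived encoders, the public factory methods on Encoders cover most remaining cases. A sketch (both classes are hypothetical):

    import org.apache.spark.sql.Encoders

    // Java-bean style class: Encoders.bean derives a structured encoder
    class PersonBean extends Serializable {
      @scala.beans.BeanProperty var name: String = _
      @scala.beans.BeanProperty var age: Int = _
    }
    val beanEncoder = Encoders.bean(classOf[PersonBean])

    // Arbitrary class: fall back to binary serialization
    class Opaque(val payload: Array[Byte])
    val kryoEncoder = Encoders.kryo[Opaque]
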