7

I have been searching for links, documents, or articles that explain when to choose Datasets over DataFrames, and vice versa.

All I find online are headlines promising to explain when to use a Dataset, but the articles themselves only list the differences between a DataFrame and a Dataset. Many links present a list of differences as if it were a set of usage scenarios.

There is only one question on Stack Overflow with the right title, but the Databricks documentation link in its answer no longer works.

I am looking for information that explains, fundamentally, in which scenarios a Dataset is preferred over a DataFrame and vice versa. If not a full answer, even a link or documentation that helps me understand this would be appreciated.

  • I think this answer could help you, as it provides good context, and the other answers there provide valuable information too: https://stackoverflow.com/a/39033308/1703619 – Miguel May 18 '22 at 12:17
  • Note that the new Databricks Photon engine (https://docs.databricks.com/runtime/photon.html) supports the DataFrame API but not the Dataset API. Moving data to and from Scala objects is an increasingly expensive operation; wherever possible, data manipulation should be done inside the engine, not in code. – Joe Stevens Jun 26 '22 at 08:59

2 Answers

1

The page you are looking for has moved to here. In summary, according to the session, the Dataset API is available only for Scala and Java, and it combines the benefits of both RDDs and DataFrames, which are:

  1. Functional Programming (RDDs)
  2. Type-safe (RDDs)
  3. Relational (DataFrames)
  4. Catalyst query optimization (DataFrames)
  5. Tungsten direct/packed RAM (DataFrames)
  6. JIT code generation (DataFrames)
  7. Sorting/Shuffling without deserializing (DataFrames)

In addition, Datasets consume less memory and can catch analysis errors at compile time, whereas with DataFrames those errors are only caught at runtime. This is also a good article.
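To make the compile-time versus runtime point concrete, here is a minimal Scala sketch. It assumes a local Spark session and a hypothetical `Person` case class; the misspelled column name `agee` is deliberate. None of these names come from the session referenced above.

```scala
import org.apache.spark.sql.{AnalysisException, DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical record type, used only for illustration.
case class Person(name: String, age: Int)

object CompileVsRuntime {
  // Returns true if a misspelled column name is rejected when Spark
  // analyzes the query, i.e. at runtime rather than at compile time.
  def typoFailsAtRuntime(df: DataFrame): Boolean =
    try { df.filter(col("agee") >= 21); false }
    catch { case _: AnalysisException => true }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("compile-vs-runtime")
      .master("local[*]") // assumption: local run for the sketch
      .getOrCreate()
    import spark.implicits._

    val people = Seq(Person("Ann", 34), Person("Bob", 19)).toDS() // Dataset[Person]

    // Typed API: field access is checked by the Scala compiler.
    val adults = people.filter(p => p.age >= 21)
    // people.filter(p => p.agee >= 21) // would not compile: no member `agee` on Person

    // Untyped API: the column name is only a string, so the typo compiles
    // and is reported when Spark analyzes the query.
    println(s"Typo caught at runtime: ${typoFailsAtRuntime(people.toDF())}")

    adults.show()
    spark.stop()
  }
}
```

The commented-out line is the interesting one: with the typed API the mistake never reaches the cluster, while the string-based version has to be run before the error shows up.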

Therefore, you are better off using Datasets when you are coding in Scala or Java, want a functional programming style, and want to save memory while keeping all the DataFrame capabilities.

0

Datasets are preferred over DataFrames in Apache Spark when the data is strongly typed, i.e., when the schema is known ahead of time and the data is not necessarily homogeneous. This is because Datasets enforce type safety, which means that type errors are caught at compile time rather than at runtime. In addition, Datasets can take advantage of the Catalyst optimizer, as DataFrames do, which can lead to more efficient execution. Finally, Datasets can be easily converted to DataFrames, so there is no need to choose between the two upfront.
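As a rough illustration of converting in both directions, the sketch below goes from a DataFrame to a typed Dataset and back. The `Order` case class, the column names, and the `avgLargeOrders` helper are hypothetical and exist only for this example; it assumes the DataFrame's column names and types match the case class.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Encoders, SparkSession}
import org.apache.spark.sql.functions.avg

// Hypothetical record type; its fields must match the DataFrame's columns.
case class Order(customer: String, amount: Double)

object ConvertBothWays {
  // Hypothetical helper: typed filter on a Dataset, then untyped aggregation.
  def avgLargeOrders(df: DataFrame, min: Double): DataFrame = {
    // DataFrame -> Dataset[Order], with an explicit encoder for the case class.
    val orders: Dataset[Order] = df.as[Order](Encoders.product[Order])

    // Typed transformation: the lambda is checked by the compiler.
    val large: Dataset[Order] = orders.filter(_.amount > min)

    // Dataset -> DataFrame, then a relational aggregation on named columns.
    large.toDF().groupBy("customer").agg(avg("amount").as("avg_amount"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("convert-both-ways")
      .master("local[*]") // assumption: local run for the sketch
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("ann", 10.0), ("bob", 25.5), ("ann", 4.5)).toDF("customer", "amount")
    avgLargeOrders(df, 5.0).show()
    spark.stop()
  }
}
```

Because the conversion is a single call either way, a common pattern is to use the typed API where compile-time checks matter and drop back to column expressions for aggregations.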

Carter McKay