To take advantage of Dataset's optimization, do I have to explicitly use DataFrame's methods (e.g. df.select(col("name"), col("age")), etc.), or would calling any of Dataset's methods, even RDD-like methods (e.g. filter, map, etc.), also allow for optimization?
1 Answer
DataFrame optimization generally comes in three flavors:
- Tungsten memory management
- Catalyst query optimization
- whole-stage code generation
Tungsten memory management
When you define an RDD[MyClass], Spark has no real understanding of what MyClass is, so in general each row holds a full instance of the class.
This causes two problems.
The first is object size: a Java object carries overhead (headers, pointers, padding). For example, take a case class containing two simple integers. A sequence of 1,000,000 instances turned into an RDD takes ~26MB, while the same data as a Dataset/DataFrame takes ~2MB.
In addition, the Dataset/DataFrame memory is not managed by garbage collection (Spark manages it internally as unsafe memory), so it adds far less GC overhead.
A Dataset enjoys the same memory-management advantages as a DataFrame. That said, typed Dataset operations must convert each row from the internal (Row) representation to the case class, and that conversion has a performance cost.
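As a minimal sketch of how you might observe this yourself (assuming a local Spark session; `Pair` is a made-up case class, and the exact sizes will vary by Spark version and JVM):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

case class Pair(a: Int, b: Int)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val data = (1 to 1000000).map(i => Pair(i, i))

// RDD: each row is a full JVM object, with header and pointer overhead.
val rdd = spark.sparkContext.parallelize(data).persist(StorageLevel.MEMORY_ONLY)
rdd.count()

// Dataset: rows are stored in Tungsten's compact binary format.
val ds = data.toDS().persist(StorageLevel.MEMORY_ONLY)
ds.count()

// Compare the two cached sizes under the "Storage" tab of the Spark UI,
// or programmatically via spark.sparkContext.getRDDStorageInfo.
```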
Catalyst query optimization
When you use DataFrame functions, Spark knows what you are trying to do and can sometimes rewrite your query into an equivalent but more efficient one.
Say, for example, you do something like: df.withColumn("a", lit(1)).filter($"b" < ($"a" + 1)).
Essentially you are checking whether (x < 1 + 1). Spark is smart enough to understand this and rewrite it as x < 2.
These kinds of rewrites cannot be applied to typed Dataset operations, because Spark has no visibility into the internals of the functions you pass.
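You can see the difference by comparing the optimized plans of an untyped and a typed filter. A small sketch (assuming an existing SparkSession named `spark`; the sample data is made up):

```scala
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq((1, 1), (2, 5)).toDF("x", "b")

// Untyped: Catalyst sees the whole expression tree and can constant-fold
// the predicate down to (b < 2) in the optimized logical plan.
df.withColumn("a", lit(1)).filter($"b" < ($"a" + 1)).explain(true)

// Typed: the lambda is opaque compiled bytecode to Spark, so the plan
// keeps a generic filter over deserialized objects and no rewrite happens.
df.as[(Int, Int)].filter(t => t._2 < 1 + 1).explain(true)
```

The `explain(true)` output prints the parsed, analyzed, optimized, and physical plans, which is where the folded predicate becomes visible.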
Whole-stage codegen
When Spark knows what you are doing, it can generate specialized code for an entire stage instead of interpreting each operator row by row. In some cases this improves performance by a factor of 10.
Again, this cannot be done for typed Dataset functions, since Spark does not know their internals.
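If you want to inspect the generated code directly, Spark ships a debug helper for it. A sketch (assuming an existing SparkSession named `spark`):

```scala
import org.apache.spark.sql.execution.debug._  // adds debugCodegen() to Dataset
import spark.implicits._

val df = Seq(1, 2, 3).toDF("x").filter($"x" > 1)

// Prints the Java source Spark generated for each whole-stage subtree.
df.debugCodegen()

// In the physical plan from df.explain(), operators fused into one
// generated function appear under a WholeStageCodegen / "*(n)" marker.
df.explain()
```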
