Should the DataFrame function groupBy be avoided?

Question

This link and others tell me that the Spark groupByKey is not to be used if there is a large number of keys, since Spark shuffles all the keys around. Does the same apply to the groupBy function as well? Or is this something different?

I'm asking this because I want to do what this question tries to do, but I have a very large number of keys. It should be possible to do this without shuffling all the data around by reducing on each node locally, but I can't find the PySpark way to do this (frankly, I find the documentation quite lacking).

Essentially, I am trying to do is:

# Non-working pseudocode
df.groupBy("A").reduce(lambda x,y: if (x.TotalValue > y.TotalValue) x else y)

However, the dataframe API does not offer a "reduce" option. I'm probably misunderstanding what exactly dataframe is trying to achieve.

Shaido · Answer 1 · 2018-02-21T05:30:58.453

4

A DataFrame groupBy followed by an agg will not move the data around unnecessarily, see here for a good example. Hence, there is no need to avoid it.

When using the RDD API, the opposite is true. Here it is preferable to avoid groupByKey and use a reducebyKey or combineByKey where possible. Some situations, however, do require one to use groupByKey.

The normal way to do this type of operation with the DataFrame API is to use groupBy followed by an aggregation using agg. In your example case, you want to find the maximum value for a single column for each group, this can be achived by the max function:

from pyspark.sql import functions as F

joined_df.groupBy("A").agg(F.max("TotalValue").alias("MaxValue"))

In addition to max there are a multitude of functions that can be used in combination with agg, see here for all operations.

edited Feb 21 '18 at 05:30

answered Feb 20 '18 at 02:28

Shaido

27,497
23
70
73

The problem with aggregation is that it throws away all columns except A and MaxValue. I need to keep the original columns. Effectively, this is supposed to be a complex filtering operation. – Thomas Feb 20 '18 at 08:08
1

@Thomas: `groupBy` with `agg` will not move any data unnecessarily, see for example the question [here](https://stackoverflow.com/questions/32902982/dataframe-dataset-groupby-behaviour-optimization). If you need to keep the other columns you could try creating a struct following the third alternative in the linked question answer (make sure that the column to use as comparator is the first on in the struct). – Shaido Feb 20 '18 at 08:43

score 0 · Answer 2 · answered Feb 19 '18 at 17:05

The documentation is pretty all over the place.

There has been a lot of optimization work for dataframes. Dataframes has additional information about the structure of your data, which helps with this. I often find that many people recommend dataframes over RDDs due to "increased optimization."

There is a lot of heavy wizardry behind the scenes.

I recommend that you try "groupBy" on both RDDs and dataframes on large datasets and compare the results. Sometimes, you may need to just do it.

Also, for performance improvements, I suggest fiddling (through trial and error) with:

the spark configurations Doc
shuffle.partitions Doc

Should the DataFrame function groupBy be avoided?

2 Answers2

Linked