
I'm working with Datasets and trying to group by and then use map.

I manage to do this with RDDs, but with Datasets, after groupBy I don't have the option to use map.

Is there a way I can do it?

10465355
  • Welcome to Stackoverflow. Before posting a question I encourage you to look if someone already asked the same. Regarding this topic, please refer to [this question](https://stackoverflow.com/questions/38796520/spark-converting-a-dataset-to-rdd) – Tizianoreica Jan 23 '19 at 10:34
  • 1
    Possible duplicate of [Spark converting a Dataset to RDD](https://stackoverflow.com/questions/38796520/spark-converting-a-dataset-to-rdd) – Tizianoreica Jan 23 '19 at 10:35
  • 2
    hey Antonio, i am not looking to use RDD's. i want to do this in dataset in order to have the optimizer benefits – Avshalom Orenstein Jan 23 '19 at 11:53
  • @AvshalomOrenstein You won't get any optimizer benefits here. See [DataFrame / Dataset groupBy behaviour/optimization](https://stackoverflow.com/q/32902982/6910411) and [Spark 2.0 Dataset vs DataFrame](https://stackoverflow.com/q/40596638/6910411) and the docs quoted in the accepted answer. – zero323 Jan 23 '19 at 12:04
  • @AntonioCalì So what is the best approach if I want to use Datasets and optimize the performance? – Avshalom Orenstein Jan 23 '19 at 12:37

1 Answer


You can apply groupByKey:

def groupByKey[K](func: (T) ⇒ K)(implicit arg0: Encoder[K]): KeyValueGroupedDataset[K, T]

(Scala-specific) Returns a KeyValueGroupedDataset where the data is grouped by the given key func.

which returns KeyValueGroupedDataset and then mapGroups:

def mapGroups[U](f: (K, Iterator[V]) ⇒ U)(implicit arg0: Encoder[U]): Dataset[U]

(Scala-specific) Applies the given function to each group of data. For each unique group, the function will be passed the group key and an iterator that contains all of the elements in the group. The function can return an element of arbitrary type which will be returned as a new Dataset.

This function does not support partial aggregation, and as a result requires shuffling all the data in the Dataset. If an application intends to perform an aggregation over each key, it is best to use the reduce function or an org.apache.spark.sql.expressions#Aggregator.

Internally, the implementation will spill to disk if any given group is too large to fit into memory. However, users must take care to avoid materializing the whole iterator for a group (for example, by calling toList) unless they are sure that this is possible given the memory constraints of their cluster.
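As a minimal sketch of how the two calls fit together (the `Sale` case class and its fields are made up purely for illustration), grouping a Dataset by a key and then mapping over each group might look like this:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type, used only to illustrate groupByKey + mapGroups.
case class Sale(shop: String, amount: Double)

object GroupByKeyExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("groupByKey-mapGroups-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val sales = Seq(
      Sale("a", 10.0), Sale("a", 5.0), Sale("b", 7.5)
    ).toDS()

    // groupByKey derives a key from each record and returns a
    // KeyValueGroupedDataset[String, Sale]; mapGroups then receives the key
    // and an iterator over that group's elements and returns a Dataset.
    val totals = sales
      .groupByKey(_.shop)
      .mapGroups { (shop, rows) =>
        (shop, rows.map(_.amount).sum)
      }

    totals.show()
    spark.stop()
  }
}
```

Note that for a simple per-key aggregation like this sum, the documentation quoted above recommends `reduceGroups` or an `Aggregator` instead, since `mapGroups` does not support partial aggregation and shuffles all the data.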
