I have a Spark DataFrame which contains some application usage data. I'm aiming to collect certain metrics from this DataFrame, and then accumulate them together.

For instance, I may want to obtain a total number of users of my product in this DataFrame:

df.select($"user").distinct.count
100500

Then I want to build stats across the different application versions:

df.groupBy("version").count.toJSON.show(false)

+------------------------------------+
|value                               |
+------------------------------------+
|{"version":"1.2.3.4","count":4051}  |
|{"version":"1.2.3.5","count":1}     |
|{"version":"1.2.4.6","count":1}     |
|{"version":"2.0.0.1","count":30433} |
|{"version":"3.1.2.3","count":112195}|
|{"version":"3.1.0.4","count":11457} |
+------------------------------------+

Then I'd like to squash the records in the second DF, so in the end I need to have an object like this:

{ "totalUsers":100500, "versions":[
  {"version":"1.2.3.4","count":4051},
  {"version":"1.2.3.5","count":1},
  {"version":"1.2.4.6","count":1},
  {"version":"2.0.0.1","count":30433},
  {"version":"3.1.2.3","count":112195},
  {"version":"3.1.0.4","count":11457}] }

This object should then be written to another Spark DataFrame.

What could be the right way to implement this?

Disclaimer: I'm quite new to Spark, so I'm sorry if my question is too noobish. I've read plenty of similar questions, including seemingly similar ones like this and this. The latter is close, but it still doesn't give a clue on how to accumulate multiple rows into one object. I wasn't able to work it out from the Apache Spark docs either.

Vasiliy Galkin

1 Answer

Try using the collect_list function, for example:

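from pyspark.sql import functions as F

totalUsers = 100500  # the distinct user count computed earlier
# Gather every JSON row into a single array column, then attach the total
agg = (df.groupBy()
         .agg(F.collect_list("value").alias("versions"))
         .withColumn("totalUsers", F.lit(totalUsers)))
agg.show()  # show() returns None, so call it separately to keep agg as a DataFrame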

Here df is the DataFrame with the aggregated versions (the one with the JSON value column). I get the following result:

+--------------------+----------+
|            versions|totalUsers|
+--------------------+----------+
|[{"version":"1.2....|    100500|
+--------------------+----------+

My example is written in Python, but I believe you can use the same approach in your language.
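For reference, a rough Scala version of the same idea might look like the sketch below. It is a sketch under assumptions rather than a tested solution: the sample rows, the local SparkSession, and the UsageSummary object name are made up for illustration, and only the user and version column names come from the question. It also collects (version, count) structs instead of JSON strings, so that versions comes out as an array of objects rather than an array of quoted strings.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, lit, struct}

// Hypothetical wrapper object, only here to make the sketch self-contained
object UsageSummary {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("usage-summary")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data standing in for the real usage DataFrame
    val df = Seq(
      ("alice", "1.2.3.4"),
      ("bob",   "1.2.3.4"),
      ("alice", "2.0.0.1")
    ).toDF("user", "version")

    // Number of distinct users (a Long on the driver)
    val totalUsers = df.select($"user").distinct.count

    // One row per version, then all rows folded into a single array of structs
    val summary = df.groupBy("version").count()
      .agg(collect_list(struct($"version", $"count")).as("versions"))
      .withColumn("totalUsers", lit(totalUsers))
      .select($"totalUsers", $"versions")

    // summary is a one-row DataFrame; toJSON renders it as a single JSON object
    summary.toJSON.show(false)

    spark.stop()
  }
}
```

Note that collect_list does not guarantee the order of the collected elements, so the versions array may come out in a different order from run to run unless you sort it explicitly.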

statut