I have a Spark DataFrame which contains application usage data. I'm aiming to collect certain metrics from this DataFrame and then accumulate them into a single result.
For instance, I may want to obtain the total number of distinct users of my product in this DataFrame:
df.select($"user").count.distinct
100500
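As far as I understand, the same number can be computed with countDistinct (a minimal sketch, assuming the usual spark.implicits._ are in scope):

import org.apache.spark.sql.functions.countDistinct

// returns a one-row DataFrame; pull the single long value out of it
df.select(countDistinct($"user")).first.getLong(0)
// 100500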
Then I want to build stats across the different application versions:
df.groupBy("version").count.toJSON.show(false)
+------------------------------------+
|value                               |
+------------------------------------+
|{"version":"1.2.3.4","count":4051}  |
|{"version":"1.2.3.5","count":1}     |
|{"version":"1.2.4.6","count":1}     |
|{"version":"2.0.0.1","count":30433} |
|{"version":"3.1.2.3","count":112195}|
|{"version":"3.1.0.4","count":11457} |
+------------------------------------+
Then I'd like to squash the records of the second DataFrame into a single object, so in the end I have something like this:
{ "totalUsers":100500, "versions":[
{"version":"1.2.3.4","count":4051},
{"version":"1.2.3.5","count":1},
{"version":"1.2.4.6","count":1},
{"version":"2.0.0.1","count":30433},
{"version":"3.1.2.3","count":112195},
{"version":"3.1.0.4","count":11457}] }
This object should then be written to another Spark DataFrame.
What could be the right way to implement this?
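For reference, the closest direction I've found so far is collecting the per-version rows into an array column with collect_list and struct, and attaching the scalar metric as a literal column (just a rough sketch, assuming those are the right tools here):

import org.apache.spark.sql.functions.{collect_list, lit, struct}

// first metric: total distinct users
val totalUsers = df.select($"user").distinct.count

// collapse all per-version rows into a single row holding an array of structs,
// then attach the scalar metric as a literal column
val summary = df.groupBy("version").count
  .agg(collect_list(struct($"version", $"count")).as("versions"))
  .withColumn("totalUsers", lit(totalUsers))

summary.toJSON.show(false)
// {"versions":[{"version":"1.2.3.4","count":4051},...],"totalUsers":100500}

If that's on the right track, summary would already be the "another DataFrame" with a single row, but I'm not sure whether collapsing everything into one row like this is idiomatic, or how it would behave on large data.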
Disclaimer: I'm quite new to Spark, so I'm sorry if my question is too noobish. I've read plenty of similar questions, including seemingly similar ones like this and this. The latter is close, but still doesn't give a clue on how to accumulate multiple rows into one object. Nor was I able to work it out from the Apache Spark docs.