Collect rows from spark DataFrame into JSON object, then put the object to another DF

Question

I have a Spark DataFrame which contains some application usage data. I'm aiming to collect certain metrics from this DataFrame, and then accumulate them together.

For instance, I may want to obtain a total number of users of my product in this DataFrame:

df.select($"user").count.distinct
100500

And then I want to builds stats across different application versions

df.groupBy("version").count.toJSON.show(false)

+-----------------------------------------+
|value                                    |
+-----------------------------------------+
|{"version":"1.2.3.4","count":4051}  |
|{"version":"1.2.3.5","count":1}     |
|{"version":"1.2.4.6","count":1}     |
|{"version":"2.0.0.1","count":30433} |
|{"version":"3.1.2.3","count":112195}|
|{"version":"3.1.0.4","count":11457} |
+-----------------------------------------+

Then I'd like to squash the records in the second DF, so in the end I need to have an object like this:

{ "totalUsers":100500, "versions":[
  {"version":"1.2.3.4","count":4051},
  {"version":"1.2.3.5","count":1},
  {"version":"1.2.4.6","count":1},
  {"version":"2.0.0.1","count":30433},
  {"version":"3.1.2.3","count":112195},
  {"version":"3.1.0.4","count":11457}] }

Then this object shall be written to another spark DF.

What could be the right way to implement this?

Disclaimer: I'm quite new to spark, so I'm sorry if my question is too noobish. I've read a plenty of similar questions, including seemingly similar ones like this and this. The latter is close, but still doesn't give a clue on how to accumulate multiple rows into one object. Neither was I able to understand it from the Apache Spark docs.

So, you want to end up with single instance of object `{ "totalUsers":100500, "versions":[. ...}` ??? — Thang Nguyen, Mar 20 '18 at 09:23
@VasiliyGalkin please see here https://stackoverflow.com/questions/46482058/apache-spark-concatenate-multiple-rows-into-list-in-single-row — statut, Mar 20 '18 at 12:05
@statut Thanks for the link. However, it explains how to concatenate data between multiple columns, while I'm looking for rows concatenation. — Vasiliy Galkin, Mar 20 '18 at 13:06
@VasiliyGalkin can you post the original dataframe (df) ? some sample lines would help us test and give you correct solution — Ramesh Maharjan, Mar 20 '18 at 13:28

statut · Answer 1 · 2018-03-20T13:40:13.660

Try to use collect_list function, for example:

from pyspark.sql import functions as F
from pyspark.sql.functions import lit
totalUsers = 100500
agg = df.groupBy().agg(F.collect_list("value").alias('versions')).withColumn("totalUsers", lit(totalUsers)).show()

Where df is data frame with aggregated versions. I get the following result:

+--------------------+----------+
|            versions|totalUsers|
+--------------------+----------+
|[{"version":"1.2....|    100500|
+--------------------+----------+

My example is written in Python but I believe the same approach you can use for your language.

Collect rows from spark DataFrame into JSON object, then put the object to another DF

1 Answers1