I'm trying to display a distinct count of a couple different columns in a spark dataframe, and also the record count after grouping the first column.
So if I had col1, col2, and col3, I want to groupBy col1, and then display a distinct count of col2 and also a distinct count of col3.
Then, I would like to display the record count after that same groupBy of col1.
And finally, do this all in one agg statement..
Any ideas?
Asked
Active
Viewed 2,408 times
0

CaroV1x3n
- 135
- 2
- 2
- 10
-
Possible duplicate of [Spark DataFrame: count distinct values of every column](https://stackoverflow.com/questions/40888946/spark-dataframe-count-distinct-values-of-every-column) – Alper t. Turker May 03 '18 at 21:11
1 Answers
1
Below is the code you are looking for
df.groupBy("COL1").agg(countDistinct("COL2"),countDistinct("COL3"),count($"*")).show
=======Tested Below============
scala> val lst = List(("a","x","d"),("b","D","s"),("ss","kk","ll"),("a","y","e"),("b","c","y"),("a","x","y"));
lst: List[(String, String, String)] = List((a,x,d), (b,D,s), (ss,kk,ll), (a,y,e), (b,c,y), (a,x,y))
scala> val rdd=sc.makeRDD(lst);
rdd: org.apache.spark.rdd.RDD[(String, String, String)] = ParallelCollectionRDD[7] at makeRDD at <console>:26
scala> val df = rdd.toDF("COL1","COL2","COL3");
df: org.apache.spark.sql.DataFrame = [COL1: string, COL2: string ... 1 more field]
scala> df.printSchema
root
|-- COL1: string (nullable = true)
|-- COL2: string (nullable = true)
|-- COL3: string (nullable = true)
scala> df.groupBy("COL1").agg(countDistinct("COL2"),countDistinct("COL3"),count($"*")).show
+----+--------------------+--------------------+--------+
|COL1|count(DISTINCT COL2)|count(DISTINCT COL3)|count(1)|
+----+--------------------+--------------------+--------+
| ss| 1| 1| 1|
| b| 2| 2| 2|
| a| 2| 3| 3|
+----+--------------------+--------------------+--------+
scala>

Rajendra Pallala
- 74
- 3
-
Thank you for your reply. Can you explain how the count($"*") does a record count of the first column after it is grouped? – CaroV1x3n May 04 '18 at 14:59
-
Also, would there be a reason why the distinct counts would all be 1 when I do the countDistincts? – CaroV1x3n May 04 '18 at 15:01
-
Count($”*”) means just count(*) as in other rdbms sql like Oracle, this means count all .. it counts all records within the group – Rajendra Pallala May 04 '18 at 15:04
-
count($"*") , count("*") or count("COL1") all three will yield same result.. basically counting all rows in a group – Rajendra Pallala May 05 '18 at 22:52