PySpark: counting the number of occurrences of each distinct value

Question

I think the question is related to: Spark DataFrame: count distinct values of every column

I have a Spark DataFrame whose column A contains the values 1, 1, 2, 2, 1.

I want to count how many times each distinct value (here, 1 and 2) appears in column A and print something like:

distinct_values | number_of_appearances
1 | 3
2 | 2

cronoik · Accepted Answer · 2018-12-06T12:56:37.490

5

I'm posting this because I think the other answer, with its aliases, could be confusing. All you need are the groupBy and count methods:

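from pyspark.sql.types import IntegerType

# Sample data: the values of column A from the question
l = [1, 1, 2, 2, 1]

# `spark` is the SparkSession (predefined in the pyspark shell).
# With a bare IntegerType schema, the single column is named 'value'.
df = spark.createDataFrame(l, IntegerType())

# Group rows by distinct value and count the rows in each group
df.groupBy('value').count().show()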

+-----+-----+
|value|count|
+-----+-----+
|    1|    3|
|    2|    2|
+-----+-----+
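
As a quick sanity check on the expected numbers, the same aggregation can be sketched in plain Python with collections.Counter. This is only an illustration for a small in-memory list, not a replacement for the Spark job; the variable names (values, counts) are made up for the example.

```python
from collections import Counter

# Same sample values as column A in the question.
values = [1, 1, 2, 2, 1]

# Counter maps each distinct value to the number of times it occurs,
# which is the same tally that groupBy('value').count() produces.
counts = Counter(values)

# Print in the layout asked for in the question, smallest value first.
print("distinct_values | number_of_appearances")
for value, n in sorted(counts.items()):
    print(f"{value} | {n}")
```

If the Spark result and this tally disagree on a small sample, the input data or the grouping column is the first thing to check.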

edited Dec 06 '18 at 12:56

answered Dec 06 '18 at 12:49


I was under the impression that the printed data should have user-defined names, which is why I made it a little more complex. Anyhow, both will work. Thanks – vikrant rana Dec 06 '18 at 15:08
@vikrantrana Exactly. This method is simpler, but both work. Many thanks. – mommomonthewind Dec 07 '18 at 01:10

score 3 · Answer 2 · answered Dec 06 '18 at 12:37

I am not sure whether this is the solution you are looking for, but here are my thoughts. Suppose you have a DataFrame like this:

>>> listA = [(1,'AAA','USA'),(2,'XXX','CHN'),(3,'KKK','USA'),(4,'PPP','USA'),(5,'EEE','USA'),(5,'HHH','THA')]
>>> df = spark.createDataFrame(listA, ['id', 'name','country'])

>>> df.show()
+---+----+-------+
| id|name|country|
+---+----+-------+
|  1| AAA|    USA|
|  2| XXX|    CHN|
|  3| KKK|    USA|
|  4| PPP|    USA|
|  5| EEE|    USA|
|  5| HHH|    THA|
+---+----+-------+

I want to know which distinct country codes appear in this DataFrame and how often each one occurs, with the result columns printed under alias names.

import pyspark.sql.functions as func
# Count rows per country, then rename both output columns with aliases
df.groupBy('country').count().select(
    func.col("country").alias("distinct_country"),
    func.col("count").alias("country_count")).show()

+----------------+-------------+
|distinct_country|country_count|
+----------------+-------------+
|             THA|            1|
|             USA|            4|
|             CHN|            1|
+----------------+-------------+

Were you looking for something similar to this?
