I have the following Python/pandas command:
df.groupby('Column_Name').agg(lambda x: x.value_counts().max())
where I am getting the value counts for ALL columns in a DataFrameGroupBy
object.
How do I do this action in PySpark?
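For context, here is a minimal sketch of what that pandas call does on a toy DataFrame (the column names and values below are made up purely for illustration):

import pandas as pd

df = pd.DataFrame({
    'Column_Name': ['a', 'a', 'b', 'b', 'b'],
    'Other': ['x', 'y', 'x', 'x', 'z'],
})

# For each group, take the highest count among the values of every other column
result = df.groupby('Column_Name').agg(lambda x: x.value_counts().max())
print(result)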
It's more or less the same:
spark_df.groupBy('column_name').count().orderBy('count')
In the groupBy you can pass multiple columns separated by commas, for example groupBy('column_1', 'column_2'), as in the sketch below.
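A minimal, self-contained sketch of grouping on two columns and sorting by count (the data and column names are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('groupby_example').getOrCreate()

# Toy data with made-up column names, just for illustration
spark_df = spark.createDataFrame(
    [('a', 'x'), ('a', 'x'), ('a', 'y'), ('b', 'x')],
    ['column_1', 'column_2'],
)

# Count occurrences of each (column_1, column_2) pair, most frequent first
(spark_df
 .groupBy('column_1', 'column_2')
 .count()
 .orderBy('count', ascending=False)
 .show())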
Try this when you want to control the order:
data.groupBy('col_name').count().orderBy('count', ascending=False).show()
Try this:
spark_df.groupBy('column_name').count().show()
from pyspark.sql import SparkSession
from pyspark.sql.functions import count, desc

spark = SparkSession.builder.appName('whatever_name').getOrCreate()
spark_df = spark.read.option('header', True).csv(your_file)

# Count occurrences of each value in Column_Name, most frequent first
value_counts = (spark_df.select('Column_Name')
                .groupBy('Column_Name')
                .agg(count('Column_Name').alias('counts'))
                .orderBy(desc('counts')))
value_counts.show()
Note, though, that Spark is much slower than pandas' value_counts() on a single machine.
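For comparison, the single-machine pandas equivalent of the count above would be something like this (assuming the same CSV file and column name):

import pandas as pd

pandas_df = pd.read_csv(your_file)
print(pandas_df['Column_Name'].value_counts())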
df.groupBy('column_name').count().orderBy('count').show()