`countDistinct` is probably the first choice:
```scala
import org.apache.spark.sql.functions.countDistinct

df.agg(countDistinct("some_column"))
```
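If you need the result as a plain Scala value rather than a one-row DataFrame, you can extract it from the aggregation. A minimal, self-contained sketch (the session setup and sample data here are made up for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.countDistinct

val spark = SparkSession.builder().appName("distinct-count").master("local[*]").getOrCreate()
import spark.implicits._

// Toy data standing in for your real "some_column".
val df = Seq("a", "b", "a", "c", "b").toDF("some_column")

// agg returns a single-row DataFrame; pull the count out as a Long.
val n = df.agg(countDistinct("some_column")).first().getLong(0)
println(n) // 3
```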
If speed is more important than accuracy, you may consider `approx_count_distinct` (`approxCountDistinct` in Spark 1.x):
```scala
import org.apache.spark.sql.functions.approx_count_distinct

df.agg(approx_count_distinct("some_column"))
```
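Under the hood `approx_count_distinct` uses HyperLogLog++, and it takes an optional maximum relative standard deviation, so the speed/accuracy trade-off can be tuned explicitly. For example:

```scala
import org.apache.spark.sql.functions.approx_count_distinct

// Allow at most 1% relative standard deviation (the default is 0.05).
// A tighter rsd costs more memory for the underlying HLL++ sketch.
df.agg(approx_count_distinct("some_column", 0.01))
```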
To get values and counts:
```scala
df.groupBy("some_column").count()
```
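The grouped result is an ordinary DataFrame, so it can, for example, be sorted to show the most frequent values first:

```scala
import org.apache.spark.sql.functions.desc

// One row per distinct value with its frequency, most common first.
df.groupBy("some_column").count().orderBy(desc("count")).show()
```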
In SQL (`spark-sql`):
```sql
SELECT COUNT(DISTINCT some_column) FROM df
```

and

```sql
SELECT approx_count_distinct(some_column) FROM df
```
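These queries assume `df` is visible to SQL as a table or view; if you are starting from a DataFrame, you can register it first (assuming a `SparkSession` named `spark`):

```scala
// Expose the DataFrame to SQL under the name "df".
df.createOrReplaceTempView("df")

spark.sql("SELECT COUNT(DISTINCT some_column) FROM df").show()
spark.sql("SELECT approx_count_distinct(some_column) FROM df").show()
```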