I would like to replicate the Pandas nunique function using Spark SQL and the DataFrame API. So far I have the following (running in a Zeppelin %spark paragraph):
%spark
import org.apache.spark.sql.functions._

// Read the CSV with a header row, using ';' as the delimiter
val df = spark.read
  .format("csv")
  .option("delimiter", ";")
  .option("header", "true") // first line in the file has headers
  .load("target/youtube_videos.csv")

// Distinct count over whole rows, then over a single column
println("Distinct Count: " + df.distinct().count())
val df2 = df.select(countDistinct("likes"))
df2.show(false)
This works: it prints the distinct row count of the whole DataFrame and the distinct count for the likes column, as below:
Distinct Count: 109847
+---------------------+
|count(DISTINCT likes)|
+---------------------+
|27494                |
+---------------------+
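For reference, I believe the SQL equivalent of the single-column aggregation above would be something like this (a sketch; videos is just a temp-view name I made up):

// Register the DataFrame as a temp view so it can be queried with SQL
df.createOrReplaceTempView("videos")
spark.sql("SELECT COUNT(DISTINCT likes) AS likes FROM videos").show(false)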
How can I do this in a single SQL statement so that I get the distinct count of every individual column at once, the way Pandas df.nunique() summarizes all columns?
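In case it helps clarify what I am after, here is an untested sketch of what I imagine the answer might look like, building the SELECT list programmatically from df.columns (it reuses the videos view registered above; the column aliases are my own choice):

// One COUNT(DISTINCT ...) expression per column, assembled into a single SELECT
val selectList = df.columns
  .map(c => s"COUNT(DISTINCT `$c`) AS `$c`")
  .mkString(", ")

// A single statement returning one row with the distinct count of every column
spark.sql(s"SELECT $selectList FROM videos").show(false)

// The pure DataFrame equivalent, if a raw SQL string is not required
df.select(df.columns.map(c => countDistinct(col(c)).alias(c)): _*).show(false)

Is this the idiomatic way, or does Spark offer something closer to a built-in nunique? (As far as I know, COUNT(DISTINCT ...) skips NULLs, which should match Pandas nunique with its default dropna=True.)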