I would like to replicate the Pandas nunique function using Spark SQL and the DataFrame API. So far I have the following (running in a Zeppelin %spark paragraph):
%spark
import org.apache.spark.sql.functions._

// Read the CSV with a header row, using ';' as the delimiter
val df = spark.read
  .format("csv")
  .option("delimiter", ";")
  .option("header", "true") // first line in the file has headers
  .load("target/youtube_videos.csv")

// Distinct count over whole rows, then over a single column
println("Distinct Count: " + df.distinct().count())
val df2 = df.select(countDistinct("likes"))
df2.show(false)
This works: it prints the distinct row count of the whole DataFrame and the distinct count for the likes column, as below:
Distinct Count: 109847
+---------------------+
|count(DISTINCT likes)|
+---------------------+
|27494                |
+---------------------+
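For reference, I believe the SQL equivalent of the single-column aggregation above would be something like this (a sketch; videos is just a temp-view name I made up):

// Register the DataFrame as a temp view so it can be queried with SQL
df.createOrReplaceTempView("videos")
spark.sql("SELECT COUNT(DISTINCT likes) AS likes FROM videos").show(false)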
How can I do this in a single SQL statement so that I get the distinct count of every individual column at once, the way Pandas df.nunique() summarizes all columns?
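In case it helps clarify what I am after, here is an untested sketch of what I imagine the answer might look like, building the SELECT list programmatically from df.columns (it reuses the videos view registered above; the column aliases are my own choice):

// One COUNT(DISTINCT ...) expression per column, assembled into a single SELECT
val selectList = df.columns
  .map(c => s"COUNT(DISTINCT `$c`) AS `$c`")
  .mkString(", ")

// A single statement returning one row with the distinct count of every column
spark.sql(s"SELECT $selectList FROM videos").show(false)

// The pure DataFrame equivalent, if a raw SQL string is not required
df.select(df.columns.map(c => countDistinct(col(c)).alias(c)): _*).show(false)

Is this the idiomatic way, or does Spark offer something closer to a built-in nunique? (As far as I know, COUNT(DISTINCT ...) skips NULLs, which should match Pandas nunique with its default dropna=True.)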