I'm brand new to PySpark (and really to Python as well). I'm trying to count the distinct values in each column (not distinct combinations of columns). I want the answer to this SQL statement:
sqlStatement = "Select Count(Distinct C1) AS C1, Count(Distinct C2) AS C2, ..., Count(Distinct CN) AS CN From myTable"
distinct_count = spark.sql(sqlStatement).collect()
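In case it matters, this is roughly how I build that statement; the read path and format are placeholders, and the column list just comes from the DataFrame schema (there are 400 columns in reality):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("path/to/myTable")  # placeholder source
df.createOrReplaceTempView("myTable")

# Build "Count(Distinct C1) AS C1, ..." over every column in the schema
select_list = ", ".join("Count(Distinct {0}) AS {0}".format(c) for c in df.columns)
sqlStatement = "Select {} From myTable".format(select_list)
distinct_count = spark.sql(sqlStatement).collect()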
That takes forever (16 hours) on an 8-node cluster (see configuration below). The dataset is 100 GB with 400 columns. I don't see a way to do this with DataFrame SQL primitives like:
df.agg(countDistinct('C1', 'C2', ..., 'CN'))
as that would give me the count of distinct combinations of those columns, not a count per column. There must be a way to make this fast.
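The closest thing I can see in the DataFrame API is one countDistinct expression per column, all inside a single agg, something like the sketch below (the read path is again a placeholder), but I don't know whether this would actually run any faster than the SQL version:

from pyspark.sql import SparkSession
from pyspark.sql.functions import countDistinct

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("path/to/myTable")  # placeholder source

# One Count(Distinct ...) expression per column, in a single aggregation,
# rather than one countDistinct over the combination of all columns
exprs = [countDistinct(c).alias(c) for c in df.columns]
row = df.agg(*exprs).collect()[0]
distinct_counts = row.asDict()  # {column name: distinct count}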
Cluster configuration:

Master node (Standard: 1 master, N workers)
  Machine type: n1-highmem-8 (8 vCPU, 52.0 GB memory)
  Primary disk size: 500 GB

Worker nodes: 8
  Machine type: n1-highmem-4 (4 vCPU, 26.0 GB memory)
  Primary disk size: 500 GB
  Local SSDs: 1