
I'm brand new to PySpark (and really Python as well). I'm trying to count the distinct values in each column (not distinct combinations of columns). I want the answer to this SQL statement:

sqlStatement = "Select Count(Distinct C1) AS C1, Count(Distinct C2) AS C2, ..., Count(Distinct CN) AS CN From myTable"

distinct_count = spark.sql(sqlStatement).collect()

That takes forever (16 hours) on an 8-node cluster (see configuration below). The dataset is 100 GB with 400 columns. I am not seeing a way to use DataFrame SQL primitives like:

df.agg(countDistinct('C1', 'C2', ..., 'CN'))

as that will again give me distinct combinations rather than per-column counts. There must be a way to make this fast.
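For reference, the per-column equivalent with DataFrame primitives would need one countDistinct per column. A minimal sketch, assuming df is the DataFrame behind myTable; I expect this plan to be just as slow as the SQL version:

from pyspark.sql import functions as F

# One countDistinct per column, aliased to the column name,
# so the result is a single row of per-column distinct counts.
exprs = [F.countDistinct(c).alias(c) for c in df.columns]
distinct_counts = df.agg(*exprs).collect()[0]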


Master node: Standard (1 master, N workers)
  Machine type: n1-highmem-8 (8 vCPU, 52.0 GB memory)
  Primary disk size: 500 GB
Worker nodes: 8
  Machine type: n1-highmem-4 (4 vCPU, 26.0 GB memory)
  Primary disk size: 500 GB
  Local SSDs: 1

  • `approx_count_distinct` as [here](https://stackoverflow.com/a/40889920/9613318)? `Count(Distinct x)` is __really__ bad. (See the sketch after these comments.) – Alper t. Turker May 14 '18 at 21:48
  • Thanks. That is a helpful post. I did a lot of googling and for some reason that post did not come up! – breakingduck May 14 '18 at 21:58
  • The other thing I'm trying to do is run a describe() to get all the basic stats. That was also slow. It was 2 hours out of my 16. Any help on speeding up describe() would also be appreciated. – breakingduck May 14 '18 at 22:01
  • Maybe you could benefit from using `cache()`, transformations will be recomputed if you perform multiple actions on the dataframe. – Shaido Jun 28 '18 at 02:09
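A minimal sketch of the approx_count_distinct approach suggested in the first comment, assuming df is the DataFrame behind myTable. It folds in the cache() suggestion as well; rsd is the maximum allowed relative standard deviation of the HyperLogLog++ estimate:

from pyspark.sql import functions as F

# Cache so that later actions (e.g. describe()) reuse the data
# instead of recomputing the input.
df.cache()

# Approximate distinct counts for all columns in a single pass;
# rsd=0.05 allows up to 5% relative standard deviation per estimate.
exprs = [F.approx_count_distinct(c, rsd=0.05).alias(c) for c in df.columns]
approx_counts = df.agg(*exprs).collect()[0]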

1 Answer


Note that you are using the .collect() method, which returns all elements of the dataset to the driver, and this may cause the driver to run out of memory.

You can see what plan Spark will execute by running .explain() on your query:

myquery = spark.sql(sqlStatement)
myquery.explain()

You could ease the problem by splitting the query into several smaller queries so that you do not compute the distinct count of every column at once. This reduces the amount of data processed per job.
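For example, a sketch that processes the columns in batches, assuming df is the DataFrame behind myTable and an arbitrary batch size of 50:

from pyspark.sql import functions as F

# Compute distinct counts for a subset of columns per job,
# accumulating the results in a dict keyed by column name.
batch_size = 50  # arbitrary; tune for your cluster
results = {}
for i in range(0, len(df.columns), batch_size):
    batch = df.columns[i:i + batch_size]
    row = df.agg(*[F.countDistinct(c).alias(c) for c in batch]).collect()[0]
    results.update(row.asDict())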

– sr9yar