
I have data with a large number of custom columns, the content of which I poorly understand. The columns are named evar1 to evar250. What I'd like to get is a single table with all distinct values, a count of how often each occurs, and the name of the column it came from.

------------------------------------------------ 
| columnname | value                 | count   |
|------------|-----------------------|---------|
| evar1      | en-GB                 | 7654321 |
| evar1      | en-US                 | 1234567 |
| evar2      | www.myclient.com      |     123 |
| evar2      | app.myclient.com      |     456 |
| ... 

The best way I can think of doing this feels terrible, as I believe I have to read the data once per column (there are actually about 400 such columns).

from pyspark.sql import functions as fn

i = 1
df_evars = None
while i <= 30:
  colname = "evar" + str(i)
  # One full pass over the data per column: count rows per distinct value
  df_temp = df.groupBy(colname).agg(fn.count("*").alias("rows"))\
    .withColumn("colName", fn.lit(colname))
  if df_evars is not None:
    df_evars = df_evars.union(df_temp)
  else:
    df_evars = df_temp
  i += 1

display(df_evars)

Am I missing a better solution?

Update

This has been marked as a duplicate, but IMO the two responses there only solve part of my question.

I am looking at potentially very wide tables with potentially a large number of values. I need a simple result (i.e. three columns that show the source column, the value, and the count of that value in the source column).

The first of the responses only gives me an approximation of the number of distinct values, which is useless to me.
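If I read it right, that response boils down to something like the following (my assumption of what it suggests), which yields one approximate cardinality per column rather than per-value counts:

from pyspark.sql import functions as fn

# One approximate distinct count per column -- not the per-value counts I need
df.agg(*[fn.approx_count_distinct(c).alias(c) for c in df.columns]).show()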

The second response seems less relevant than the first. To clarify, source data like this:

----------------------- 
| evar1 | evar2 | ... |
|-------|-------|-----|
| A     | A     | ... |
| B     | A     | ... |
| B     | B     | ... |
| B     | B     | ... |
| ... 

should result in the following output:

--------------------------------
| columnname | value | count   |
|------------|-------|---------|
| evar1      | A     | 1       |
| evar1      | B     | 3       |
| evar2      | A     | 2       |
| evar2      | B     | 2       |
| ... 

1 Answer


Using melt borrowed from here:

from pyspark.sql.functions import col

# Cast every column to string so they can share a single value column,
# unpivot with melt, then count occurrences of each (column, value) pair
melt(
    df.select([col(c).cast("string") for c in df.columns]),
    id_vars=[], value_vars=df.columns
).groupBy("variable", "value").count()

Adapted from the answer by user6910411.
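For reference, the melt helper used above is, roughly, the usual explode-an-array-of-structs unpivot; a sketch (argument names assumed to match the snippet above):

from pyspark.sql import DataFrame
from pyspark.sql.functions import array, col, explode, lit, struct

def melt(df: DataFrame, id_vars, value_vars,
         var_name="variable", value_name="value") -> DataFrame:
    # Build an array of (column name, column value) structs, one per value column
    vars_and_vals = array(*(
        struct(lit(c).alias(var_name), col(c).alias(value_name))
        for c in value_vars
    ))
    # explode turns that array into one row per (column, value) pair
    tmp = df.withColumn("vars_and_vals", explode(vars_and_vals))
    cols = list(id_vars) + [
        col("vars_and_vals")[x].alias(x) for x in (var_name, value_name)
    ]
    return tmp.select(*cols)

With value_vars=df.columns this reads the data once, rather than once per column as in the loop above.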
