
I have a PySpark dataframe with several columns. I want to count the occurrences of each distinct value in every column of the dataframe. I can get the counts for a single column with a groupBy query, but I can't figure out how to get this detail for all columns in a single query. I have attached a sample dataframe and the expected output for reference.

The following query gives me the counts, but only for one particular column: DF.groupBy('ColumnName').count()
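The only workaround I can think of is looping over the columns and running a separate aggregation for each one, which is exactly what I want to avoid (a sketch reusing DF from above):

# One groupBy per column -- works, but runs a separate query per column
for c in DF.columns:
    DF.groupBy(c).count().show()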

I appreciate your input on this.
Sample Input dataframe:

[image: sample input dataframe]

Expected Output:
[image: expected output]

  • Does that thread answer your question? https://stackoverflow.com/questions/37949494/how-to-count-occurrences-of-each-distinct-value-for-every-column-in-a-dataframe – Bartosz Gajda Nov 22 '22 at 20:48

1 Answer


Data

df = spark.createDataFrame([
    ('1',    'null', ''),
    ('1',    '',     'null'),
    ('1',    '0',    '0'),
    ('1',    '1',    'null'),
    ('1',    '1',    '0'),
    ('null', '1',    '0'),
    ('',     '1',    ''),
    ('0',    '1',    '1'),
    ('',     '1',    '1')],
    ('Ratings', 'Vote', 'PointsGiven'))

Approach

  1. Create an array of structs, pairing each column name with that column's value
  2. Use the inline function to explode the array into name/value rows
  3. Group by column name, pivot on the value, and count

Code

from pyspark.sql import functions as F

(df.withColumn('tab', F.array(*[F.struct(F.lit(c).alias('g'), F.col(c).alias('v'))
                                for c in df.columns]))  # step 1: name/value structs
   .selectExpr('inline(tab)')                           # step 2: explode to g/v rows
   .groupby('g').pivot('v').agg(F.count('v'))           # step 3: pivot and count
   .show())
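To see what the inline step produces, you can stop before the pivot; each input cell becomes one (g, v) row (reusing F from above):

# Intermediate long format: one row per original cell
mid = df.withColumn('tab', F.array(*[F.struct(F.lit(c).alias('g'), F.col(c).alias('v'))
                                     for c in df.columns]))
mid.selectExpr('inline(tab)').show(6)  # first two input rows become 6 g/v rows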

Outcome

+-----------+---+---+---+----+
|          g|   |  0|  1|null|
+-----------+---+---+---+----+
|       Vote|  1|  1|  6|   1|
|PointsGiven|  2|  3|  2|   2|
|    Ratings|  2|  1|  5|   1|
+-----------+---+---+---+----+
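As a side note, the same long format can also be built with Spark SQL's stack function instead of the array-of-structs trick. A minimal sketch on the same dataframe (the generated expression assumes your column names contain no backticks):

# Builds "stack(3, 'Ratings', `Ratings`, 'Vote', `Vote`, ...) as (g, v)"
stack_expr = "stack({}, {}) as (g, v)".format(
    len(df.columns),
    ", ".join("'{0}', `{0}`".format(c) for c in df.columns))

df.selectExpr(stack_expr).groupby('g').pivot('v').agg(F.count('v')).show()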