
I have a PySpark dataframe with several columns. I want to count the occurrences of each distinct value in every column of the dataframe. I can get the counts for a single column with a groupBy query, but I can't figure out how to get this detail for all columns in a single query. I have attached a sample dataframe and the expected output for reference.

The following query gives me the counts, but only for one particular column: DF.groupBy('ColumnName').count()
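The only workaround I can think of is looping over the columns and running a separate aggregation for each one, which is exactly what I want to avoid (a sketch reusing DF from above):

# One groupBy per column -- works, but runs a separate query per column
for c in DF.columns:
    DF.groupBy(c).count().show()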

I appreciate your input on this.
Sample Input dataframe:

[image: sample input dataframe]

Expected Output:
[image: expected output]

  • Does that thread answer your question? https://stackoverflow.com/questions/37949494/how-to-count-occurrences-of-each-distinct-value-for-every-column-in-a-dataframe – Bartosz Gajda Nov 22 '22 at 20:48

1 Answer


Data

df = spark.createDataFrame([
    ('1',    'null', ''),
    ('1',    '',     'null'),
    ('1',    '0',    '0'),
    ('1',    '1',    'null'),
    ('1',    '1',    '0'),
    ('null', '1',    '0'),
    ('',     '1',    ''),
    ('0',    '1',    '1'),
    ('',     '1',    '1')],
    ('Ratings', 'Vote', 'PointsGiven'))

Approach

  1. Create an array of structs, pairing each column name with that column's value
  2. Use the inline function to explode the array into name/value rows
  3. Group by column name, pivot on the value, and count

Code

from pyspark.sql import functions as F

(df.withColumn('tab', F.array(*[F.struct(F.lit(c).alias('g'), F.col(c).alias('v'))
                                for c in df.columns]))  # step 1: name/value structs
   .selectExpr('inline(tab)')                           # step 2: explode to g/v rows
   .groupby('g').pivot('v').agg(F.count('v'))           # step 3: pivot and count
   .show())
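To see what the inline step produces, you can stop before the pivot; each input cell becomes one (g, v) row (reusing F from above):

# Intermediate long format: one row per original cell
mid = df.withColumn('tab', F.array(*[F.struct(F.lit(c).alias('g'), F.col(c).alias('v'))
                                     for c in df.columns]))
mid.selectExpr('inline(tab)').show(6)  # first two input rows become 6 g/v rows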

Outcome

+-----------+---+---+---+----+
|          g|   |  0|  1|null|
+-----------+---+---+---+----+
|       Vote|  1|  1|  6|   1|
|PointsGiven|  2|  3|  2|   2|
|    Ratings|  2|  1|  5|   1|
+-----------+---+---+---+----+
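As a side note, the same long format can also be built with Spark SQL's stack function instead of the array-of-structs trick. A minimal sketch on the same dataframe (the generated expression assumes your column names contain no backticks):

# Builds "stack(3, 'Ratings', `Ratings`, 'Vote', `Vote`, ...) as (g, v)"
stack_expr = "stack({}, {}) as (g, v)".format(
    len(df.columns),
    ", ".join("'{0}', `{0}`".format(c) for c in df.columns))

df.selectExpr(stack_expr).groupby('g').pivot('v').agg(F.count('v')).show()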