2

I am working in a PySpark Shell with Hive data. The goal here is to gather counters for multiple items. Below I have an example dataframe and data query. A resource I used: Is it possible to specify condition in Count()? but it was for limited Counters.

DriveHealth = sqlContext.sql("Select Health From testdrivestatus")

Health   |
----------
Healthy  
Healthy
Error A
Error B
Error C

The goal here is to create counters that will count:

  1. Number of Drives with Health as Healthy

  2. Number of Drives with Health that is not healthy, so Error A, Error B, and Error C.

  3. Number of Drives with separate health status so a counter for Healthy, Error A, Error B and Error C.

In this case we would have something like this...

Health Counter
--------------
Healthy: 2
Unhealthy: 3
Error A: 1
Error B: 1
Error C: 1

Something I have tried, which works for a small amount of cases but I have over 60 different health statuses, and I was wondering if there was a better way to do this:

DriveHealth = sqlContext.sql("Select 
Count( case Health when 'Healthy' then 1 else null end) as Healthy,
Count( case Health is not 'Healthy' then 1 else null end) as UnHealthy,
Count( case Health when 'Error A' then 1 else null end) as ErrorA,
... (skipping typing Through Error C)
From testdrivestatus
Cœur
  • 37,241
  • 25
  • 195
  • 267
Travis
  • 657
  • 6
  • 24

1 Answers1

1

What you'll want to do is group by your Health column.

select
    count(*) as total,
    Health
from
    testdrivestatus
group by 
    Health
RToyo
  • 2,877
  • 1
  • 15
  • 22
  • Duh! Dull moment for me. Thank you. I will accept this as soon as I can. – Travis Jun 27 '17 at 15:26
  • How would I add the columns that arent healthy? Would I just have to make a seperate case where it isn't Healthy? – Travis Jun 27 '17 at 16:08
  • @TJ72 I don't know which DB engine you're using, but rule of thumb is that any non-aggregate values (ie selecting a column instead of counting/sum/average/etc) should be in your group. So for example, to add an "ErrorCode" column, you could do `SELECT count(*), Healthy, ErrorCode FROM testdrivestatus GROUP BY Health, ErrorCode` – RToyo Jun 27 '17 at 17:02