I am working in a PySpark shell with Hive data. The goal is to gather counters for multiple items. Below is an example DataFrame and the query that produces it. A resource I used is Is it possible to specify condition in Count()?, but it only covers a limited number of counters.
DriveHealth = sqlContext.sql("Select Health From testdrivestatus")
Health
----------
Healthy
Healthy
Error A
Error B
Error C
The goal here is to create counters that will count:
Number of Drives with Health as Healthy
Number of Drives with a Health that is not Healthy, i.e. Error A, Error B, and Error C combined.
Number of Drives per individual health status, i.e. a separate counter for each of Healthy, Error A, Error B, and Error C (a groupBy sketch for this piece follows the example output below).
In this case we would have something like this...
Health Counter
--------------
Healthy: 2
Unhealthy: 3
Error A: 1
Error B: 1
Error C: 1
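For the per-status counters, the plain group-by I am picturing is sketched below. This assumes the same testdrivestatus table and Health column as above; it gives one row per status but not the combined Unhealthy total:

# Sketch: one count per distinct Health value (assumes the table above).
per_status = sqlContext.sql("Select Health From testdrivestatus") \
    .groupBy("Health") \
    .count()
per_status.show()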
Here is something I have tried. It works for a small number of cases, but I have over 60 different health statuses, so I was wondering if there is a better way to do this:
DriveHealth = sqlContext.sql("Select
Count( case Health when 'Healthy' then 1 else null end) as Healthy,
Count( case Health is not 'Healthy' then 1 else null end) as UnHealthy,
Count( case Health when 'Error A' then 1 else null end) as ErrorA,
... (skipping typing Through Error C)
From testdrivestatus
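One idea I have been considering, to avoid typing all 60+ cases, is to generate the conditional counts in a loop with the DataFrame API instead of writing the SQL by hand. This is only a rough sketch, assuming the drives can be read straight from the testdrivestatus table:

from pyspark.sql import functions as F

drives = sqlContext.table("testdrivestatus")

# Collect the distinct statuses so each one gets its own conditional count.
statuses = [row["Health"] for row in drives.select("Health").distinct().collect()]

# count() skips nulls, so count(when(cond, 1)) behaves like the CASE version above.
exprs = [F.count(F.when(F.col("Health") == s, 1)).alias(s) for s in statuses]
exprs.append(F.count(F.when(F.col("Health") != "Healthy", 1)).alias("UnHealthy"))

counters = drives.agg(*exprs)
counters.show()

I am not sure whether collecting the distinct statuses first and then running a second aggregation is good practice, which is partly why I am asking if there is a better way.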