
I have a column name and a DataFrame. I want to check whether all values in that column are empty, and if so, drop the column from the DataFrame.

What I did was check the count of non-null values in the column and drop it if the count equals 0, but that seems like an expensive operation in PySpark.
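For reference, this is roughly what I'm doing now (a minimal sketch; `df` and `col_name` stand in for my actual DataFrame and column name):

```python
from pyspark.sql import functions as F

# Count non-null values in the column; this triggers a scan of the data.
non_null_count = df.filter(F.col(col_name).isNotNull()).count()

# Drop the column if every value is null.
if non_null_count == 0:
    df = df.drop(col_name)
```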

sks27
  • Add what you have done, else you'll be downvoted soon. – Rex5 Aug 09 '19 at 03:27
  • Added what I tried – sks27 Aug 09 '19 at 03:30
  • See if [this](https://stackoverflow.com/questions/44627386/how-to-find-count-of-null-and-nan-values-for-each-column-in-a-pyspark-dataframe?rq=1) or [this one](https://stackoverflow.com/questions/37262762/filter-pyspark-dataframe-column-with-none-value) helps. – Rex5 Aug 09 '19 at 03:33

1 Answer


The way you are doing it is the right way. Regarding performance, you might want to cache your DataFrame (if it fits into memory).
Also consider doing the check on a subset (or even only the first few rows) of your DataFrame first, in order to rule out columns that are definitely not always null. This should reduce the number of columns you have to check on the full data.
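A rough sketch of what I mean, assuming a DataFrame `df` (the sample size and variable names are just placeholders):

```python
from pyspark.sql import functions as F

# Cache the DataFrame so repeated checks don't re-read the source.
df.cache()

# Cheap pre-check on a small sample: any column with a non-null value
# here can be ruled out immediately.
sample = df.limit(1000).collect()
candidates = [c for c in df.columns
              if all(row[c] is None for row in sample)]

# Confirm on the full data only for the remaining candidate columns,
# counting non-null values for all of them in a single pass.
if candidates:
    counts = df.select(
        [F.count(F.col(c)).alias(c) for c in candidates]
    ).first()
    to_drop = [c for c in candidates if counts[c] == 0]
    df = df.drop(*to_drop)
```

Checking all candidate columns in one `select` keeps it to a single job over the full data instead of one count per column.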

Paul