I am refactoring a coworker's PySpark notebook and I noticed that every paragraph starts with the same guard:

```python
if df.count() > 0:
    ...<remainder of paragraph>...
```
where `df` is a DataFrame that is generated once at the beginning of the notebook and never changes. The point of the guard is to skip the paragraph's calculations when there is no actual data present, so they don't raise errors.
This seems needlessly costly to me, but I'm wondering how the expression is actually evaluated when the job runs. I know PySpark can be quite clever, so I'm looking for resources that might indicate whether Spark knows how to take shortcuts here.
For instance, if I were a dumb engine, I would count all of the millions of rows every single time the `.count()` method is called and then check whether that number is greater than 0. If I were a very smart engine, maybe I would count only until I reach the first row, then break and return true; or perhaps I would cache the result so I don't count the exact same DataFrame 30 times in a row.
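To make the question concrete, here is how I imagine those two shortcuts would look if I wrote them by hand. This is my own sketch, not anything Spark is documented to do internally:

```python
# Shortcut 1: stop at the first row. take(1) asks the executors for at
# most one row, so the scan can stop as soon as any partition yields one.
has_rows = len(df.take(1)) > 0

# Shortcut 2: count once and reuse. Materialize the count a single time
# and compare the plain Python integer from then on.
row_count = df.count()
if row_count > 0:
    ...  # rest of the paragraph
```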
I plan on using something like `df_populated = bool(df.head(1))`, computed once and then reused throughout the notebook. I'm certain this is more efficient, but I'd like to know how much better it actually is, given the complexity of Spark's evaluation model.
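Concretely, the refactor would look something like this minimal sketch; the session setup and the `range()` source are placeholders of my own for however the notebook actually builds `df`:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder source; in the real notebook df comes from upstream data.
df = spark.range(100)

# head(1) returns a (possibly empty) list of Rows, so this fires one job
# that fetches at most a single row instead of performing a full count.
df_populated = bool(df.head(1))

# Every paragraph then tests the cached Python bool instead of re-counting.
if df_populated:
    ...  # <remainder of paragraph>
```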
Both methods work well. Time trials indicate that my solution is faster, but I'm looking for the actual mechanism here, not just the empirical result.
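In case it helps, this is the only diagnostic I know how to run; `explain()` prints the physical plan, and I may well be misreading its output:

```python
# Physical plan for the head(1)/take(1)-style check. I would expect some
# kind of limit operator (e.g. CollectLimit) that lets the scan stop
# early, but that expectation is exactly what I am asking about.
df.limit(1).explain()

# count() is an action, so there is no DataFrame to call explain() on
# directly; this aggregation is my best guess at an equivalent plan.
df.groupBy().count().explain()
```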