
I am refactoring a coworker's PySpark notebook and I noticed that every paragraph starts with the following line:

    if df.count() > 0:
        ...<remainder of paragraph>...

where df is a constant dataframe generated at the beginning of the notebook. The line essentially acts as a guard, preventing errors in the downstream calculations when there is no actual data present.

This seems needlessly costly to me, but I'm wondering how the expression is actually evaluated under the hood. I know PySpark can be quite clever, so I'm looking for resources that might indicate whether PySpark knows how to take shortcuts here.

For instance, if I were a dumb compiler, I would count millions of entries every single time the .count() method is called and then check whether that number is greater than 0. If I were a very smart compiler, maybe I would count only until I reach 1 and then break and return true, or perhaps I would cache the result so I don't count the exact same dataframe 30 times in a row.
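
To make that concrete, here is roughly the kind of shortcut I'm imagining, written out by hand. This is just a sketch of behaviour I'm hoping the engine approximates, not something I've confirmed it does; it assumes the df defined at the top of the notebook:

    # "Count until I reach 1": ask Spark for at most one row instead of
    # scanning and counting every partition.
    has_rows = df.limit(1).count() > 0

    # "Cache the result": evaluate the check once and reuse the plain Python
    # boolean, so the same dataframe is never re-counted 30 times in a row.
    df_has_rows = df.count() > 0

    if df_has_rows:
        pass  # ...calculations for one paragraph...

    if df_has_rows:
        pass  # ...calculations for another paragraph...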

I plan on using something like df_populated = bool(df.head(1)), computed once and referenced throughout the notebook. I'm certain this is more efficient, but I'd like to know how much better it actually is, given the complexity of the PySpark interpreter.
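
Sketched out, the refactor would look something like this (df_populated is just my own name, not anything Spark provides, and df is the dataframe built at the top of the notebook):

    # Evaluate the emptiness check exactly once, at the top of the notebook.
    # head(1) returns a list of at most one Row, so bool() of it is False
    # only when the dataframe is empty, and the size of df shouldn't matter.
    df_populated = bool(df.head(1))

    # Every later paragraph then guards on the cached Python boolean
    # instead of triggering another Spark job.
    if df_populated:
        pass  # ...<remainder of paragraph>...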

Both methods work well. Time trials indicate that my solution is faster, but I am looking for an actual mechanism here and not just empirical results.
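
For anyone who wants to reproduce a comparison, the trials were of roughly this shape (a minimal sketch rather than my exact harness, again assuming the df from the notebook; the numbers will obviously depend on the data and cluster):

    import time

    # Time the existing pattern: a full count on every check.
    start = time.perf_counter()
    count_result = df.count() > 0
    count_seconds = time.perf_counter() - start

    # Time the proposed pattern: fetch at most one row.
    start = time.perf_counter()
    head_result = bool(df.head(1))
    head_seconds = time.perf_counter() - start

    print(f"count() > 0:   {count_seconds:.3f}s -> {count_result}")
    print(f"bool(head(1)): {head_seconds:.3f}s -> {head_result}")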

  • why not store the result in a variable so you don't have to rerun the same count? – samkart Nov 11 '22 at 08:07
  • Jonathan - that is exactly what I'm looking for. For those who don't want to scroll through the comments, .count() will always count the entire dataframe, so there aren't any shortcuts that might speed up performance. samkart - I think that's also a valid solution. I think it falls in the middle of the road in terms of efficiency, since a single count still must be done, whereas head() only looks at the first row of a dataframe, so the size of the dataframe will never impact performance. – Jimmy Donovan Nov 25 '22 at 07:35

0 Answers