This is my first time working with either Python or Spark; I'm a Java developer, so I'm not sure of the best way to solve this.
I'm working with:
- Spark 2.2.0 built for Hadoop 2.7.3
- Python 2.7.12
I have a PySpark script that executes different queries and creates temporary views, until it finally runs a final query joining the different temporary views and writes files with the result of that final query.
The script works fine, but we found out that when there is no data it still creates the 200 files (all empty). We wanted to validate that there actually is data before calling the write method, or even before creating the temporary view, so we tried `if df.count() == 0:`, raising an error if so and proceeding otherwise.
I added that validation to the final two dataframes, before creating their temporary views, so the process is interrupted as early as possible, before the next queries run.
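Roughly, the flow with that validation looks like this (a minimal sketch; the table, view, and output path names are hypothetical stand-ins for the real ones):

```python
# Minimal sketch of the script's flow; table/view/path names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("report").getOrCreate()

# Intermediate queries each register a temporary view...
staging_df = spark.sql("SELECT id, amount FROM source_table WHERE amount > 0")
staging_df.createOrReplaceTempView("staging_view")

# ...until the final query joins the views together.
final_df = spark.sql(
    "SELECT s.id, s.amount, d.name "
    "FROM staging_view s JOIN dim_table d ON s.id = d.id"
)

# The validation we added: fail fast when the result is empty,
# before registering the view and before the write produces 200 empty files.
if final_df.count() == 0:
    raise ValueError("Final query returned no rows; aborting before the write")

final_df.write.parquet("/output/final_result")
```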
Then we read somewhere that `count` is a very expensive way to validate that there is data, because it goes through all the executors, so before even running it we switched to something recommended in several places: `df.take(1)`, `df.head(1)`, or `df.first()`. We finally went with `head(1)`.
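In code, the replacement check looked something like this (again a sketch, reusing the hypothetical `final_df` from above):

```python
# Replacement for the count() check: head(1) returns a list of at most one Row,
# so an empty list means the dataframe has no data.
if len(final_df.head(1)) == 0:
    raise ValueError("Final query returned no rows; aborting before the write")
```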
However, this changed the elapsed execution time from 30 minutes to more than 1h 40m.
I'd like to know how else I can prevent Spark from writing empty files without increasing the computation time that much.
Since I'm new to all this, I'm open to suggestions.
Edit
I have already read this thread: How to check if spark dataframe is empty. From that very thread I took that I should use `len(df.head(1)) == 0`, and that is what increased the computing time from 30 minutes to 1h 40m+.