Spark's dataframe count() function taking very long

Question

In my code, I have a sequence of dataframes where I want to filter out the dataframe's which are empty. I'm doing something like:

Seq(df1, df2).map(df => df.count() > 0)

However, this is taking extremely long and is consuming around 7 minutes for approximately 2 dataframe's of 100k rows each.

My question: Why is Spark's implementation of count() is slow. Is there a work-around?

score 8 · Accepted Answer · answered Aug 24 '17 at 10:45

Count is a lazy operation. So it does not matter how big is your dataframe. But if you have too many costly operations on the data to get this dataframe, then once the count is called spark would actually do all the operations to get these dataframe.

Some of the costly operations may be operations which needs shuffling of data. Like groupBy, reduce etc.

So my guess is you have some complex processing to get these dataframes or your initial data which you used to get this dataframe is too huge.

Spark's dataframe count() function taking very long

1 Answers1

Linked