We have a very large PySpark DataFrame on which we need to perform a groupBy operation.
We've tried
df_gp = df.groupBy('some_column').count()
and it's taking a really long time (it has been running for more than 17 hours with no result).
I also tried
from pyspark.sql.functions import count
df_gp = df.groupBy('some_column').agg(count('some_column'))
but as far as I can tell the behaviour is the same.
For more context:
- we are running this operation in Zeppelin (version 0.8.0), using the %spark2.pyspark interpreter
- Zeppelin is running in YARN client mode
- data is stored in Hive (Hive 3.1.0.3.1.0.0-78)
- the initial DataFrame is created by querying Hive via LLAP:
from pyspark_llap import HiveWarehouseSession
hive = HiveWarehouseSession.session(spark).build()
req = """SELECT *
         FROM table
         WHERE isodate='2020-07-27'"""
df = hive.executeQuery(req)
- DataFrame size is ~60 million rows, 9 columns
- other operations performed on the same DataFrame in the same environment, such as count() or cache(), work in under a minute (quick sketch below)
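For illustration, these are the kinds of calls that return quickly (a rough sketch, not the exact notebook cells):
# count() triggers a full job, but still returns within a minute on this DataFrame
df.count()
# cache() is lazy: it only marks the DataFrame for caching, so it returns immediately
df.cache()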
I've been reading about Spark's groupBy from various sources, but from what I gathered here, the DataFrame API doesn't need to load or shuffle all the keys into memory, so it should not be a problem even on large DataFrames.
I get that a groupBy on such a large volume of data can take some time, but this is really too much. I guess there are some memory parameters that may need tuning, or maybe something is wrong with the way we're executing the groupBy operation?
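In case it helps, this is roughly how we can inspect the query plan for the grouped DataFrame (just a sketch; df_gp is the same grouped DataFrame as above):
# explain(True) prints the parsed, analyzed, optimized and physical plans;
# a BatchEvalPython node in the physical plan would indicate Python UDFs being
# evaluated as part of this job, before the Exchange (shuffle) for the aggregation
df_gp = df.groupBy('some_column').count()
df_gp.explain(True)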
[EDIT] I forgot to mention that there are some UDFs being applied to the DataFrame before the groupBy. I've tried (sketched below):
- groupBy on a large DataFrame, without UDFs: gives a result in less than a minute
- groupBy on a sample of the processed DataFrame: same problem as before
So we're thinking the UDFs are the actual cause of the problem, not groupBy.
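For reference, the two tests look roughly like this (process_with_udfs is a hypothetical placeholder for our UDF pipeline, not its real name):
# Test 1: groupBy directly on the raw DataFrame, no UDFs applied -> fast
raw_df = hive.executeQuery(req)
raw_df.groupBy('some_column').count().show()

# Test 2: groupBy on a sample of the UDF-processed DataFrame -> still slow for us,
# since the UDFs run lazily as part of this job when the aggregation is executed
processed_df = process_with_udfs(raw_df)   # hypothetical stand-in for our UDF step
processed_df.sample(fraction=0.01).groupBy('some_column').count().show()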