I'm using Spark with the Java API to process my data.
One of the essential operations I need to perform is counting the number of records (rows) in a DataFrame.
I tried df.count(),
but the execution time is extremely slow (30-40 seconds for 2-3 million records).
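For context, here is a minimal sketch of what I'm doing (the session setup and the input path are placeholders, not my actual job):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CountExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("count-example")   // placeholder app name
                .getOrCreate();

        // Placeholder input path; the real job reads roughly 2-3M records.
        Dataset<Row> df = spark.read().parquet("path/to/data");

        // This is the slow part: count() triggers a full job over the data.
        long count = df.count();
        System.out.println("Row count: " + count);

        spark.stop();
    }
}
```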
Also, due to system requirements, I can't use the df.rdd().countApprox()
API, because we need the exact count.
Could somebody suggest an alternative that returns exactly the same result as df.count()
does, but with faster execution time?
Any replies are highly appreciated.