Without knowing more about your use case and resources it is hard to give a definitive answer, but the answer is most likely negative, even though Spark will indeed access the source twice.
Overall there are multiple factors to consider:
- How much data will be loaded in the first pass. With an efficient on-disk input format (like Parquet), Spark has no need to fully load the dataset at all. The same applies to a number of other input formats, including but not limited to the JDBC reader.
- The cost of caching data with the `Dataset` API is quite high (which is why the default storage level is `MEMORY_AND_DISK`) and can easily exceed the cost of loading the data.
- Impact on subsequent processing. In general, caching interferes with partition pruning, predicate pushdown, and projections (see Any performance issues forcing eager evaluation using count in spark? and the sketch after this list).
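To make the last two points concrete, here is a minimal Scala sketch (the `/data/events` path, the `date` column, and the filter value are assumptions for illustration) that prints the query plan of the same filter against the Parquet source and against a persisted copy, so you can see where pushdown to the reader stops applying:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Minimal sketch; path, column name and filter value are hypothetical.
object CacheTradeoffSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("cache-tradeoff").getOrCreate()

    val df = spark.read.parquet("/data/events")

    // Without caching: the filter can be pushed down to the Parquet reader,
    // so only matching partitions / row groups are scanned on each pass.
    df.filter("date = '2023-01-01'").explain()

    // With caching: the dataset is materialized once at the chosen storage
    // level (MEMORY_AND_DISK is what Dataset.cache() uses by default), and
    // later filters scan the cached data instead of the source reader.
    val cached = df.persist(StorageLevel.MEMORY_AND_DISK)
    cached.count()                                   // forces materialization
    cached.filter("date = '2023-01-01'").explain()   // note the InMemoryTableScan node

    cached.unpersist()
    spark.stop()
  }
}
```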
So...
> Does the .count means the df will be computed twice?
To some extent, depending on the input format. Plain-text formats (like JSON or CSV) require more repeated work than binary sources.
> do I need to cache the original dataframe
Typically not, unless you know that the cost of fetching the data from storage justifies the disadvantages listed above.
Making the final decision requires a good understanding of your pipeline (primarily how the data will be processed downstream and how caching affects this) and of the metrics you want to optimize (latency, total execution time, monetary cost of the required resources).
You should also consider alternatives to `count`, like processing `InputMetrics` or using `Accumulators`; a sketch of the accumulator approach follows.
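For instance, if the count is only needed for monitoring, an accumulator can piggyback on the job you run anyway, so the source is not read a second time. This is a minimal sketch, assuming a Parquet source and a hypothetical `Event` schema; the `InputMetrics` route would instead register a `SparkListener` and sum `inputMetrics.recordsRead` in `onTaskEnd`. Keep in mind that accumulators updated inside transformations can over-count when tasks are retried.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical schema for illustration only.
case class Event(id: Long, payload: String)

object CountAsSideEffect {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("count-via-accumulator").getOrCreate()
    import spark.implicits._

    val events = spark.read.parquet("/path/to/events").as[Event]

    // Counter filled in while the data is processed anyway,
    // so no separate count() job has to re-read the source.
    val rowCount = spark.sparkContext.longAccumulator("rowCount")

    val cleaned = events.map { e =>
      rowCount.add(1)
      e.copy(payload = e.payload.trim)
    }

    cleaned.write.mode("overwrite").parquet("/path/to/output")

    // Reliable only after the action above has completed; retried tasks
    // may inflate the value.
    println(s"Rows processed: ${rowCount.value}")

    spark.stop()
  }
}
```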