
I am currently working on a project in Databricks with approximately 6 GiB of data in a single table, so as you can imagine, run-times on a table of this size are quite expensive. I would call myself an experienced coder, but I am still new to big data.
When working on smaller datasets I will often test parts of my code to see if it runs properly, e.g.:

df[df['col1'] == 5]

However, with big data such as this, filtering jobs can take many minutes to run.
Something that I am noticing is that the run-time seems to increase as I continue my transformations in the notebook, even after massively reducing the size of the table.

Is there some kind of cache that needs to be emptied as I work through the notebook? Or do I just have to live with long run-times when dealing with sizes such as these?
I don't want to start increasing the size of my computing cluster if I can reduce run-time by simply improving my code.

I realize that this question is quite broad, but any tips or tricks would be greatly appreciated.

  • This question is indeed too broad for SO. I'd recommend providing more technical details (e.g. how is data stored, is it a delta lake table?) and focus on a single problem (e.g. how do I get faster results when filtering/doing x on my dataset?) – ScootCork Dec 17 '22 at 14:36

1 Answer


It's likely that in your col1 == 5 example, Spark has to do a complete scan of every row in the table to find the (possibly single) row with a value of 5.

If you don't need a precise value for testing, you can use .limit(), which will efficiently take only the first rows the database happens to come across.

Likewise, if you know there's only a single col1 == 5 row, df.filter(F.col('col1') == 5).limit(1) will tell Spark to stop searching the table once it finds your row, which is going to be a minor win most of the time.
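
A minimal sketch of both ideas (df stands in for your DataFrame; the column name and row counts are just placeholders):

from pyspark.sql import functions as F

# Take whatever 100 rows Spark reaches first; no full table scan is required
preview = df.limit(100)
preview.show()

# Filter then limit: Spark can stop scanning once a matching row has been found
first_match = df.filter(F.col('col1') == 5).limit(1)
first_match.show()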

.sample() might also help with total runtime while still testing a meaningful subset of the table (put on your Central Limit Theorem hat!).
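
For example, something like the sketch below keeps a pseudo-random slice of the table (the fraction and seed are arbitrary):

# Roughly 1% of rows, chosen pseudo-randomly; the seed makes the sample reproducible
subset = df.sample(fraction=0.01, seed=42)
subset.count()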

cache(), persist() and checkpoint() are also all useful, and explained in this question: What is the difference between spark checkpoint and persist to a disk
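
This last point may well explain why your run-times grow as the notebook goes on: unless you cache an intermediate result, each action re-computes the full chain of transformations from the original table, however small the final DataFrame is. A rough sketch (names are placeholders):

# Cache the reduced table so later cells reuse it instead of replaying every earlier transformation
small_df = df.filter(df['col1'] == 5).cache()
small_df.count()      # an action is needed to actually materialize the cache

# ...further work on small_df now starts from the cached data...

small_df.unpersist()  # release the cached copy when you no longer need it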
