
I am currently working on a project in Databricks with approximately 6 GiB of data in a single table, so as you can imagine, run-times on a table of this size are quite expensive. I would call myself an experienced coder, but I am still new to big data.
When working on smaller datasets I will often test parts of my code to see if it runs properly, e.g.:

df[df['col1'] == 5]

However, with big data such as this, filtering jobs can take many minutes to run.
Something that I am noticing is that the run-time seems to increase as I continue my transformations in the notebook, even after massively reducing the size of the table.

Is there some kind of cache that needs to be emptied as I work through the notebook? Or do I just have to live with long run-times when dealing with sizes such as these?
I don't want to start increasing the size of my computing cluster if I can reduce run-time by simply improving my code.

I realize that this question is quite broad, but any tips or tricks would be greatly appreciated.

  • This question is indeed too broad for SO. I'd recommend providing more technical details (e.g. how is data stored, is it a delta lake table?) and focus on a single problem (e.g. how do I get faster results when filtering/doing x on my dataset?) – ScootCork Dec 17 '22 at 14:36

1 Answer


It's likely that in your col1 == 5 example, Spark has to do a complete scan of every row in the table to find the (possibly single) row with a value of 5.

If you don't need a precise value for testing, you can use .limit(), which will efficiently take only the first rows the database happens to come across.

Likewise, if you know there's only a single col1 == 5 row, df.filter(F.col('col1') == 5).limit(1) will tell Spark to stop searching the table once it finds your row, which is going to be a minor win most of the time.
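
A minimal sketch of both ideas (df stands in for your DataFrame; the column name and row counts are just placeholders):

from pyspark.sql import functions as F

# Take whatever 100 rows Spark reaches first; no full table scan is required
preview = df.limit(100)
preview.show()

# Filter then limit: Spark can stop scanning once a matching row has been found
first_match = df.filter(F.col('col1') == 5).limit(1)
first_match.show()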

.sample() might also help with total runtime while still testing a meaningful subset of the table (put on your Central Limit Theorem hat!).
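
For example, something like the sketch below keeps a pseudo-random slice of the table (the fraction and seed are arbitrary):

# Roughly 1% of rows, chosen pseudo-randomly; the seed makes the sample reproducible
subset = df.sample(fraction=0.01, seed=42)
subset.count()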

cache(), persist() and checkpoint() are also all useful, and explained in this question: What is the difference between spark checkpoint and persist to a disk
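
This last point may well explain why your run-times grow as the notebook goes on: unless you cache an intermediate result, each action re-computes the full chain of transformations from the original table, however small the final DataFrame is. A rough sketch (names are placeholders):

# Cache the reduced table so later cells reuse it instead of replaying every earlier transformation
small_df = df.filter(df['col1'] == 5).cache()
small_df.count()      # an action is needed to actually materialize the cache

# ...further work on small_df now starts from the cached data...

small_df.unpersist()  # release the cached copy when you no longer need it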
