I am currently working on a project in Databricks with approximately 6 GiB of data in a single table, so you can imagine that operations on a table of this size can take quite a while to run.
I would call myself an experienced coder, but I am still new to big data.
When working on smaller datasets I will often test parts of my code to see if it runs properly, e.g.:
df[df["col1"] == 5]
However, with a table this large, even a simple filter job can take many minutes to run.
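For context, the same kind of quick check on the Databricks table looks roughly like this (PySpark syntax; df and col1 are just placeholder names):

filtered = df.filter(df["col1"] == 5)  # build the filtered DataFrame
filtered.count()                       # pull the result back; this is the step that takes minutes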
One thing I am noticing is that run-times seem to increase as I keep adding transformations in the notebook, even after massively reducing the size of the table.
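To make the pattern concrete, my notebook looks roughly like this (table and column names are made up):

df = spark.table("my_schema.my_table")                       # the ~6 GiB source table
small = df.filter(df["col1"] == 5)                           # cuts the row count down massively
step2 = small.withColumn("col2_doubled", small["col2"] * 2)  # further transformations on the reduced data
step3 = step2.groupBy("col1").count()
step3.show()                                                 # each check like this seems slower than the last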
Is there some kind of cache that needs to be cleared as I go along within the notebook? Or do I just have to live with long run-times when dealing with data of this size?
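(The closest things I have come across so far are df.cache() / df.unpersist() and spark.catalog.clearCache(), but I haven't tried them and I'm not sure they are even related to what I'm seeing.)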
I don't want to start increasing the size of my compute cluster if I can reduce run-time simply by improving my code.
I realize that this question is quite broad, but any tips or tricks would be greatly appreciated.