
I am learning data science and use Jupyter Notebook for my work. I have already done a lot of data preprocessing and model training. But I realize that each time I shut down the notebook and want to continue the work the next day, I have to run all the cells again, from the first cell to the one where I stopped last time. This wastes my time because running all the cells takes a long time. I believe there must be a better way. Since I load the data, process the data, and fit the machine learning model, it does not make sense to start over each time. However, I haven't found the answer. Can anybody tell me how to do this?

I have just heard of Dill. It saves variables, but does it save the ML model as well? And when I reopen Jupyter, is the session exactly the same as when I shut it down?
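For reference, a minimal sketch of the dill session calls this refers to, with a hypothetical file name:

```python
import dill

# Before shutting the notebook down: save the whole interpreter session
# (all top-level variables, including a fitted model) to a file.
dill.dump_session("notebook_session.db")  # hypothetical file name

# The next day, in a fresh kernel: restore everything that was saved.
dill.load_session("notebook_session.db")
```

Note that only picklable objects survive; things like open file handles or database connections are not restored.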

user10262232
  • Possible duplicate of [How to pickle or store Jupyter (IPython) notebook session for later](https://stackoverflow.com/questions/34342155/how-to-pickle-or-store-jupyter-ipython-notebook-session-for-later) – jpp Oct 08 '18 at 17:08

1 Answer


There is no built-in way to save the state of a whole Jupyter notebook. All variables live in the kernel's memory, so when you shut down the notebook, everything is lost.

What you can do is explicitly save intermediate steps:

  • For data processing, use df.to_csv(...) once you have your final dataset, so that you don't have to preprocess the data again. When opening the notebook, check whether the file exists, and if it does, load it into a new DataFrame instead of redoing the preprocessing.
  • After training the model, save it with the pickle library (see the first comment on your question), and load the trained model when you reopen the notebook. A sketch covering both steps follows this list.
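A minimal sketch of this caching pattern, assuming hypothetical file names (processed.csv, model.pkl) and placeholder preprocess()/train_model() functions standing in for your own code:

```python
import os
import pickle

import pandas as pd

PROCESSED_CSV = "processed.csv"  # hypothetical cache file names
MODEL_FILE = "model.pkl"

# Step 1: reuse the preprocessed dataset if a previous session saved it.
if os.path.exists(PROCESSED_CSV):
    df = pd.read_csv(PROCESSED_CSV)
else:
    raw = pd.read_csv("raw_data.csv")  # hypothetical raw input file
    df = preprocess(raw)               # placeholder for your preprocessing steps
    df.to_csv(PROCESSED_CSV, index=False)

# Step 2: reuse the trained model if a previous session pickled it.
if os.path.exists(MODEL_FILE):
    with open(MODEL_FILE, "rb") as f:
        model = pickle.load(f)
else:
    model = train_model(df)            # placeholder for your model-fitting code
    with open(MODEL_FILE, "wb") as f:
        pickle.dump(model, f)
```

Because each branch checks for the file first, the very first run still computes everything, and every later session skips straight to the cached artifacts.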

If you do that, you don't have to re-run all the heavy tasks every time you shut down and open the notebook again.

Hope that helps, cheers!

guillemch
  • Yes, this works when you are able to create the model in the first sitting. Sadly, my work is more about exploratory data analysis. I can save the output of each analysis for later use, but every time I have to read the CSV (input data) again to run any new analysis on it, since I work on the data across multiple sessions, and after a shutdown the loaded CSV is lost. That itself is a pain, as the files are about 1 GB and take forever to read. – Meet Mar 27 '21 at 05:25