The Jupyter (IPython) notebook is deservedly known as a good tool for prototyping code and doing all kinds of machine learning work interactively. But when I use it, I inevitably run into the following:
- the notebook quickly becomes too complex and messy to be maintained and improved further as a notebook, and I have to make Python scripts out of it;
- when it comes to production code (e.g. code that needs to be re-run every day), the notebook again is not the best format.
Suppose I've developed a whole machine learning pipeline in Jupyter that includes fetching raw data from various sources, cleaning the data, feature engineering, and finally training models. What is the best way to turn it into scripts with efficient and readable code? So far I have tackled it in several ways:
1. Simply convert the .ipynb to .py and, with only slight changes, hard-code the whole pipeline from the notebook into one Python script.
- '+': quick
- '-': dirty, inflexible, inconvenient to maintain
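To make the plain conversion concrete: in practice `jupyter nbconvert --to script` does this job. The sketch below only illustrates what the conversion amounts to, assuming the notebook is stored in the nbformat 4 JSON layout (a top-level `cells` list whose entries have `cell_type` and `source`). The function name `notebook_to_script` and the demo notebook are made up for illustration.

```python
import json


def notebook_to_script(notebook_json: str) -> str:
    """Concatenate the code cells of a notebook (nbformat 4 JSON) into one script."""
    notebook = json.loads(notebook_json)
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        source = cell.get("source", "")
        # nbformat stores the source either as one string or as a list of lines
        if isinstance(source, list):
            source = "".join(source)
        chunks.append(source.rstrip())
    # One blank line between former cells, trailing newline at the end
    return "\n\n".join(chunks) + "\n"


if __name__ == "__main__":
    # Tiny hand-written notebook, for illustration only
    demo = json.dumps({
        "nbformat": 4,
        "nbformat_minor": 5,
        "metadata": {},
        "cells": [
            {"cell_type": "markdown", "metadata": {}, "source": ["# Load data"]},
            {"cell_type": "code", "metadata": {}, "outputs": [],
             "execution_count": None, "source": ["x = 1\n", "y = x + 1"]},
            {"cell_type": "code", "metadata": {}, "outputs": [],
             "execution_count": None, "source": "print(y)"},
        ],
    })
    print(notebook_to_script(demo))
```

The result is exactly the notebook's cells in order, which is why this route is quick but inherits all of the notebook's structure, or lack of it.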
2. Make a single script with many functions (roughly one function per one or two cells), trying to map the stages of the pipeline onto separate functions and naming them accordingly. Then specify all parameters and global constants via argparse.
- '+': more flexible usage; more readable code (if you have properly translated the pipeline logic into functions)
- '-': oftentimes, the pipeline is NOT splittable into logically complete pieces that could become functions without quirks in the code. All these functions typically need to be called only once in the script, rather than many times inside loops, maps, etc. Furthermore, each function typically takes the output of all the functions called before it, so one has to pass many arguments to each function.
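A minimal sketch of this single-script layout, with one function per pipeline stage and the parameters exposed through argparse. Everything here is a toy stand-in: the stage names (`fetch_raw`, `clean`, `add_features`, `train`), the flags `--n-rows` and `--max-x`, and the "model" (just the mean of the target) are assumptions for illustration, not a real pipeline.

```python
import argparse


def fetch_raw(n_rows):
    # Stand-in for reading from files, databases, or APIs
    return [{"x": i, "y": 2 * i + 1} for i in range(n_rows)]


def clean(rows, max_x):
    # Drop rows outside the allowed range
    return [r for r in rows if r["x"] <= max_x]


def add_features(rows):
    # Toy feature engineering: add a squared term
    return [{**r, "x_sq": r["x"] ** 2} for r in rows]


def train(rows):
    # Toy "model": the mean of the target column
    return sum(r["y"] for r in rows) / len(rows)


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Toy pipeline")
    parser.add_argument("--n-rows", type=int, default=10)
    parser.add_argument("--max-x", type=int, default=5)
    return parser.parse_args(argv)


def main(argv=None):
    # Each stage is called exactly once, in order
    args = parse_args(argv)
    rows = fetch_raw(args.n_rows)
    rows = clean(rows, args.max_x)
    rows = add_features(rows)
    return train(rows)


# In a real script, finish with:
# if __name__ == "__main__":
#     print(main())
```

Letting `main` accept `argv` keeps the script callable from tests or another module, e.g. `main(["--n-rows", "4", "--max-x", "2"])`. In this toy version every stage needs only the previous stage's output; the argument lists grow once later stages also need earlier intermediate results, which is the drawback described above.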
3. The same as point (2), but now wrap all the functions inside a class. All the global constants, as well as the outputs of each method, can then be stored as attributes of the object.
- '+': you don't need to pass many arguments to each method, since all the previous outputs are already stored as attributes
- '-': the overall logic of the task is still not captured: it is a data and machine learning pipeline, not just a class. The only purpose of the class is to be instantiated, call all the methods sequentially one by one, and then be discarded. On top of this, classes take quite long to implement.
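A minimal sketch of the class-based variant, in which every stage reads its inputs from attributes and stores its output as another attribute. The class name `Pipeline`, its methods, and the toy data are hypothetical and chosen only to show the shape of the code.

```python
class Pipeline:
    def __init__(self, n_rows=10, max_x=5):
        # Former global constants become attributes
        self.n_rows = n_rows
        self.max_x = max_x
        # Intermediate results are filled in stage by stage
        self.raw = None
        self.cleaned = None
        self.features = None
        self.model = None

    def fetch_raw(self):
        # Stand-in for reading from files, databases, or APIs
        self.raw = [{"x": i, "y": 2 * i + 1} for i in range(self.n_rows)]
        return self

    def clean(self):
        self.cleaned = [r for r in self.raw if r["x"] <= self.max_x]
        return self

    def add_features(self):
        self.features = [{**r, "x_sq": r["x"] ** 2} for r in self.cleaned]
        return self

    def train(self):
        # Toy "model": the mean of the target column
        self.model = sum(r["y"] for r in self.features) / len(self.features)
        return self

    def run(self):
        # The stages only make sense in this order
        return self.fetch_raw().clean().add_features().train()


if __name__ == "__main__":
    pipeline = Pipeline(n_rows=4, max_x=2).run()
    print(pipeline.model)
```

No method takes arguments any more, but the required call order now lives only in `run`: calling `clean` before `fetch_raw` fails because `self.raw` is still `None`. That hidden ordering is one way the "pipeline, not just a class" objection shows up in code.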
4. Convert the notebook into a Python module made up of several scripts. I haven't tried this, but I suspect it is the most time-consuming way to deal with the problem.
I suppose this overall situation is very common among data scientists, but surprisingly I cannot find any useful advice on it.
Folks, please share your ideas and experience. Have you ever encountered this issue? How have you tackled it?