
I have a jupyter notebook which has ~400 cells. The total file size is 8MB so I'd like to suppress the output cells that have a large size so as to reduce the overall file size.

There are quite a few possible output cells that could be causing this (mainly matplotlib and seaborn plots) so to avoid spending time on trial and error, is there a way of finding the size of each output cell? I'd like to keep as many output plots as possible as I'll be pushing the work to github for others to see.

Tom B.
  • Do you have reason to expect that some plots are much larger than others? If not, I'm not sure that it makes sense to put effort into searching for the largest ones. – mwaskom Aug 09 '22 at 10:43
  • I have a bunch of "standard" plots such as scatter plots and also I have a bunch of large-ish seaborn pairplots (10x10) plus others (such as matrix plots created by the missingno package). My knowledge of figure disk usage is minimal but I am guessing the latter are larger than the former. – Tom B. Aug 09 '22 at 12:55
  • I managed to reduce it to 2.7MB by commenting out the plotly code (the output of which you cannot see on github anyway) so the issue is pretty much fixed. I shall leave the question up though, in case someone does have an answer to the original question. – Tom B. Aug 09 '22 at 13:42
  • You could likely use `nbformat` to iterate on the cells in your notebook and check which ones have the larger base64 storage, I think. As an aside, you should be using nbviewer to share your GitHub-hosted notebooks in 'static' form, and then the Plotly plots would be visible and interactive, see [here](https://stackoverflow.com/a/73297292/8508004). – Wayne Aug 09 '22 at 21:00

1 Answer


Here is my `nbformat` idea spelled out, ready to run in a cell in a Jupyter notebook. It lists the code cell numbers from largest to smallest output size (it first fetches an example notebook so there is something to try it on):

############### Get test notebook ########################################
import os
notebook_example = "matplotlib3d-scatter-plots.ipynb"
if not os.path.isfile(notebook_example):
    !curl -OL https://raw.githubusercontent.com/fomightez/3Dscatter_plot-binder/master/matplotlib3d-scatter-plots.ipynb

### Use nbformat to estimate the output size of each code cell ###########
import nbformat as nbf
ntbk = nbf.read(notebook_example, nbf.NO_CONVERT)
size_estimate_dict = {}
for cell in ntbk.cells:
    if cell.cell_type == 'code':
        # The string length of the outputs list is a rough size estimate.
        # (Keys assume the cells were run; unrun cells all have
        # execution_count of None and would overwrite each other.)
        size_estimate_dict[cell.execution_count] = len(str(cell.outputs))
out_size_info = [k for k, v in sorted(size_estimate_dict.items(),
                                      key=lambda item: item[1], reverse=True)]
out_size_info
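If you also want to see the estimated sizes next to the cell numbers (not just the ordering), a small variant is below. It is a sketch that builds a tiny notebook in memory with nbformat's v4 constructors so it runs standalone; in practice you would `nbf.read()` your own `.ipynb` file instead:

```python
import nbformat as nbf

# Build a tiny two-cell notebook in memory so the example is
# self-contained (replace this with nbf.read() on a real file).
nb = nbf.v4.new_notebook()
big = nbf.v4.new_code_cell(
    "plot()", execution_count=1,
    outputs=[nbf.v4.new_output("stream", name="stdout", text="x" * 5000)])
small = nbf.v4.new_code_cell(
    "print(1)", execution_count=2,
    outputs=[nbf.v4.new_output("stream", name="stdout", text="1\n")])
nb.cells = [big, small]

# Estimate each code cell's output size, then list
# (execution_count, estimated_bytes) pairs, largest first.
sizes = {c.execution_count: len(str(c.outputs))
         for c in nb.cells if c.cell_type == "code"}
ranked = sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```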

(For a place to easily run that code, go here and click on the launch binder button. When the session spins up, open a new notebook, paste in the code, and run it. A static form of the notebook is here.)

The example I tried didn't include Plotly, but the code seemed to work similarly on a notebook made up entirely of Plotly plots. I don't know how it will handle a mix of plot types, though; the sorting may not be perfect when different output kinds are compared.
Hopefully this gives you an idea of how to do what you asked. The code example could be further expanded to use the retrieved size estimates to have nbformat make a copy of the input notebook with the output cleared for, say, the top ten largest code cells.
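That extension might look something like the sketch below. The function name and parameters are hypothetical; it keys cells by their position in the notebook rather than by `execution_count`, to avoid clashes when some cells were never run:

```python
import nbformat as nbf

def strip_largest_outputs(path, out_path, n=10):
    """Sketch: write a copy of the notebook at `path` with the outputs
    of the n largest code cells cleared (hypothetical helper)."""
    nb = nbf.read(path, nbf.NO_CONVERT)
    # Key by cell index, not execution_count, so unrun cells don't clash.
    sizes = {i: len(str(c.outputs))
             for i, c in enumerate(nb.cells) if c.cell_type == "code"}
    largest = sorted(sizes, key=sizes.get, reverse=True)[:n]
    for i in largest:
        nb.cells[i].outputs = []
        nb.cells[i].execution_count = None
    nbf.write(nb, out_path)
```

Running it as `strip_largest_outputs("input.ipynb", "trimmed.ipynb", n=10)` would leave the ten biggest outputs out of the copy while keeping every smaller plot intact.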

Wayne