
Environment:

  • Python 3
  • IPython 3.2

Every time I shut down an IPython notebook and re-open it, I have to re-run all the cells. But some cells involve intensive computation.

By contrast, knitr in R saves the results in a cache directory by default, so only new code and new settings trigger recomputation.

I looked at ipycache, but it seems to cache a cell rather than the whole notebook. Is there a counterpart of knitr's cache in IPython?

Zelong
  • I don't know if there is such a capability in IPython, but you could simply cache your expensive computations to disk with, for instance, [joblib.Memory](https://pythonhosted.org/joblib/memory.html). – rth Jul 06 '15 at 22:17
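
A minimal sketch of the joblib.Memory approach mentioned in that comment (the cache directory and the slow function are purely illustrative):

from joblib import Memory

# results are written under ./cachedir and survive kernel restarts
memory = Memory('./cachedir', verbose=0)

@memory.cache
def expensive_computation(n):
    # stand-in for an intensive computation
    return sum(i * i for i in range(n))

result = expensive_computation(10000000)  # computed once, then loaded from disk on later calls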

4 Answers


Unfortunately, there doesn't seem to be anything as convenient as an automatic cache. The %store magic is close, but it requires you to do the caching and reloading manually and explicitly.

In your Jupyter notebook:

a = 1
%store a

Now, let's say you close the notebook and the kernel gets restarted. You no longer have access to the local variables. However, you can reload the variables you've stored using the -r option.

%store -r a
print(a)  # should print 1
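
Applied to the question's scenario, a hedged sketch might look like this (the input file and the train_models helper are hypothetical):

import pandas as pd

# expensive steps, run once
df = pd.read_csv('big_input.csv')   # hypothetical large input
models = train_models(df)           # hypothetical slow training step

# persist the results with the %store magic
%store df
%store models

After a kernel restart, %store -r df and %store -r models bring the objects back without recomputing them.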
viswajithiii

In fact, the functionality you ask for is already there; there is no need to re-implement it manually with your own dumps.

You can use the %store magic or, perhaps better, the %%cache cell magic (an extension) to store the results of these time-consuming cells, so they don't have to be recomputed (see https://github.com/rossant/ipycache).

It is as simple as:

%load_ext ipycache

Then, in a cell e.g.:

%%cache mycache.pkl var1 var2
var1 = 1
var2 = 2

When you execute this cell the first time, the code is executed, and the variables var1 and var2 are saved in mycache.pkl in the current directory along with the outputs. Rich display outputs are only saved if you use the development version of IPython. When you execute this cell again, the code is skipped, the variables are loaded from the file and injected into the namespace, and the outputs are restored in the notebook.

It automatically saves all graphics, printed output, and all the specified variables for you :)
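
For the question's machine-learning use case, a sketch using ipycache's documented syntax might look like this (it assumes the extension is loaded as above and that a feature matrix X is already available; the cache file name is illustrative):

%%cache model_cache.pkl km labels
from sklearn.cluster import KMeans

# slow step: fitted once, then km and labels are reloaded from model_cache.pkl
km = KMeans(n_clusters=8)
labels = km.fit_predict(X)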

ntg
  • Extremely useful, and easier to get working for me than `%autoreload` (my other way of hacking around changing modules but not wanting to reload data) – ijoseph Jun 15 '18 at 20:59
  • `ipycache` seems to need a lot of love. Warnings galore, and last update May 2016. – Tom Hale Mar 20 '19 at 07:23
  • Damn, it used to be low-maintenance :S I guess things change as Python versions progress... Still, I have some good memories, and it is the best solution I've found so far; it would be great to find something better/more active. – ntg Mar 20 '19 at 10:24
  • What is the difference with `%store`? – BND Apr 17 '19 at 12:26
  • Haven't really used `%store` (but now plan to :) ). It really has been a while since I used `ipycache`... If memory serves, it saved all graphics, printouts, etc. of a cell along with the values of variables. Also, you just had to declare a cell as cached and list the output variables. It used the cache if the cell had not been edited since the time of caching, could handle multiple variables, could define variables that invalidate the cache when changed, etc. It was not perfect, but I would really like to see something like it again. – ntg Apr 23 '19 at 19:18
  • `ipycache` is no longer maintained, do you know another tool? – Chris_Rands Feb 04 '20 at 13:22

Use the cache magic.

%cache myVar = someSlowCalculation(some, "parameters")

This will run someSlowCalculation(some, "parameters") once; on subsequent runs it restores myVar from storage instead of recomputing it.

https://pypi.org/project/ipython-cache/

Under the hood it does pretty much the same as the accepted answer.
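
For the question's use case, a sketch might look like the following (it assumes the extension has been installed with pip install ipython-cache and loaded as described on the PyPI page; the CSV file is hypothetical):

import pandas as pd

# first run: executes the slow read and caches df to disk
%cache df = pd.read_csv('huge_dataset.csv')

# later runs with an unchanged right-hand side reload df from the cache instead of re-reading the CSV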

wotanii
  • When does a cached variable get invalidated? Ideally it would be invalidated when the variables it depends on change, but that seems kind of clever. – Att Righ Feb 08 '22 at 12:27
  • By default, it invalidates when the string to the right of the "=" changes. So it invalidates when the method or its direct parameters change, but it does not look inside the methods or at the values of the parameters. – wotanii Feb 09 '22 at 07:01
  • Oooh, that sounds like precisely what I want. – Att Righ Feb 09 '22 at 15:30
  • Hmm there was no cache invalidation when I just tested this. – Att Righ Feb 25 '22 at 09:47

Can you give an example of what you are trying to do? When I run something expensive in an IPython Notebook, I almost always write it to disk afterward. For example, if my data is a list of JSON objects, I write it to disk as line-separated JSON-formatted strings:

import json

with open('path_to_file.json', 'a') as file:
    for item in data:
        line = json.dumps(item)
        file.write(line + '\n')

You can then read back in the data the same way:

data = []
with open('path_to_file.json', 'r') as file:
    for line in file: 
        data_item = json.loads(line)
        data.append(data_item)

I think this is good practice generally speaking because it gives you a backup. You can also use pickle for the same thing. If your data is really big, you can use gzip.open to write directly to a gzip-compressed file.
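
A minimal sketch of the gzip variant (the file name is illustrative):

import gzip
import json

# write line-separated JSON, compressed with gzip
with gzip.open('path_to_file.json.gz', 'wt') as f:
    for item in data:
        f.write(json.dumps(item) + '\n')

# read it back
data = []
with gzip.open('path_to_file.json.gz', 'rt') as f:
    for line in f:
        data.append(json.loads(line))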

EDIT

To save a scikit-learn model to disk, use joblib's dump and load.

from sklearn.cluster import KMeans

km = KMeans(n_clusters=num_clusters)
km.fit(some_data)


from sklearn.externals import joblib  # on recent scikit-learn versions, use `import joblib` instead
# dump to pickle
joblib.dump(km, 'model.pkl')

# and reload from pickle
km = joblib.load('model.pkl')
brandomr
  • I tried applying machine learning models to datasets. For example, I import some data (a few hundred MB) with pandas, and then train and test two models with scikit-learn. I want to "cache" all the intermediate transformed DataFrames, as well as the trained models, so I can "carry on" experiments on the intermediate DataFrames without reading everything in from scratch. – Zelong Sep 05 '15 at 09:51
  • @zelong ok, you should use `joblib` to pickle your `sklearn` models. See my edit above. And to write your dataframes to disk just use `dataframe.to_csv('yourfile.csv')` – brandomr Sep 05 '15 at 21:03
  • Thanks a lot. The pickling of scikit-learn models looks quite good. I tried quite a bit of DataFrame wrangling, and it seems demanding to save a bunch of intermediate DataFrames to CSV files. But it seems IPython does not provide a counterpart of the `RData` cache, which puts everything in a single cube. – Zelong Sep 05 '15 at 21:44
  • I removed the `file.close()` calls, because `with` [closes files for you](https://docs.python.org/3.6/tutorial/inputoutput.html#reading-and-writing-files). – Eric O. Lebigot Oct 25 '17 at 09:42
  • Also: since `data` is a "list", one could more simply do `json.dump(data, file)`, without any loop. And similarly `json.load(file)`. – Eric O. Lebigot Oct 25 '17 at 09:49
  • Finally, naming something `file` is officially not recommended: it clobbers the built-in `file` type. – Eric O. Lebigot Oct 25 '17 at 09:50
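
Following up on the RData-style wish in the comments above, a hedged sketch would be to bundle the intermediate DataFrames and the fitted model into a single dictionary and dump it to one file with joblib (the object names are hypothetical):

import joblib  # on older scikit-learn versions: from sklearn.externals import joblib

# bundle all intermediate results into a single file
joblib.dump({'df_train': df_train, 'df_test': df_test, 'model': km},
            'experiment_state.pkl')

# restore everything later
state = joblib.load('experiment_state.pkl')
df_train, df_test, km = state['df_train'], state['df_test'], state['model']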