62

Is there a way to memoize the output of a function to disk?

I have a function

def getHtmlOfUrl(url):
    ... # expensive computation

and would like to do something like:

def getHtmlMemoized(url) = memoizeToFile(getHtmlOfUrl, "file.dat")

and then call getHtmlMemoized(url), so as to do the expensive computation only once for each url.

seguso
  • 2,024
  • 2
  • 18
  • 20
  • 1
    Just pickle (or use json) the cache dict. – root May 09 '13 at 14:02
  • 3
    thanks but I am a python newbie (second day). I don't have the slightest idea what you mean... – seguso May 09 '13 at 14:04
  • 1
    Rather than trying to reinvent this wheel, here's a library that does that quite well, and is robust against all sorts of corner cases you won't anticipate until much later (concurrency, disk usage, thundering herd): https://bitbucket.org/zzzeek/dogpile.cache – SingleNegationElimination May 09 '13 at 15:07
  • You can use the library redis-simple-cache which does exactly this, persistent memoisation of function calls. Check it out : https://github.com/vivekn/redis-simple-cache – abc def foo bar Oct 17 '13 at 12:37
  • For anyone reading this in the future, the better redis cache to use now is https://pypi.org/project/python-redis-cache/ – JakeCowton Apr 15 '20 at 10:57
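root's suggestion in the comments (just pickle the cache dict) can be sketched as follows; the function and file names here are illustrative, not from the question:

```python
import os
import pickle

def load_cache(path):
    # Return the cached dict, or an empty one if no cache file exists yet
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {}

def save_cache(cache, path):
    with open(path, "wb") as f:
        pickle.dump(cache, f)

def get_html_cached(url, fetch, path="url_cache.pkl"):
    # fetch is the expensive function; it is only called on a cache miss
    cache = load_cache(path)
    if url not in cache:
        cache[url] = fetch(url)
        save_cache(cache, path)
    return cache[url]
```

Calling get_html_cached twice with the same URL runs the expensive fetch only once; the second call is served from the pickle file.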

11 Answers

56

Python offers a very elegant way to do this - decorators. Basically, a decorator is a function that wraps another function to provide additional functionality without changing the function source code. Your decorator can be written like this:

import json

def persist_to_file(file_name):

    def decorator(original_func):

        try:
            cache = json.load(open(file_name, 'r'))
        except (IOError, ValueError):
            cache = {}

        def new_func(param):
            if param not in cache:
                cache[param] = original_func(param)
                json.dump(cache, open(file_name, 'w'))
            return cache[param]

        return new_func

    return decorator

Once you've got that, 'decorate' the function using @-syntax and you're ready.

@persist_to_file('cache.dat')
def html_of_url(url):
    your function code...

Note that this decorator is intentionally simplified and may not work in every situation, for example when the source function accepts or returns data that cannot be JSON-serialized.

More on decorators: How to make a chain of function decorators?

And here's how to make the decorator save the cache just once, at exit time:

import json, atexit

def persist_to_file(file_name):

    try:
        cache = json.load(open(file_name, 'r'))
    except (IOError, ValueError):
        cache = {}

    atexit.register(lambda: json.dump(cache, open(file_name, 'w')))

    def decorator(func):
        def new_func(param):
            if param not in cache:
                cache[param] = func(param)
            return cache[param]
        return new_func

    return decorator
Community
  • 1
  • 1
georg
  • 211,518
  • 52
  • 313
  • 390
  • 8
    This will write a new file every time the cache is updated - dependent on the use case this may (or may not) defeat the speedup you get from memoization.... – root May 09 '13 at 14:58
  • 5
    it *also* contains a very nice race condition, if this decorator is used concurrently, or (more likely), in a re-entrant fashion. if `a()` and `b()` are both memoized, and `a()` calls `b()`, the cache can be read for `a()`, and then again for `b()`, first b's result is memoized, but then the stale cache from the call to a overwrites it, b's contribution to the cache is lost. – SingleNegationElimination May 09 '13 at 15:03
  • 1
    @root: sure, `atexit` would be perhaps a better place for flushing the cache. On the other side, adding premature optimizations might defeat the educational purpose of this code. – georg May 09 '13 at 15:25
  • 1
    surely in the general case it'd be better to pickle rather than serialize to json. also the stdlib contains this https://docs.python.org/2/library/shelve.html which wraps the pickling in a dict style interface... pretty much a readymade cache system, just add decorator – Anentropic Jan 31 '15 at 16:14
  • I made a package called [simpler](https://pypi.org/project/simpler/) with an annotation `simpler.files.disk_cache` that fits exactly this purpose, including a cache lifetime and argument-aware caching using annotation parameters. – Carlos Roldán Dec 09 '20 at 15:46
  • is there a way to use this decorator inside an instance method ? – Pablo Nov 13 '21 at 20:00
  • This won't work if the function that you are memoizing has multiple arguments. A quick fix is to change the `def new_func` line to `def new_func(*args, **kwargs):`;`param=str([args, kwargs])` and then you need to change the call to `func`, too. – Eyal Jan 28 '22 at 16:28
  • I found this very useful. I'll just add that if you want to have the ability to clear the cache or add some metadata to it (such as data attribution), you can bind the related dict methods for `cache` to `new_func` within the decorator, such as `new_func.clear_cache = cache.clear` and `new_func.update_cache = cache.update` immediately before the `return new_func` line. Then say your wrapped function is named `expensive_function(args)`, you can clear the cache with `expensive_function.clear_cache()`. This is similar to how functools adds the clear_cache option. – nigh_anxiety Jan 19 '23 at 01:21
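The multi-argument fix Eyal describes in the comments can be sketched like this; it is the same idea as the answer's decorator, with the cache key built from all positional and keyword arguments (the variant name is mine, not from the answer):

```python
import json

def persist_to_file_multi(file_name):
    def decorator(original_func):
        # Load any existing cache; start empty if the file is missing/corrupt
        try:
            with open(file_name, 'r') as f:
                cache = json.load(f)
        except (IOError, ValueError):
            cache = {}

        def new_func(*args, **kwargs):
            # Build a string key from the full call signature so it is
            # JSON-serializable (sort kwargs for a stable ordering)
            key = str([args, sorted(kwargs.items())])
            if key not in cache:
                cache[key] = original_func(*args, **kwargs)
                with open(file_name, 'w') as f:
                    json.dump(cache, f)
            return cache[key]

        return new_func
    return decorator
```

As with the original, this assumes arguments and return values survive a JSON round trip.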
53

Check out joblib.Memory. It's a library for doing exactly that.

from joblib import Memory
memory = Memory("cachedir")
@memory.cache
def f(x):
    print('Running f(%s)' % x)
    return x
Colonel Panic
  • 132,665
  • 89
  • 401
  • 465
Will
  • 4,241
  • 4
  • 39
  • 48
  • 2
    pros: Works for me, and it's super fast. cons: The folder names are letters and numbers like `3e2bbf56be556090e768d65967cb9122`, and the outputs are `.pkl` files so I can't easily see what the cached data looks like. – mareoraft Sep 02 '21 at 21:44
28

A cleaner solution powered by Python's shelve module. The advantage is that the cache is updated in real time via the well-known dict syntax, and it is exception-proof (no need to handle an annoying KeyError).

import shelve
def shelve_it(file_name):
    d = shelve.open(file_name)

    def decorator(func):
        def new_func(param):
            if param not in d:
                d[param] = func(param)
            return d[param]

        return new_func

    return decorator

@shelve_it('cache.shelve')
def expensive_function(param):
    pass

The function will be computed just once; subsequent calls return the stored result.

Zitrax
  • 19,036
  • 20
  • 88
  • 110
nehem
  • 12,775
  • 6
  • 58
  • 84
  • How do you make it multi-parameter? – RogerS Aug 29 '22 at 22:35
  • @RogerS change the decorator so that it accepts *args **kwargs. – nehem Aug 31 '22 at 01:41
  • As implemented here, this solution doesn't appear to work in 3.9, unless param happens to be a string; If `param` is any other data type it will throw an AttributeError due to not having `encode`. You'll also get collisions if you have multiple functions using the decorator with similar parameters. I did get this to work with modifications. Add `writeback=True`. Set `key = f"{func.__name__}", if key not in d, set `d[key] == {}`, then `func_cache = d[key]`. Then replace references to `d` inside `new_func` with `func_cache`. I also declared `d` as global, and then `d.sync()` before exit – nigh_anxiety Dec 30 '22 at 21:29
  • Correcting myself a bit here, there won't be any parameter collision for multiple functions as each function gets it's own cache file, but the problem with needing to convert param to str() still stands. – nigh_anxiety Dec 30 '22 at 22:13
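The *args/**kwargs change nehem mentions can be sketched like this; note that shelve keys must be strings, so the whole call signature is stringified here, and the shelf is opened per call so writes are flushed (this variant is illustrative, not part of the original answer):

```python
import shelve

def shelve_it_multi(file_name):
    def decorator(func):
        def new_func(*args, **kwargs):
            # shelve keys must be strings, so stringify the call signature
            key = str([args, sorted(kwargs.items())])
            # Open per call so updates are flushed to disk immediately
            with shelve.open(file_name) as d:
                if key not in d:
                    d[key] = func(*args, **kwargs)
                return d[key]
        return new_func
    return decorator
```

This also sidesteps the AttributeError described above, since non-string parameters are converted before being used as keys.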
24

There is also diskcache.

from diskcache import Cache

cache = Cache("cachedir")

@cache.memoize()
def f(x, y):
    print('Running f({}, {})'.format(x, y))
    return x, y
C. Yduqoli
  • 1,706
  • 14
  • 18
  • 3
    if the function to be memoized is a method of an object, and the `cache` variable is init-ed in that object, how can this be employed ? – Ciprian Tomoiagă Dec 21 '20 at 14:33
  • It's simple to use and works well. It stores all the cached data in a file called `cache.db` within the cache dir you specify. – mareoraft Sep 02 '21 at 21:47
5

The Artemis library has a module for this. (you'll need to pip install artemis-ml)

You decorate your function:

from artemis.fileman.disk_memoize import memoize_to_disk

@memoize_to_disk
def fcn(a, b, c = None):
    results = ...
    return results

Internally, it makes a hash out of input arguments and saves memo-files by this hash.

Peter
  • 12,274
  • 9
  • 71
  • 86
3

Check out Cachier. It supports additional cache configuration parameters like TTL etc.

Simple example:

from cachier import cachier
import datetime

@cachier(stale_after=datetime.timedelta(days=3))
def foo(arg1, arg2):
  """foo now has a persistent cache, trigerring recalculation for values stored more than 3 days."""
  return {'arg1': arg1, 'arg2': arg2}
thegreendroid
  • 3,239
  • 6
  • 31
  • 40
0

Something like this should do:

import json

class Memoize(object):
    def __init__(self, func):
        self.func = func
        self.memo = {}

    def load_memo(self, filename):
        with open(filename) as f:
            self.memo.update(json.load(f))

    def save_memo(self, filename):
        with open(filename, 'w') as f:
            json.dump(self.memo, f)

    def __call__(self, *args):
        # stringify args so the keys survive the JSON round trip
        key = str(args)
        if key not in self.memo:
            self.memo[key] = self.func(*args)
        return self.memo[key]

Basic usage:

your_mem_func = Memoize(your_func)
your_mem_func.load_memo('yourdata.json')
#  do your stuff with your_mem_func

If you want to write your "cache" to a file after using it -- to be loaded again in the future:

your_mem_func.save_memo('yournewdata.json')
root
  • 76,608
  • 25
  • 108
  • 120
  • 1
    @seguso - updated on the usage. More on memoization: http://stackoverflow.com/questions/1988804/what-is-memoization-and-how-can-i-use-it-in-python – root May 09 '13 at 14:31
  • @Merlin - This is a classic case of memoization where a class should (or could as well) be used... – root May 09 '13 at 14:45
  • Thanks but this is not what I mean by memoize. I don't want to manually manage loading and saving to disk. This should be transparent. – seguso May 09 '13 at 14:55
  • @seguso - then go with *thg435* variant, but note that it will write to a file every time the cache is updated - that may end up defeating the purpose of memoization. (in this case you can manage the writes, while you still need to load the file only once) – root May 09 '13 at 15:00
0

Assuming that your data is JSON-serializable, this code should work:

import os, json

def json_file(fname):
    def decorator(function):
        def wrapper(*args, **kwargs):
            if os.path.isfile(fname):
                with open(fname, 'r') as f:
                    ret = json.load(f)
            else:
                ret = function(*args, **kwargs)
                with open(fname, 'w') as f:
                    json.dump(ret, f)
            return ret
        return wrapper
    return decorator

Decorate getHtmlOfUrl and then simply call it; if it has been run previously, you will get your cached data.

Checked with python 2.x and python 3.x

Uri Goren
  • 13,386
  • 6
  • 58
  • 110
0

You can use the cache_to_disk package:

    from cache_to_disk import cache_to_disk

    @cache_to_disk(3)
    def my_func(a, b, c, d=None):
        results = ...
        return results

This will cache the results for 3 days, specific to the arguments a, b, c and d. The results are stored in a pickle file on your machine, and unpickled and returned next time the function is called. After 3 days, the pickle file is deleted until the function is re-run. The function will be re-run whenever the function is called with new arguments. More info here: https://github.com/sarenehan/cache_to_disk

0

Most answers take a decorator approach. But maybe I don't want to cache the result every single time I call the function.

I made a solution using a context manager, so the function can be called as

with DiskCacher('cache_id', myfunc) as myfunc2:
    res=myfunc2(...)

when you need the caching functionality.

The 'cache_id' string is used to distinguish data files, which are named [calling_script]_[cache_id].nc. So if you are doing this in a loop, you will need to incorporate the looping variable into this cache_id, otherwise data will be overwritten.

Alternatively:

myfunc2=DiskCacher('cache_id')(myfunc)
res=myfunc2(...)

Alternatively (this is probably not quite useful, as the same id is used all the time):

@DiskCacher('cache_id')
def myfunc(*args):
    ...

The complete code with examples (I'm using pickle to save/load, but can be changed to whatever save/read methods. NOTE that this is also assuming the function in question returns only 1 return value):

from __future__ import print_function
import sys, os
import functools

def formFilename(folder, varid):
    '''Compose abspath for cache file

    Args:
        folder (str): cache folder path.
        varid (str): variable id to form file name and used as variable id.
    Returns:
        abpath (str): abspath for cache file, which is using the <folder>
            as folder. The file name is the format:
                [script_file]_[varid].nc
    '''
    script_file=os.path.splitext(sys.argv[0])[0]
    name='[%s]_[%s].nc' %(script_file, varid)
    abpath=os.path.join(folder, name)

    return abpath


def readCache(folder, varid, verbose=True):
    '''Read cached data

    Args:
        folder (str): cache folder path.
        varid (str): variable id.
    Keyword Args:
        verbose (bool): whether to print some text info.
    Returns:
        results (tuple): a tuple containing data read in from cached file(s).
    '''
    import pickle
    abpath_in=formFilename(folder, varid)
    if not os.path.exists(abpath_in):
        raise IOError('Cache file not found: %s' %abpath_in)
    if verbose:
        print('\n# <readCache>: Read in variable', varid,
                'from disk cache:\n', abpath_in)
    with open(abpath_in, 'rb') as fin:
        results=pickle.load(fin)

    return results


def writeCache(results, folder, varid, verbose=True):
    '''Write data to disk cache

    Args:
        results (tuple): a tuple containing data read to cache.
        folder (str): cache folder path.
        varid (str): variable id.
    Keyword Args:
        verbose (bool): whether to print some text info.
    '''
    import pickle
    abpath_out=formFilename(folder, varid)
    if verbose:
        print('\n# <writeCache>: Saving output to:\n',abpath_out)
    with open(abpath_out, 'wb') as fout:
        pickle.dump(results, fout)

    return


class DiskCacher(object):
    def __init__(self, varid, func=None, folder=None, overwrite=False,
            verbose=True):
        '''Disk cache context manager

        Args:
            varid (str): string id used to save cache.
                function <func> is assumed to return only 1 return value.
        Keyword Args:
            func (callable): function object whose return values are to be
                cached.
            folder (str or None): cache folder path. If None, use a default.
            overwrite (bool): whether to force a new computation or not.
            verbose (bool): whether to print some text info.
        '''

        if folder is None:
            self.folder='/tmp/cache/'
        else:
            self.folder=folder

        self.func=func
        self.varid=varid
        self.overwrite=overwrite
        self.verbose=verbose

    def __enter__(self):
        if self.func is None:
            raise Exception("Need to provide a callable function to __init__() when used as context manager.")

        return _Cache2Disk(self.func, self.varid, self.folder,
                self.overwrite, self.verbose)

    def __exit__(self, type, value, traceback):
        return

    def __call__(self, func=None):
        _func=func or self.func
        return _Cache2Disk(_func, self.varid, self.folder, self.overwrite,
                self.verbose)



def _Cache2Disk(func, varid, folder, overwrite, verbose):
    '''Inner decorator function

    Args:
        func (callable): function object whose return values are to be
            cached.
        varid (str): variable id.
        folder (str): cache folder path.
        overwrite (bool): whether to force a new computation or not.
        verbose (bool): whether to print some text info.
    Returns:
        decorated function: if cache exists, the function is <readCache>
            which will read cached data from disk. If needs to recompute,
            the function is wrapped that the return values are saved to disk
            before returning.
    '''

    def decorator_func(func):
        abpath_in=formFilename(folder, varid)

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if os.path.exists(abpath_in) and not overwrite:
                results=readCache(folder, varid, verbose)
            else:
                results=func(*args, **kwargs)
                if not os.path.exists(folder):
                    os.makedirs(folder)
                writeCache(results, folder, varid, verbose)
            return results
        return wrapper

    return decorator_func(func)



if __name__=='__main__':

    data=range(10)  # dummy data

    #--------------Use as context manager--------------
    def func1(data, n):
        '''dummy function'''
        results=[i*n for i in data]
        return results

    print('\n### Context manager, 1st time call')
    with DiskCacher('context_mananger', func1) as func1b:
        res=func1b(data, 10)
        print('res =', res)

    print('\n### Context manager, 2nd time call')
    with DiskCacher('context_mananger', func1) as func1b:
        res=func1b(data, 10)
        print('res =', res)

    print('\n### Context manager, 3rd time call with overwrite=True')
    with DiskCacher('context_mananger', func1, overwrite=True) as func1b:
        res=func1b(data, 10)
        print('res =', res)

    #--------------Return a new function--------------
    def func2(data, n):
        results=[i*n for i in data]
        return results

    print('\n### Wrap a new function, 1st time call')
    func2b=DiskCacher('new_func')(func2)
    res=func2b(data, 10)
    print('res =', res)

    print('\n### Wrap a new function, 2nd time call')
    res=func2b(data, 10)
    print('res =', res)

    #----Decorate a function using the syntax sugar----
    @DiskCacher('pie_dec')
    def func3(data, n):
        results=[i*n for i in data]
        return results

    print('\n### pie decorator, 1st time call')
    res=func3(data, 10)
    print('res =', res)

    print('\n### pie decorator, 2nd time call.')
    res=func3(data, 10)
    print('res =', res)

The outputs:

### Context manager, 1st time call

# <writeCache>: Saving output to:
 /tmp/cache/[diskcache]_[context_mananger].nc
res = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]

### Context manager, 2nd time call

# <readCache>: Read in variable context_mananger from disk cache:
 /tmp/cache/[diskcache]_[context_mananger].nc
res = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]

### Context manager, 3rd time call with overwrite=True

# <writeCache>: Saving output to:
 /tmp/cache/[diskcache]_[context_mananger].nc
res = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]

### Wrap a new function, 1st time call

# <writeCache>: Saving output to:
 /tmp/cache/[diskcache]_[new_func].nc
res = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]

### Wrap a new function, 2nd time call

# <readCache>: Read in variable new_func from disk cache:
 /tmp/cache/[diskcache]_[new_func].nc
res = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]

### pie decorator, 1st time call

# <writeCache>: Saving output to:
 /tmp/cache/[diskcache]_[pie_dec].nc
res = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]

### pie decorator, 2nd time call.

# <readCache>: Read in variable pie_dec from disk cache:
 /tmp/cache/[diskcache]_[pie_dec].nc
res = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
Jason
  • 2,950
  • 2
  • 30
  • 50
0

Here's a solution I came up with which can:

  • memoize mutable objects (memoized functions should have no side effects that change mutable parameters or it won't work as expected)
  • writes to a separate cache file for each wrapped function (easy to delete the file to purge that particular cache)
  • compresses the data to make it much smaller on disk (a LOT smaller)

It will create cache files like:

cache.__main__.function.getApiCall.db
cache.myModule.function.fixDateFormat.db
cache.myOtherModule.function.getOtherApiCall.db

Here's the code. You can use whichever compression library you prefer, but I've found LZMA works best for the pickle storage we are using.

import dbm
import hashlib
import json
import pickle
# import bz2
import lzma

# COMPRESSION = bz2
COMPRESSION = lzma # better with pickle compression

# Create a @memoize_to_disk decorator to cache a memoize to disk cache
def memoize_to_disk(function, cache_filename=None):
    uniqueFunctionSignature = f'cache.{function.__module__}.{function.__class__.__name__}.{function.__name__}'
    if cache_filename is None:
        cache_filename = uniqueFunctionSignature
        # print(f'Caching to {cache_file}')
    def wrapper(*args, **kwargs):
        # Convert the dictionary into a JSON object (can't memoize mutable fields, this gives us an immutable, hashable function signature)
        if cache_filename == uniqueFunctionSignature:
            # Cache file is function-specific, so don't include function name in params
            params = {'args': args, 'kwargs': kwargs} 
        else:
            # add module.class.function name to params so no collisions occur if user overrides cache_file with the same cache for multiple functions
            params = {'function': uniqueFunctionSignature, 'args': args, 'kwargs': kwargs}

        # key hash of the json representation of the function signature (to avoid immutable dictionary errors)
        params_json = json.dumps(params)  
        key = hashlib.sha256(params_json.encode("utf-8")).hexdigest()  # store hash of key
        # Get cache entry or create it if not found
        with dbm.open(cache_filename, 'c') as db:
            # Try to retrieve the result from the cache
            try:
                result = pickle.loads(COMPRESSION.decompress(db[key]))
                # print(f'CACHE HIT: Found {key[1:100]=} in {cache_file=} with value {str(result)[0:100]=}')
                return result
            except KeyError:
                # If the result is not in the cache, call the function and store the result
                result = function(*args, **kwargs)
                db[key] = COMPRESSION.compress(pickle.dumps(result))
                # print(f'CACHE MISS: Stored {key[1:100]=} in {cache_file=} with value {str(result)[0:100]=}')
                return result
    return wrapper

To use the code, use the @memoize_to_disk decorator (with an optional filename parameter if you don't like "cache." as a prefix)

@memoize_to_disk
def expensive_example(n):
    # expensive operation goes here
    return value
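The key-derivation step above (a SHA-256 hash of the JSON-encoded call signature) can be pulled out and exercised on its own; call_key is an illustrative helper, not part of the answer's code:

```python
import hashlib
import json

def call_key(func_name, *args, **kwargs):
    # Serialize the call signature to JSON so it is stable and hashable,
    # then hash it to get a fixed-length cache key
    params = {'function': func_name, 'args': args, 'kwargs': kwargs}
    params_json = json.dumps(params, sort_keys=True)
    return hashlib.sha256(params_json.encode('utf-8')).hexdigest()
```

Identical calls map to identical 64-character hex keys, while any change in an argument produces a different key, which is what makes the dbm lookup above work.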
diamondsea
  • 2,980
  • 1
  • 12
  • 9