
I'm using @functools.lru_cache in Python 3.3. I would like to save the cache to a file, in order to restore it when the program is restarted. How can I do that?

Edit 1: A possible solution would be to pickle any sort of callable.

The problem is pickling `__closure__`:

_pickle.PicklingError: Can't pickle <class 'cell'>: attribute lookup builtins.cell failed

If I try to restore the function without it, I get:

TypeError: arg 5 (closure) must be tuple
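
A minimal reproduction of the error (Python 3.3; make_adder is just an illustrative closure, not code from my program):

import pickle

def make_adder(n):
    def add(x):
        return x + n
    return add

adder = make_adder(1)

# the closure is a tuple of cell objects, which pickle cannot handle
pickle.dumps(adder.__closure__)
# _pickle.PicklingError: Can't pickle <class 'cell'>: attribute lookup builtins.cell failed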
– Francesco Frassinelli
  • Note that I think the LRU cache implementation is going to be replaced by a C implementation in Python 3.4 or 3.5, so any attempt at extracting the cache contents is probably not going to be future-proof. – Martijn Pieters Mar 23 '13 at 10:53
  • Just avoid `lru_cache`. Is it important for your function to have an `lru_cache`, or is a simple cache enough? Otherwise you can re-implement `lru_cache` and add the functionality you want. – Bakuriu Mar 23 '13 at 11:22
  • @Bakuriu: a simple cache is enough. I found `lru_cache` and was wondering whether it's possible to save its state. – Francesco Frassinelli Mar 23 '13 at 11:29

7 Answers


You can't do what you want using lru_cache, since it doesn't provide an API to access the cache, and it might be rewritten in C in future releases. If you really want to save the cache you have to use a different solution that gives you access to the cache.

It's simple enough to write a cache yourself. For example:

from functools import wraps

def cached(func):
    """Memoizing decorator: caches the result for each set of positional args."""
    @wraps(func)
    def wrapper(*args):
        try:
            # return the previously computed result
            return wrapper.cache[args]
        except KeyError:
            # compute, store, and return the result
            wrapper.cache[args] = result = func(*args)
            return result
    # store the cache on the wrapper itself so that assigning to it later
    # (e.g. when restoring from a file) actually takes effect
    wrapper.cache = {}
    return wrapper

You can then apply it as a decorator:

>>> @cached
... def fibonacci(n):
...     if n < 2:
...         return n
...     return fibonacci(n-1) + fibonacci(n-2)
... 
>>> fibonacci(100)
354224848179261915075

And retrieve the cache:

>>> fibonacci.cache
{(32,): 2178309, (23,): 28657, ... }

You can then pickle/unpickle the cache as you please and load it with:

fibonacci.cache = pickle.load(cache_file_object)
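
For instance, a minimal sketch of persisting it across runs (the file name fib_cache.pkl is arbitrary):

import os
import pickle

CACHE_FILE = 'fib_cache.pkl'

# restore the cache saved by a previous run, if any
if os.path.exists(CACHE_FILE):
    with open(CACHE_FILE, 'rb') as f:
        fibonacci.cache = pickle.load(f)

fibonacci(100)

# save the cache before exiting, so the next run starts warm
with open(CACHE_FILE, 'wb') as f:
    pickle.dump(fibonacci.cache, f)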

I found a feature request on Python's issue tracker to add dumps/loads to lru_cache, but it was not accepted/implemented. Maybe in the future built-in support for these operations will be available via lru_cache.

– Bakuriu
  • Thanks for the code, I'll try it; I think it could be a good solution. I'm the creator of that feature request ;) Look at the date. – Francesco Frassinelli Mar 23 '13 at 13:46
  • Depending on the exact use case, it might be worth building the cache with [shelves](https://docs.python.org/3.5/library/shelve.html), which are basically persistent dicts. – Michael Mauderer Mar 02 '16 at 09:27
  • This actually does not work! You can save the cache to disk with pickle this way – but loading it as stated does not work. – Nudin Feb 15 '18 at 16:02
  • @Nudin Do you mean that setting `fibonacci.cache` did not work? Yeah, it should have been `fibonacci.__wrapped__.cache = ...`. I changed the decorator slightly and now it should work as intended. – Bakuriu Feb 15 '18 at 19:29
  • This isn't an LRU cache; it will grow indefinitely. – c z May 26 '20 at 15:42
  • Any way to put a limit on it? – Mansour.M Oct 01 '20 at 20:00
  • For my purposes I used a `collections.deque(maxlen=n)` and then just `append` for the updating step (the leftmost item is popped automatically); a bounded sketch along these lines follows this thread. – cards Feb 10 '23 at 20:08
  • ...and I assigned the function's attribute `cache` to the decorator itself and not to the target function `func`. – cards Feb 10 '23 at 20:16
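
Following up on the size-limit question in this thread, here is a minimal sketch of a bounded variant built on collections.OrderedDict (the names lru_cached and maxsize are illustrative, not part of the original answer):

from collections import OrderedDict
from functools import wraps

def lru_cached(maxsize=128):
    """Like cached() above, but evicts the least recently used entry once full."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args):
            cache = wrapper.cache
            if args in cache:
                cache.move_to_end(args)    # mark as most recently used
                return cache[args]
            result = func(*args)
            cache[args] = result
            if len(cache) > maxsize:
                cache.popitem(last=False)  # drop the least recently used entry
            return result
        wrapper.cache = OrderedDict()
        return wrapper
    return decorator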

You can use a library of mine, mezmorize:

import random
from mezmorize import Cache

cache = Cache(CACHE_TYPE='filesystem', CACHE_DIR='cache')


@cache.memoize()
def add(a, b):
    return a + b + random.randrange(0, 1000)

>>> add(2, 5)
727
>>> add(2, 5)
727

Despite the random component, the second call returns the same value because the result is served from the filesystem cache.
– reubano

Consider using joblib.Memory for persistent caching to the disk.

Since the disk is enormous, there's no need for an LRU caching scheme.
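
A minimal sketch, assuming a recent joblib where the first argument to Memory is the cache directory (the directory name joblib_cache is arbitrary):

from joblib import Memory

# results are pickled under ./joblib_cache and survive restarts
memory = Memory('joblib_cache', verbose=0)

@memory.cache
def square(x):
    print('computing', x)
    return x * x

The first call to square(3) actually computes; later calls, even in a new process, read the result back from disk.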

– Will

You are not supposed to touch anything inside the decorator implementation except for the public API, so if you want to change its behavior you probably need to copy its implementation and add the necessary functions yourself. Note that the cache is currently stored as a circular doubly linked list, so you will need to take care when saving and loading it.
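
For reference, the public hooks lru_cache does expose are cache_info() and cache_clear(), and neither gives access to the stored values:

from functools import lru_cache

@lru_cache(maxsize=128)
def square(x):
    return x * x

square(2)
square(2)
print(square.cache_info())  # CacheInfo(hits=1, misses=1, maxsize=128, currsize=1)
square.cache_clear()        # empties the cache; there is no hook for dumping it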

– wRAR

This is something I wrote that might be helpful: devcache.

It's designed to help you speed up iterations for long-running methods. It's configurable with a config file:

@devcache(group='crm')
def my_method(a, b, c):  
    ...        

@devcache(group='db')
def another_method(a, b, c): 
    ...        

The cache can be refreshed or used selectively with a YAML config file like:

refresh: false # refresh true will ignore use_cache and refresh all cached data 
props:
    1:
        group: crm
        use_cache: false
    2:
        group: db
        use_cache: true

This would refresh the cache for `my_method` and use the cache for `another_method`.

It's not going to help you pickle the callable, but it does the caching part, and it would be straightforward to modify the code to add specialized serialization.

– pcauthorn
  • `devcache` seems to be an interesting library which uses sqlite as a cache, with extra configuration, to be used for more complex cases. If that is the case, wouldn't it be better to rely on memcached or Redis, which were born to do that? It could be interesting to add such a comparison to your README.md file. Note: if a query in your database takes minutes, there could be something wrong, or you could consider using materialized views. – Francesco Frassinelli May 16 '21 at 14:13
  • Thanks for checking it out. Good point on Redis: making the data storage configurable, including memcached or Redis, is the approach I see for this. If there is a need I'll put that on the roadmap. The main value add of `devcache` is being able to selectively invalidate the cache and choose which parameters to include in the cache key. Totally agree on the time to get data from the database; the database I'm working on has a pretty rough data model and it's not in the scope of the project to rework it or add views. – pcauthorn May 17 '21 at 12:45

If your use-case is to cache the result of computationally intensive functions in your pytest test suites, pytest already has a file-based cache. See the docs for more info.

This being said, I had a few extra requirements:

  1. I wanted to be able to call the cached function directly in the test instead of from a fixture
  2. I wanted to cache complex python objects, not just simple python primitives/containers
  3. I wanted an implementation that could refresh the cache intelligently (or be forced to invalidate only a single key)

Thus I came up with my own wrapper for the pytest cache, which you can find below. The implementation is fully documented, but if you need more info let me know and I'll be happy to edit this answer :)

Enjoy:

from base64 import b64encode, b64decode
import hashlib
import inspect
import pickle
from types import SimpleNamespace
from typing import Any, Optional

import pytest

__all__ = ['cached']

@pytest.fixture
def cached(request):
    def _cached(func: callable, *args, _invalidate_cache: bool = False, _refresh_key: Optional[Any] = None, **kwargs):
        """Caches the result of func(*args, **kwargs) cross-testrun.
        Cache invalidation can be performed by passing _invalidate_cache=True or a _refresh_key can
        be passed for improved control on invalidation policy.

        For example, given a function that executes a side effect such as querying a database:

            result = query(sql)
        
        can be cached as follows:

            refresh_key = query(sql=fast_refresh_sql)
            result = cached(query, sql=slow_or_expensive_sql, _refresh_key=refresh_key)

        or can be directly invalidated if you are doing rapid iteration of your test:

            result = cached(query, sql=sql, _invalidate_cache=True)
        
        Args:
            func (callable): Callable that will be called
            _invalidate_cache (bool, optional): Whether or not to invalidate_cache. Defaults to False.
            _refresh_key (Optional[Any], optional): Refresh key to provide a programmatic way to invalidate cache. Defaults to None.
            *args: Positional args to pass to func
            **kwargs: Keyword args to pass to func

        Returns:
            The result of func(*args, **kwargs), possibly loaded from the cache
        """
        # get debug info
        # see https://stackoverflow.com/a/24439444/4442749
        try:
            func_name = getattr(func, '__name__', repr(func))
        except Exception:
            func_name = '<function>'
        try:
            caller = inspect.getframeinfo(inspect.stack()[1][0])
        except Exception:
            # fall back to a placeholder so the log lines below still work
            caller = SimpleNamespace(filename='<file>', lineno=0)
        
        call_key = _create_call_key(func, None, *args, **kwargs)

        cached_value = request.config.cache.get(call_key, {"refresh_key": None, "value": None})
        value = cached_value["value"]

        current_refresh_key = str(b64encode(pickle.dumps(_refresh_key)), encoding='utf8')
        cached_refresh_key = cached_value.get("refresh_key")

        if (
            _invalidate_cache # force invalidate
            or cached_refresh_key is None # first time caching this call
            or current_refresh_key != cached_refresh_key # refresh_key has changed
        ):
            print("Cache invalidated for '%s' @ %s:%d" % (func_name, caller.filename, caller.lineno))
            result = func(*args, **kwargs)
            value = str(b64encode(pickle.dumps(result)), encoding='utf8')
            request.config.cache.set(
                key=call_key,
                value={
                    "refresh_key": current_refresh_key,
                    "value": value
                }
            )
        else:
            print("Cache hit for '%s' @ %s:%d" % (func_name, caller.filename, caller.lineno))
            result = pickle.loads(b64decode(bytes(value, encoding='utf8')))
        return result
    return _cached

_args_marker = object()
_kwargs_marker = object()

def _create_call_key(func: callable, refresh_key: Any, *args, **kwargs):
    """Produces a hex hash str of the call func(*args, **kwargs)"""
    # producing a key from func + args
    # see https://stackoverflow.com/a/10220908/4442749
    call_key = pickle.dumps(
        (func, refresh_key) +
        (_args_marker, ) +
        tuple(args) +
        (_kwargs_marker,) +
        tuple(sorted(kwargs.items()))
    )
    # create a hex digest of the key for the filename
    m = hashlib.sha256()
    m.update(bytes(call_key))
    return m.digest().hex()
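
For instance, a test might use the fixture like this (run_query and the SQL string are hypothetical placeholders for an expensive call):

def run_query(sql):
    ...  # stand-in for a slow database query

def test_customer_report(cached):
    # the expensive call runs once; later test runs read the pickled
    # result from pytest's file-based cache
    data = cached(run_query, sql='SELECT * FROM customers')
    assert data is not None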
    
– Philippe Hebert

After many years, I'm replying to my own question to add another possible solution, using Walrus and Redis:

from walrus import Database

db = Database()
cache = db.cache()

@cache.cached(timeout=60)
def add_numbers(a, b):
    return a + b

Reference: https://walrus.readthedocs.io/en/latest/api.html#walrus.Cache.cached

This makes it possible to take advantage of Redis features, such as specifying when and how to delete data, via a custom redis.conf file.

Here is an example configuration, which evicts the least frequently used keys (among those with an expiry set) once more than 150 MB of memory is in use:

maxmemory 150mb
maxmemory-policy volatile-lfu
– Francesco Frassinelli