What's a simple way to design a memoization system with limited memory?

Question

I am writing a manual computation memoization system (ugh, in Matlab). The straightforward parts are easy:

A way to put data into the memoization system after performing a computation.
A way to query and get data out of the memoization.
A way to query the system for all the 'keys'.

These parts are not so much in doubt. The problem is that my computer has a finite amount of memory, so sometime the 'put' operation will have to dump some objects out of memory. I am worried about 'cache misses', so I would like some relatively simple system for dropping the memoized objects which are not used frequently and/or are not costly to recompute. How do I design that system? The parts I can imagine it having:

A way to tell the 'put' operation how costly (relatively speaking) the computation was.
A way to optionally hint to the 'put' operation how often the computation might be needed (going forward).
The 'get' operation would want to note how often a given object is queried.
A way to tell the whole system the maximum amount of memory to use (ok, that's simple).

The real guts of it would be in the 'put' operation when you hit the memory limit and it has to cull some objects, based on their memory footprint, their costliness, and their usefulness. How do I do that?

Sorry if this is too vague, or off-topic.

You may be better advised if you'll question is more specific. There's a ton of literature of cache design, but non will help unless you are more specific. And no kidding, if you are running out of memory, why not do the computations whit capable hardware enough? Nowadays that's very cheap compared to man hours burnt for designing proper cache behavior. Thanks — eat, Feb 17 '11 at 01:32
uh, yeah, thanks. I thought I was asking for a simple distillation of that vast body of literature into a few small rules of thumb. Re: computer cost versus man-hours, you can talk to my boss about buying hardware, or you can pay-pal me the money yourself... — shabbychef, Feb 17 '11 at 01:36

score 2 · Answer 1 · answered Feb 17 '11 at 02:15

I'd do it by creating a subclass to DYNAMICPROPS that uses a cell array to store the data internally. This way, you can dynamically add more data to the object.

Here's the basic design idea:

The data is stored in a cell array. Each property gets its own row, with the first column being the property name (for convenience), the second column a function handle to calculate the data, the third column the data, the fourth column the time it took to generate the data, the fifth column an array of, say, length 100 storing the timestamps corresponding to when the property was accessed the last 100 times, and the sixth column contains the variable size.

There is a generic get method that takes as input the row number corresponding to the property (see below). The get method first checks whether column 3 is empty. If no, it returns the value and stores the timestamp. If yes, it performs the computation using the handle from column 1 inside a TIC/TOC statement to measure how expensive the computation is (which is stored in col4, and the timestamp is stored in col5). Then it checks whether there is enough space for storing the result. If yes, it stores the data, otherwise it checks size, as well as the product of how many times data were accessed with how long it would take to regenerate, to decide what to cull.

In addition, there is an 'add' property, that adds a row to the cell array, creates a dynamic property (using addprops) of the same name as the function handle, and sets the get-method to myGetMethod(myPropertyIndex). If you need to pass parameters to the function, you can create an additional property myDynamicPropertyName_parameters with a set method that will remove previously calculated data whenever the parameters change value.

Finally, you can add a few dependent properties, that can tell how many properties there are (# of rows in the cell array), how they're called (first col of the cell array), etc.

I was more interested in the logic of dropping older cached objects based on costliness and usefulness. The part about having get and set methods is not difficult. In fact, I think I should not have mentioned the language at all. — shabbychef, Feb 17 '11 at 03:34

score 1 · Answer 2 · edited May 23 '17 at 11:55

1

Consider using Java, since MATLAB runs on top of it, and can access it. This would work if you have marshallable values (ints, doubles, strings, matrices, but not structs or cell arrays).

LRU containers are available for Java:

edited May 23 '17 at 11:55

Community

1
1

answered Feb 17 '11 at 13:18

Jason S

184,598
164
608
970

we are usually running with `-nojvm` option, because of memory restrictions, so Java is not an option. In reality, I should not have mentioned the language at all. The pointer to the acronym 'LRU' has been helpful at least for my google-fu. – shabbychef Feb 17 '11 at 19:12

What's a simple way to design a memoization system with limited memory?

2 Answers2

Linked