This is a question about a map datastructure which stays memory efficient under heavy non-destructive manipulation.
Context
I am writing a little program whereby I generate "states" (which are just terms holding data) in a predicate iterate/2
according to some logic that can be disregarded for the purpose of the present discussion.
iterate/2
recursively invokes itself in a tail-recursive manner. At each invocation, iterate/2
generates more states and has to efficiently check whether any state has previously been seen.
The number of states "previously seen" can be considerable (a few tens of thousands). However, we are allowed to "forget about" (a fat percentage of) previously seen entries from time to time. This is a good thing as we can hope to keep overall memory requirements within bounds.
For efficient execution of a "previously seen" predicate, a "map" interface backed by an appropriate implementation seems to be a good choice (because one can implement lookup in maps efficiently).
Suppose we have such a map datastructure.
- For any newly generated state, compute a hash value (by whatever means) from said state, and insert the pair (hash,state) as (key,value) into the map.
- To check whether a state has already been previously seen, compute its hash value and check whether an entry corresponding to that hash value exists in the map. (For this exercise, we disregard the possibility of hash collisions)
The ever-different map datastructure is then passed between invocations of iterate
.
This being a logic programming language, we want to sustain the general idea of working in the timeless mathematical space of all possible by-definition-immutable structures, where we move from one structure to the next like a monkey swinging from a vine, instead of continually mutating a time-varying data structure as we do in imperative programming. We just hope that the garbage collector is faster than we are!
The map-related predicates that we offer could be:
map_new(?Map)
...to unify Map with the empty map.
map_get(+Map,+Key,?Val)
...to unify Val with the value associated to key Key in map Map or fail if Map contains no such Key (by allowing Key to be a variable, one could implement enumeration via backtracking).
map_put(+Map,+Key,?Val,?OldVal,?NewMap)
...to unify map NewMap with the map obtained by inserting the (Key,Val) pair into map Map. If map Map already contains an entry with key Key, the value of said entry is unified with OldVal.
map_rem(+Map,+Key,?OldVal,?NewMap)
...to unify map NewMap with the map obtained by removing the entry with key Key from map Map. The value associated to key Key in map Map is unified with OldVal. Fails if map Map does not contain key Key.
And we would use the above like this:
iterate(Solution) :-
map_new(Map),
iterate(Map,Solution).
iterate(M,S) :-
gen_fresh_states(ListFresh),
gen_stale_states(ListStale),
add_fresh_states(M,ListFresh,MM),
del_stale_states(MM,ListStale,MMM),
( termination_check(MMM,S) -> true ; iterate(MMM,S) ).
Where
add_fresh_states(M,ListFresh,MM),
del_stale_states(MM,ListStale,MMM),
iterate over the lists-of-states ListFresh and ListStale in the expected way, iteratively calling map_put/5
or map_rem/4
.
And -> ;
is the ->/2 predicate.
Problem
At this point, we need to choose a good implementation for the map. We do not want a structure that changes excessively whenever a single item is removed or added, so that the underlying implementation can keep copy operations and generally heap space low by referencing substructures (e.g. subtrees) of map M
whenever MM
is constructed from it during a map_put/5
or map_rem/4
.
(For example, i hear that Clojure uses efficient implementations in that sense for some of its 4[immutable maps], but I haven't looked into this closely.)
In SWI Prolog, we have AVL-tree-backed implementation of assoc. This is excellent for fast lookup. Unfortunately, AVL trees may change heavily whenever a tree node is inserted or removed. There may well not be much in common between the actual on-heap layout of maps M
, MM
, MMM
, or any map in between: The heap is likely to fill up rapidly. The implementation of assoc
is in Prolog btw. (/usr/lib64/swipl-7.2.3/library/assoc.pl
), there is no special sauce that goes into it.
Am I mistaken in thinking AVL tree-based maps won't be up to the task?
Alternatively, are there any map implementations (not necessarily in SWI Prolog) that are able to reuse substructure efficiently?
I suppose one can already alleviate heap usage by using "!" in between the subgoals of iterate/2
. This will tell the runtime that backtracking will not occur, and thus gives it license to destroy any on-heap structures that won't ever be used again. For example, a "!" after the call to del_state_states/3
potentially makes MM
eligible for garbage collection (I have no idea whether that really works though).