I am trying to devise an iterative Markov decision process (MDP) agent in Python with the following characteristics:
- observable state
- I handle potentially 'unknown' state by reserving part of the state space for answering query-type moves made by the MDP agent: the state at t+1 identifies the previous query (or zero if the previous move was not a query) together with the embedded result vector. This region is zero-padded to a fixed length so the state frame stays aligned regardless of which query was answered (their result lengths may vary); a sketch of this framing follows the list
- actions that are not necessarily available in every state
- reward function may change over time
- policy convergence should be incremental and computed only per move
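
To make the query-result framing above concrete, here is a minimal sketch of the fixed-length state frame I have in mind. The names and sizes (`FRAME_LEN`, `encode_frame`, etc.) are purely illustrative, not part of any library:

```python
import numpy as np

# Illustrative sizes; the real ones depend on the problem.
BASE_STATE_LEN = 8     # length of the ordinary observable-state vector
MAX_RESULT_LEN = 16    # longest query result that ever needs embedding
FRAME_LEN = BASE_STATE_LEN + 1 + MAX_RESULT_LEN  # fixed total frame length

def encode_frame(base_state, query_id=0, result=()):
    """Pack the observable state plus an optional query answer into a
    fixed-length frame.  query_id is 0 when the previous move was not a
    query; the result vector is zero-padded so the frame stays aligned
    no matter which query (if any) was answered."""
    frame = np.zeros(FRAME_LEN, dtype=float)
    frame[:BASE_STATE_LEN] = base_state          # base_state must have length BASE_STATE_LEN
    frame[BASE_STATE_LEN] = query_id
    result = np.asarray(result, dtype=float)      # result length must be <= MAX_RESULT_LEN
    frame[BASE_STATE_LEN + 1 : BASE_STATE_LEN + 1 + len(result)] = result
    return frame
```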
So the basic idea is that the MDP agent should make its best-guess optimized move at time T using its current probability model (and, since that model is probabilistic, the chosen move is expected to be stochastic, i.e. possibly random), then couple the new input state at T+1 with the reward from the previous move at T and re-evaluate the model. The convergence must not be permanent, since the reward may drift or the set of available actions could change.
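
For reference, the per-move loop I am picturing looks roughly like a tabular Q-learning step. This is only a sketch of the idea under my own assumptions, not any existing library's API; `choose_action`, `update`, and the constants are placeholder names:

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1   # learning rate, discount, exploration rate
Q = defaultdict(float)                   # maps (state, action) -> value estimate

def choose_action(state, available_actions):
    """Epsilon-greedy choice restricted to the actions legal in this state."""
    if random.random() < EPSILON:
        return random.choice(available_actions)
    return max(available_actions, key=lambda a: Q[(state, a)])

def update(prev_state, prev_action, reward, new_state, new_actions):
    """One incremental update per move.  A constant ALPHA keeps old
    estimates being overwritten, so 'convergence' is never permanent
    and a drifting reward is tracked rather than averaged away."""
    best_next = max((Q[(new_state, a)] for a in new_actions), default=0.0)
    td_target = reward + GAMMA * best_next
    Q[(prev_state, prev_action)] += ALPHA * (td_target - Q[(prev_state, prev_action)])
```

(In practice the state frame would need to be hashable or discretized, e.g. `tuple(frame)`, before being used as a dictionary key.)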
What I'd like to know is whether there are any current Python libraries (preferably cross-platform, as I necessarily switch environments between Windows and Linux) that can already do this sort of thing, or could support it with suitable customization, e.g. derived-class support that allows redefining, say, the reward method with one's own.
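
By "derived-class support" I mean something along the lines of the sketch below, where the reward rule and the legal-action set live in overridable methods. The class and method names here are purely hypothetical, not the API of any existing library:

```python
class BaseAgentEnv:
    """Hypothetical base class (illustrative only).  A library offering
    this style of customization would let subclasses redefine reward()
    and available_actions() without touching the learning code."""

    def available_actions(self, state):
        raise NotImplementedError

    def reward(self, prev_state, action, new_state):
        raise NotImplementedError


class LineWalkEnv(BaseAgentEnv):
    """Toy subclass: walk along a line of 10 cells toward cell 9."""
    GOAL = 9

    def available_actions(self, state):
        moves = []
        if state > 0:
            moves.append(-1)   # stepping left is only legal away from the left edge
        if state < self.GOAL:
            moves.append(+1)   # stepping right is only legal before the goal
        return moves

    def reward(self, prev_state, action, new_state):
        # The reward rule can be swapped out (or changed over time)
        # independently of the agent's update logic.
        return 1.0 if new_state == self.GOAL else -0.01
```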
I'm finding that information about online, per-move MDP learning is rather scarce. Most uses of MDPs that I can find seem to focus on solving the entire policy as a preprocessing step.