
I am trying to devise an iterative Markov decision process (MDP) agent in Python with the following characteristics:

  • observable state
    • I handle a potential 'unknown' state by reserving part of the state space for answering query-type moves made by the DP: the state at t+1 identifies the previous query (or zero if the previous move was not a query) together with the embedded result vector. This space is padded with 0s to a fixed length so the state frame stays aligned regardless of which query was answered (their data lengths may vary); see the sketch after this list.
  • actions that may not be available in every state
  • reward function may change over time
  • policy convergence should be incremental and computed only per move
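
For concreteness, here is a minimal sketch of the zero-padded query/answer channel described above. The frame layout, sizes, and function name are my own hypothetical choices, not from any library:

```python
import numpy as np

# Hypothetical layout: [observation | query id | answer vector | zero padding]
FRAME_LEN = 16   # assumed fixed frame length
OBS_LEN = 8      # assumed length of the ordinary observation part

def build_state_frame(observation, query_id=0, query_result=()):
    """Pack the observation and the query/answer channel into one fixed-length vector."""
    frame = np.zeros(FRAME_LEN)
    frame[:OBS_LEN] = observation          # ordinary observable state (length OBS_LEN)
    frame[OBS_LEN] = query_id              # id of the query just answered (0 = no query)
    result = np.asarray(query_result, dtype=float)
    frame[OBS_LEN + 1 : OBS_LEN + 1 + len(result)] = result  # variable-length answer
    return frame                           # remaining entries stay 0 (padding)
```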

So the basic idea is that at time T the MDP makes its best-guess optimized move using its current probability model (and since the model is probabilistic, the chosen move may be stochastic), then couples the new input state at T+1 with the reward from the move at T and re-evaluates the model. Convergence must not be permanent, since the reward may shift or the available actions may change.
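
A minimal per-move sketch along these lines (the class name and hyper-parameters are my own, not from any library) is tabular Q-learning with a constant step size; keeping the step size constant means the estimates never freeze, so the policy can re-adapt when the reward function or the set of available actions changes:

```python
import random
from collections import defaultdict

class OnlineQAgent:
    """Hypothetical one-move-at-a-time Q-learning agent (not a library API)."""
    def __init__(self, alpha=0.1, gamma=0.95, epsilon=0.1):
        self.q = defaultdict(float)   # (state, action) -> value estimate
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def act(self, state, available_actions):
        # epsilon-greedy: usually the current best guess, occasionally exploration
        if random.random() < self.epsilon:
            return random.choice(available_actions)
        return max(available_actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state, next_available_actions):
        # couple the reward from the move at T with the state observed at T+1
        best_next = max((self.q[(next_state, a)] for a in next_available_actions),
                        default=0.0)
        target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])
```

States must be hashable, so a frame built as a NumPy array would need to be converted, e.g. with `tuple(frame)`, before being used as a key.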

What I'd like to know is whether there are any current Python libraries (preferably cross-platform, as I regularly switch between Windows and Linux) that can already do this sort of thing, or could support it with suitable customization, e.g. derived-class support that lets me redefine, say, the reward method with my own.

I'm finding that information about on-line, per-move MDP learning is rather scarce. Most uses of MDPs that I can find seem to focus on solving the entire policy as a preprocessing step.

  • This isn't Python-specific, but I found [a research paper on a technique for an On-line MDP](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.130.8186) (the PDF link is on the right, under "Cached"). It may be interesting to look over, although I'm not sure if it can fulfill your goal. – voithos Feb 05 '12 at 02:42
  • Incremental/decremental (i.e. online) SVR techniques are also starting to appear in academia; however, there are no Python libraries for these as of yet. – Brian Jack Feb 15 '13 at 09:29
  • A POMDP may mitigate the need for specialized information-gathering moves... The whole 0-padded answer channel in the state space feels like a hack, though it would allow a reward function to reward the agent for asking the right questions... so I'm still unsure about this... – Brian Jack Feb 15 '13 at 09:33
  • I could also use a recurrent LSTM network to estimate the reward (as a regression problem) over time based on time series input from the environment. – Brian Jack Jul 22 '13 at 18:54

1 Answer


Here is a Python toolbox for MDPs.

Caveat: It's for vanilla textbook MDPs and not for partially observable MDPs (POMDPs), or any kind of non-stationarity in rewards.

Second caveat: I found the documentation to be really lacking. You have to look in the Python code to find out what it implements, or you can quickly look at the documentation for the similar toolbox they have for MATLAB.
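
Assuming the toolbox in question is pymdptoolbox (installable with `pip install pymdptoolbox`), a minimal usage sketch looks like the following. Note that, as per the first caveat, it solves the whole policy up front rather than per move:

```python
# Quick-start sketch, assuming the pymdptoolbox package.
import mdptoolbox.example
import mdptoolbox.mdp

P, R = mdptoolbox.example.forest()               # toy transition and reward matrices
vi = mdptoolbox.mdp.ValueIteration(P, R, 0.96)   # discounted value iteration
vi.run()
print(vi.policy)                                 # one optimal action per state
```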
