I am currently reading the Dynamic Programming & MDP of Ronald Howard. Particularly in page 29 he presents the toymaker example with two different policies 1 and 2.Each policy has a transition probability matrix and a reward matrix.
# Set up initial variables for Policy 1
P1 <- matrix(c(0.5, 0.5, 0.4, 0.6), nrow = 2, byrow = TRUE);rownames(P1)=c("1","2");P1
R1 <- matrix(c(9, 3, 3, -7), nrow = 2, byrow = TRUE);rownames(R1)=c("1","2");R1
# Set up initial variables for Policy 2
P2 <- matrix(c(0.8, 0.2, 0.7, 0.3), nrow = 2, byrow = TRUE);rownames(P2)=c("1","2");P2
R2 <- matrix(c(4, 4, 1, -19), nrow = 2, byrow = TRUE);rownames(R2)=c("1","2");R2
Now bind these 4 matrices according to their policies:
library(dplyr)
P = as_tibble(rbind(P1,P2))%>%
mutate(policy = rep(c(1,2),2))%>%
arrange(policy)%>%
rename(p1 =V1,p2 = V2);P
R = as_tibble(rbind(R1,R2))%>%
mutate(policy2 = rep(c(1,2),2))%>%
arrange(policy2)%>%
rename(r1 =V1,r2 = V2);R
cbind(P,R)%>%
select(-policy2)%>%
relocate(.,policy,.after = r2)%>%
group_by(policy)
the result is :
# A tibble: 4 × 5
# Groups: policy [2]
p1 p2 r1 r2 policy
<dbl> <dbl> <dbl> <dbl> <dbl>
1 0.5 0.5 9 3 1
2 0.8 0.2 4 4 1
3 0.4 0.6 3 -7 2
4 0.7 0.3 1 -19 2
the q_{I}^k = \sum_{j=1}^{N}p_{ij}^k r_{ij}^k
for k =1,2, where k are the two policies.Sctually is the row sum of element by element P*R matrices for each policy.
I want to implement the sequential value iteration as described in his book.In general I want from this data frame to calculate the 3.3 equation as shown in the picture below and the ideal output to be the table as shown below.
How can I do it in R ?
Edit (additional note)
for one policy as described in page 19 is the following
# Set up initial variables
> P = matrix(c(0.5, 0.5, 0.4, 0.6), nrow = 2, byrow = TRUE)
> R = matrix(c(9, 3, 3, -7), nrow = 2, byrow = TRUE)
> Q = rowSums(P*R)
> n = 5
> v = matrix(0, nrow = 2, ncol = n+1) # Initialize v(0) to be all zeros
> v[,1] = 0 # Initialize v(0) to be all zeros
> # total-value vector.
> # Calculate values for different n using matrix calculations
> for (i in 1:n) {
+ v[,i+1] = Q + P %*% v[,i]
+ }
> print(v)
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 0 6 7.5 8.55 9.555 10.5555
[2,] 0 -3 -2.4 -1.44 -0.444 0.5556