IEEE Transactions on Automatic Control, Vol.65, No.8, 3663-3670, 2020
Renewal Monte Carlo: Renewal Theory-Based Reinforcement Learning
An online reinforcement learning algorithm called renewal Monte Carlo (RMC) is presented. RMC works for infinite-horizon Markov decision processes with a designated start state. RMC is a Monte Carlo algorithm that retains the key advantages of Monte Carlo, namely simplicity, ease of implementation, and low bias, while circumventing its main drawbacks, namely high variance and delayed updates. Given a parameterized policy $\pi_\theta$, the algorithm consists of three parts: estimating the expected discounted reward $R_\theta$ and the expected discounted time $T_\theta$ over a regenerative cycle; estimating the derivatives $\nabla_\theta R_\theta$ and $\nabla_\theta T_\theta$; and updating the policy parameters using stochastic approximation to find the roots of $R_\theta \nabla_\theta T_\theta - T_\theta \nabla_\theta R_\theta$. It is shown that under mild technical conditions, RMC converges to a locally optimal policy. It is also shown that RMC works for post-decision state models. An approximate version of RMC is proposed in which a regenerative cycle is defined as successive visits to a prespecified "renewal set". It is shown that if the value function of the system is locally Lipschitz on the renewal set, then RMC converges to an approximate locally optimal policy. Three numerical experiments illustrate RMC and compare it with other state-of-the-art reinforcement learning algorithms.
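To make the three-step structure above concrete, the following is a minimal sketch of the RMC loop on a hypothetical toy chain MDP whose designated start state is 0. The per-state sigmoid policy, the dynamics in `step`, the step size, and the likelihood-ratio (REINFORCE-style) per-cycle estimates of $\nabla_\theta R_\theta$ and $\nabla_\theta T_\theta$ are all illustrative assumptions, not the paper's construction. The update ascends $J_\theta = R_\theta / T_\theta$; since $\nabla_\theta J_\theta \propto T_\theta \nabla_\theta R_\theta - R_\theta \nabla_\theta T_\theta$, its stationary points coincide with the roots of $R_\theta \nabla_\theta T_\theta - T_\theta \nabla_\theta R_\theta$ from the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

GAMMA = 0.9      # discount factor
N_STATES = 4     # toy chain; state 0 is the designated start / renewal state
ALPHA = 0.01     # stochastic-approximation step size (illustrative)

def policy_probs(theta, s):
    """Action probabilities in state s under a per-state sigmoid policy (assumed form)."""
    p = 1.0 / (1.0 + np.exp(-theta[s]))
    return np.array([1.0 - p, p])

def step(s, a):
    """Hypothetical toy dynamics: action 1 moves right (reward 1 at the far end),
    action 0 moves back toward the renewal state 0."""
    s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    r = 1.0 if s_next == N_STATES - 1 else 0.0
    if s_next == N_STATES - 1:
        s_next = 0  # reaching the end restarts the cycle at the start state
    return s_next, r

def run_cycle(theta):
    """Simulate one regenerative cycle: start at state 0, stop on the next visit to 0.
    Returns the discounted reward R, the discounted time T, and the score
    sum_t grad log pi(a_t | s_t) accumulated over the cycle."""
    s, t = 0, 0
    R, T = 0.0, 0.0
    score = np.zeros_like(theta)
    while True:
        probs = policy_probs(theta, s)
        a = rng.choice(2, p=probs)
        score[s] += a - probs[1]  # d/d theta[s] of log pi(a|s) for the sigmoid policy
        s, r = step(s, a)
        R += GAMMA ** t * r
        T += GAMMA ** t
        t += 1
        if s == 0:
            return R, T, score

theta = np.zeros(N_STATES)
R_avg, T_avg = 0.0, 1.0
for k in range(1, 20001):
    R, T, score = run_cycle(theta)
    # Part 1: running estimates of R_theta and T_theta over regenerative cycles
    R_avg += (R - R_avg) / k
    T_avg += (T - T_avg) / k
    # Part 2: likelihood-ratio estimates of grad R_theta and grad T_theta
    grad_R = R * score
    grad_T = T * score
    # Part 3: stochastic-approximation step ascending J = R/T
    theta += ALPHA * (T_avg * grad_R - R_avg * grad_T)

print("theta:", theta, " estimated J:", R_avg / T_avg)
```

In this sketch the gradient estimates use a single score-function term per cycle, which keeps the code short but is high-variance; the renewal structure is what allows the update to fire once per cycle rather than waiting for a full infinite-horizon return.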