Reinforcement learning is a technique, common in computer science, in which a computer system learns how best to solve some problem through trial and error. Classic applications of reinforcement learning involve problems as diverse as robot navigation, network administration, and automated surveillance.
At the Association for Uncertainty in Artificial Intelligence’s annual conference this summer, researchers from MIT’s Laboratory for Information and Decision Systems (LIDS) and Computer Science and Artificial Intelligence Laboratory will present a new reinforcement-learning algorithm that, for a wide range of problems, allows computer systems to find solutions much more efficiently than previous algorithms did.
The paper also represents the first application of a new programming framework that the researchers developed, which makes it much easier to set up and run reinforcement-learning experiments. Alborz Geramifard, a LIDS postdoc and first author of the new paper, hopes that the software, dubbed RLPy (for reinforcement learning and Python, the programming language it uses), will allow researchers to more efficiently test new algorithms and compare algorithms’ performance on different tasks. It could also be a useful tool for teaching computer-science students about the principles of reinforcement learning.
Geramifard developed RLPy with Robert Klein, a master’s student in MIT’s Department of Aeronautics and Astronautics. RLPy and its source code were both released online in April.
Every reinforcement-learning experiment involves what’s called an agent, which in artificial-intelligence research is often a computer system being trained to perform some task. The agent might be a robot learning to navigate its environment, or a software agent learning how to automatically manage a computer network. The agent has reliable information about the current state of some system: The robot might know where it is in a room, while the network administrator might know which computers in the network are operational and which have shut down. But there’s some information the agent is missing — what obstacles the room contains, for instance, or how computational tasks are divided up among the computers.
Finally, the experiment involves a “reward function,” a quantitative measure of the progress the agent is making on its task. That measure could be positive or negative: The network administrator, for instance, could be rewarded for every failed computer it gets up and running but penalized for every computer that goes down.
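To make the idea of a reward function concrete, here is a minimal Python sketch of the network-administration example. The state encoding (one true/false flag per computer) and the exact scoring of +1 per repaired computer and -1 per failed one are assumptions chosen for illustration; they are not taken from the researchers' paper or from RLPy.

```python
from typing import Tuple

# Hypothetical state encoding: one flag per computer, True = operational.
State = Tuple[bool, ...]

def reward(prev: State, curr: State) -> int:
    """Score the change between two successive network states:
    +1 for each computer brought back up, -1 for each that went down."""
    if len(prev) != len(curr):
        raise ValueError("states must describe the same set of computers")
    repaired = sum(1 for before, after in zip(prev, curr) if not before and after)
    failed = sum(1 for before, after in zip(prev, curr) if before and not after)
    return repaired - failed

before = (True, False, False, True)
after = (False, True, True, True)
print(reward(before, after))  # two repaired, one failed -> prints 1
```

Because the measure can be negative, an agent that repairs one computer while two others fail ends up with a net penalty, which is what pushes it toward actions that keep the whole network healthy.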
The goal of the experiment is for the agent to learn a set of policies, or rules for choosing actions, that will maximize its reward in any state of the system. Part of that process is to evaluate each new policy over as many states as possible. But exhaustively canvassing all of the system’s states could be prohibitively time-consuming.
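The cost of exhaustive evaluation is easy to see in a toy version of the network problem: with one up-or-down flag per computer, 20 computers already yield more than a million states. The Python sketch below evaluates a policy on a random sample of starting states instead. Everything in it is an assumption made for illustration, including the failure model, the two example policies, and the function names; it is a generic sampling approach, not the algorithm described in the paper.

```python
import random
from typing import Callable, Optional, Tuple

State = Tuple[bool, ...]                   # True = computer is up (assumed encoding)
Policy = Callable[[State], Optional[int]]  # index of computer to reboot, or None

def step(state: State, action: Optional[int], fail_prob: float,
         rng: random.Random) -> Tuple[State, int]:
    """Toy dynamics: each running computer fails with probability fail_prob,
    then the chosen computer (if any) is rebooted and comes back up."""
    nxt = [up and rng.random() >= fail_prob for up in state]
    if action is not None:
        nxt[action] = True
    repaired = sum(1 for b, a in zip(state, nxt) if not b and a)
    failed = sum(1 for b, a in zip(state, nxt) if b and not a)
    return tuple(nxt), repaired - failed

def rollout(policy: Policy, start: State, horizon: int,
            fail_prob: float, rng: random.Random) -> int:
    """Total reward collected by following the policy for `horizon` steps."""
    state, total = start, 0
    for _ in range(horizon):
        state, r = step(state, policy(state), fail_prob, rng)
        total += r
    return total

def estimate_value(policy: Policy, n_computers: int, n_samples: int,
                   horizon: int = 5, fail_prob: float = 0.05,
                   seed: int = 0) -> float:
    """Average rollout reward over randomly sampled starting states."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_samples):
        start = tuple(rng.random() < 0.5 for _ in range(n_computers))
        total += rollout(policy, start, horizon, fail_prob, rng)
    return total / n_samples

def do_nothing(state: State) -> Optional[int]:
    return None

def reboot_first_down(state: State) -> Optional[int]:
    for i, up in enumerate(state):
        if not up:
            return i
    return None

n = 20
print(f"{2 ** n:,} possible states for {n} computers")  # 1,048,576
for name, policy in [("do_nothing", do_nothing),
                     ("reboot_first_down", reboot_first_down)]:
    print(name, estimate_value(policy, n, n_samples=500))
```

Sampling 500 starting states touches a tiny fraction of the state space, so the estimates are noisy. That trade-off between coverage and evaluation time is the difficulty the paragraph above describes.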
Consider, for instance, the network-administration problem. Suppose that the administrator has observed that in several cases, rebooting just a few computers restored the whole network. Is that a generally applicable solution?