Reinforcement Learning (RL) is not a newcomer. It has been applied to problems of very different natures for quite some time. Only recently, however, has it attracted interest as a tool for tackling combinatorial problems which, in their most complex versions, pose real challenges for many companies: routing, maintenance activity planning (especially predictive maintenance), production scheduling, etc.
The question is: what does it offer that classical prescriptive analytics approaches do not? And what new applications might the development of RL models open up?
According to DeepMind (2021), Reinforcement Learning is “the science and framework of learning to make decisions from interaction”. This definition points to two fundamental aspects.
Firstly, the discipline is oriented towards learning how to make decisions. That is, the aim is not just to make better decisions, but to build a way of learning how to make them. The developer of a Reinforcement Learning model is therefore not the one who decides by using a model, but the one who builds a model that can learn to make that decision.
Secondly, the source of learning is interaction. Other approaches need some kind of formal model of the context in which decisions must be made; Reinforcement Learning does not. Instead, through the actions it carries out and the information it receives in return, the agent gets to know that context and becomes able to choose the best action, learning through interaction.
Both of the above considerations are essential for complex problems requiring short response times. Classical prescriptive analytics approaches need a formal representation of the system and require an algorithm to be run each time new data becomes available. The problem is that such algorithms generally have runtimes longer than is acceptable for delivering the decision in time.
However, RL learns by “trying”; mind you, by trying intelligently. And after “testing a lot”, it can deliver all the distilled learning, typically in the form of a neural network. The neural network is a condensed form of what in RL is called a policy: it receives the input data and, within a negligible amount of time, proposes the next “best action”. That is, we do not have to wait for an algorithm to explore the candidate solutions, discarding the uninteresting ones and offering a good one or, ideally, the best one.
In a routing problem, for example, a policy can propose the next customer to visit based on the customers still to be visited, the traffic status and the probability of a successful delivery given the estimated arrival time.
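To make this concrete, here is a toy sketch (all names and data are hypothetical) of a policy as a plain function. A trained neural network would play the same role; a simple greedy rule is enough to show the interface: state in, next action out.

```python
import math

# Hypothetical sketch: a routing "policy" as a function mapping the current
# state (vehicle position, remaining customers, traffic) to the next action
# (which customer to visit). A trained network would replace the greedy rule.
def next_customer(current_pos, remaining, traffic_delay):
    """Pick the remaining customer with the lowest estimated travel time.

    current_pos: (x, y) position of the vehicle
    remaining: dict mapping customer id -> (x, y) location
    traffic_delay: dict mapping customer id -> extra minutes due to traffic
    """
    def est_time(cid):
        x, y = remaining[cid]
        dist = math.hypot(x - current_pos[0], y - current_pos[1])
        return dist + traffic_delay.get(cid, 0.0)

    return min(remaining, key=est_time)

# From the depot at (0, 0), customer "b" wins once traffic is accounted for:
# "a" costs 10 + 5 = 15, "b" costs 6 + 1 = 7.
choice = next_customer((0, 0), {"a": (10, 0), "b": (6, 0)}, {"a": 5.0, "b": 1.0})
```

The point is the shape of the computation: a single cheap function evaluation per decision, rather than a search run from scratch each time.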
Figure 1 depicts the essential aspects of the dynamics of a Reinforcement Learning system. Very briefly, an agent interacts with the environment. In each interaction, the agent performs an action. As a result of that action, the environment returns a reward to the agent and a change occurs in the environment, whose new state is the starting state from which the agent selects a new action and the process continues in the same way.
The essential elements of a Reinforcement Learning problem are as follows:
· Agent. The agent represents the decision-maker: it selects actions and, based on the information it receives from the environment in the form of rewards, learns which actions are the most convenient to achieve the highest possible value of the rewards accumulated over time.
· Action. Represents the agent’s decision that has some kind of impact on the environment. Each problem has its own action space, which in turn may depend on the state of the environment.
· Environment. The environment is everything that is outside the agent’s control and whose state depends, at least in part, on the agent’s actions. The environment can have deterministic or non-deterministic behaviour. In the first case, when for a given environment the agent selects an action, the new state is always the same. In the second case, the possible new state that the agent reaches is not always the same.
· State and observation. The state is the characterisation of the system at a given point in time. Depending on the type of problem, the environment may or may not be “fully observable”. If it is, what the agent observes about the environment is all the information about its state. If not, only partial information is available to the agent. In many cases, the state is referred to even when the environment is not fully observable.
· Policy. The policy is the essential element that governs the agent’s behaviour. A policy is a function that allows an action to be selected from the set of actions in the action space. Policies can be deterministic (given a state, the agent always selects the same action) or stochastic (given a state, there is a set of possible actions that the agent can select each with a given probability).
· Reward. In each interaction, the agent receives a reward depending on the new state of the environment. The design of the reward is essential for the agent to learn the desired behaviour.
· Return. The return is the cumulative sum of the rewards that the agent receives over time.
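The elements above can be sketched in a few lines of code. This is a deliberately minimal toy (all names hypothetical): a fixed stochastic policy in a one-dimensional corridor, showing the agent, the environment, the actions, the rewards and the return accumulated over one episode.

```python
import random

class ToyEnvironment:
    """A 1-D corridor: the agent starts at position 0 and wants to reach 5."""
    def __init__(self):
        self.state = 0

    def step(self, action):                      # action is -1 (left) or +1 (right)
        self.state = max(0, self.state + action)
        reward = 1.0 if self.state == 5 else -0.1   # reward shapes behaviour
        done = self.state == 5
        return self.state, reward, done

class ToyAgent:
    """A fixed stochastic policy: move right 90% of the time."""
    def select_action(self, state):
        return 1 if random.random() < 0.9 else -1

random.seed(0)
env, agent = ToyEnvironment(), ToyAgent()
state, ret, done = 0, 0.0, False
while not done:                                  # the interaction loop
    action = agent.select_action(state)          # agent acts...
    state, reward, done = env.step(action)       # ...environment responds
    ret += reward                                # the return: cumulative reward
```

Here the policy is hand-written and never improves; learning algorithms are precisely about updating it from the rewards observed in this loop.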
For each problem, each of the above elements has to be characterised or constructed and the ultimate goal is to have a policy (usually a trained neural network).
Characterising the neural network structure is no small task and requires expertise. Fortunately, there is a growing body of knowledge in the field of Deep Learning that applies to Reinforcement Learning.
In addition to the technical challenge of selecting the network structure, there are two other major problems.
The first is that the agent has to learn, and to do so it needs to interact with the environment. Obviously, it is not a good option for the agent to learn by making decisions that are implemented in the real system: before the agent had learned anything, the transport company would have gone bankrupt. The good news is that simulation allows complex environments to be represented with enough fidelity for the agent to learn by interacting with a simulation model of the system under study. Besides being economically sensible (routing poorly for several months is much more expensive than building a simulation model), learning is much faster because the behaviour of the system can be replicated on a machine, exploring thousands of days in a few minutes.
The second is training. Even with a simulation model and a neural network structure at the heart of an RL model, the training time and the resources consumed are considerable. Technology is also advancing here: we now have Tensor Processing Units, specially designed to work with tensors, which are the backbone of the information in a neural network. But the investment is worth it. The time and money spent on an agent that has “learned” allow us to make decisions much more quickly (in real-time), so that systems that require it can be operated much more accurately.
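As a rough illustration of what training against a simulated environment looks like, here is tabular Q-learning on a hypothetical toy corridor problem (all names and parameter values are illustrative). In realistic problems a neural network replaces the Q table, but the structure of the training loop is the same: thousands of cheap simulated episodes distilled into a policy.

```python
import random

N_STATES, GOAL = 6, 5           # corridor positions 0..5, goal at position 5
ALPHA, GAMMA, EPS = 0.5, 0.9, 0.1   # learning rate, discount, exploration rate

def step(state, action):        # the simulation model of the system
    nxt = min(max(state + action, 0), GOAL)
    reward = 1.0 if nxt == GOAL else -0.1
    return nxt, reward, nxt == GOAL

random.seed(42)
q = {(s, a): 0.0 for s in range(N_STATES) for a in (-1, 1)}

for _ in range(500):            # simulated episodes are cheap to run
    state, done = 0, False
    while not done:
        if random.random() < EPS:                    # explore
            action = random.choice((-1, 1))
        else:                                         # exploit current knowledge
            action = max((-1, 1), key=lambda a: q[(state, a)])
        nxt, reward, done = step(state, action)
        best_next = max(q[(nxt, -1)], q[(nxt, 1)])
        # Q-learning update rule
        q[(state, action)] += ALPHA * (reward + GAMMA * best_next - q[(state, action)])
        state = nxt

# The learned greedy policy: in every state, move right towards the goal.
policy = {s: max((-1, 1), key=lambda a: q[(s, a)]) for s in range(GOAL)}
```

Once trained, selecting an action is a cheap table (or network) lookup, which is what makes real-time operation possible.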
Do you have any questions about Reinforcement Learning?