The importance of the reward

The impact of the reward choice

We will see how important it is to set a proper reward by playing with two hyperparameters: the discount factor and the exponent applied to the production in the raw reward.

A well-chosen reward can lead to significant improvements! Our bot, trained only to conquer the map as fast as possible, now systematically wins against the OpponentBot (which applies heuristics). This is best exemplified by the 3 games below.

Devising the reward

The reward is complex to devise since we take multiple actions at each turn, and we have to compute the reward for each of these individual actions.

Below is an illustration to help understand the process. The number written on each square is the reward associated with that square's current action. Notice that, at each turn, these rewards differ from square to square, and that when a square is about to conquer an adjacent square, the reward for its action is high.

The reward is even higher when the square being conquered is more productive.

HINT: highly productive squares have a brighter background, and poorly productive ones have a darker one.

Observe how the rewards evolve over time: a discount factor is already applied, because we encourage (and reward) actions that will eventually lead to a reward later on. Indeed, even the STILL squares earn rewards!
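To make this concrete, here is a minimal sketch of how such a backward-discounted reward could be computed for a single square, assuming we keep one raw reward per turn for that square (the function name and the value of gamma are hypothetical, not the exact implementation used here):

```python
import numpy as np

def discounted_rewards(raw_rewards, gamma=0.9):
    """Backtrack rewards through time with a discount factor.

    raw_rewards: one raw reward per turn for a given square (e.g. the
    production of a conquered square on the turn of the conquest, 0 otherwise).
    Returns an array where each action is also credited, discounted by gamma,
    for the rewards the later actions eventually earned.
    """
    discounted = np.zeros_like(raw_rewards, dtype=float)
    running = 0.0
    # Walk backwards: the action at turn t receives its own reward plus
    # gamma times whatever the following actions earned.
    for t in reversed(range(len(raw_rewards))):
        running = raw_rewards[t] + gamma * running
        discounted[t] = running
    return discounted

# Illustrative trace: a square stays STILL for 3 turns, then conquers a
# square of production 3 on the 4th turn.
print(discounted_rewards(np.array([0.0, 0.0, 0.0, 3.0])))
# -> [2.187, 2.43, 2.7, 3.0]: the earlier STILL actions also earn a reward.
```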


Understanding the discount factor

To better understand the discount factor, let's push it to its limits and look at the corresponding rewards for the exact same game.

As expected, these reward strategies fare badly compared to a more balanced discount factor. See the comparison below.
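Reusing the hypothetical `discounted_rewards` sketch above, the two extremes of the discount factor give very different credit assignments for the same trace:

```python
trace = np.array([0.0, 0.0, 0.0, 3.0])
print(discounted_rewards(trace, gamma=0.0))  # [0. 0. 0. 3.]  only the conquering move is rewarded
print(discounted_rewards(trace, gamma=1.0))  # [3. 3. 3. 3.]  every move is rewarded equally,
                                             # no matter how far away the conquest is
```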

Variation of the raw reward

Each reward is computed according to the production of the conquered square, and then “backtracked” to the actions that led to this reward.

But should this reward be proportional to the production? Wouldn't it be better to make it proportional to the square of the production? Or even to a higher power?

Indeed, we want to strongly encourage our bot to conquer highly productive squares, and an efficient way to enforce this is to give significantly greater rewards for the most productive squares.

All the previous examples used a reward proportional to the 4th power of the production. But let's also look at a squared (power 2) and a linear reward.
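Concretely, the raw conquest reward could look like the sketch below, where the exponent is the second hyperparameter we are playing with (the name `conquest_reward` is only for illustration):

```python
def conquest_reward(production, exponent=4):
    """Raw reward for conquering a square, before discounting.

    exponent = 1 gives a linear reward, 2 a quadratic one, and 4 (used in
    the examples above) strongly favors highly productive squares.
    """
    return production ** exponent

for exponent in (1, 2, 4):
    print(exponent, [conquest_reward(p, exponent) for p in (1, 2, 3)])
# 1 [1, 2, 3]
# 2 [1, 4, 9]
# 4 [1, 16, 81]
```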

The ratio changes

Let’s extract one frame from the above (see the gifs below). Let’s not focus on the absolute values of the rewards, but rather on the ratio between the rewards of different actions.

The two actions that we compare here are: (1) conquering a highly productive (brighter) square, and (2) conquering a less productive (darker) square.

We would want action (1) to be rewarded better than action (2). Indeed, look at the background color of the conquered squares: the square conquered in (1) is brighter than the one conquered in (2), and therefore more productive.

In all cases, reward(1) > reward(2). But if we look at the ratios (see the gifs below), we have, from left to right:

This illustrates that the higher the exponent of the reward, the greater the difference between the rewards of good and very good actions.
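To see why, take two hypothetical conquered squares of productions 3 and 2: the ratio of their raw rewards is (3/2)^exponent, which grows quickly with the exponent.

```python
for exponent in (1, 2, 4):
    print(exponent, (3 / 2) ** exponent)
# 1 -> 1.5
# 2 -> 2.25
# 4 -> 5.0625
```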

The performance

Depending on the choice of reward, training can be much slower, or even converge to a worse equilibrium. We should keep this in mind as we explore new strategies in the future.




Scaling up

What about the results on a larger map?

Our trained Bot still wins all the games against the OpponentBot when we increase the map size.

However, we notice that: