The importance of the reward
The impact of the reward choice
We will see how important it is to choose a proper reward by playing with two hyperparameters, i.e.:
- The discount factor - for the discounted rewards
- The reward expression, as a function of the production - the hyperparameter being the function itself
A well-chosen reward can lead to significant improvements! Our bot, trained only to conquer the map as fast as possible, now systematically wins against the OpponentBot (which applies heuristics). This is best exemplified by the three games below.
Devising the reward
The reward is complex to devise: we take multiple actions at each turn, and we have to compute a reward for each of these individual actions.
Below is an insightful illustration of the process. The numbers written on the squares are the rewards associated with the current action of each square. Notice that, at each turn, these rewards differ from square to square, and that when a square is about to conquer an adjacent square, the reward for its action is high. It is even higher when the conquered square is more productive.
HINT: highly productive squares have a brighter background, and poorly productive ones have a darker one.
Observe how the rewards evolve over time: a discount factor is already applied, because we encourage (reward) actions that will eventually lead to a reward later on. Indeed, even the STILL squares are earning rewards!
[GIF: Discount = 0.6]
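To make the backtracking concrete, here is a minimal sketch of how discounted rewards can be computed for a single square. The function name and array layout are assumptions for illustration, not the bot's actual code.

```python
import numpy as np

def discounted_rewards(raw_rewards, discount=0.6):
    """Backtrack immediate rewards through time for one square.

    raw_rewards: 1-D array, the immediate reward earned at each turn
    (for instance the production of a newly conquered neighbour raised
    to some power, and 0 for turns where nothing is conquered).
    """
    out = np.zeros(len(raw_rewards))
    running = 0.0
    # Walk the game backwards: each action is credited with the rewards
    # that follow it, scaled down by the discount at every step.
    for t in reversed(range(len(raw_rewards))):
        running = raw_rewards[t] + discount * running
        out[t] = running
    return out
```

This is why a STILL action a few turns before a conquest still earns a (smaller) reward: the future conquest is propagated backwards with one factor of the discount per turn.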
Understanding the discount factor
To better understand the discount factor, let's push it to its limits and look at the corresponding rewards for the exact same game.
- On the left, notice that when the discount factor is set to 0, only the moves that conquer a square are rewarded. This means that the STILL action for a square never gets rewarded, which is undesirable.
- On the other end, with a discount rate of 0.9, the rewards tend to be much higher overall. Yet this excessively uniform pattern doesn't really favor the actions that are actually good: too many actions are rewarded, even though they were potentially not efficient.
As expected, these reward strategies fare badly compared to a more balanced discount factor. See the comparison below.
[GIFs, left to right: Discount = 0.0, Discount = 0.9]
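For intuition, here is a small numerical example (with made-up numbers) of what the two extremes do to the credit received by earlier STILL actions, reusing the discounted_rewards sketch from above.

```python
# Toy example: a square stays STILL for three turns,
# then conquers a neighbour whose raw reward is 1.0.
raw = [0.0, 0.0, 0.0, 1.0]

print(discounted_rewards(raw, discount=0.0))  # -> 0, 0, 0, 1
print(discounted_rewards(raw, discount=0.6))  # -> 0.216, 0.36, 0.6, 1.0
print(discounted_rewards(raw, discount=0.9))  # -> 0.729, 0.81, 0.9, 1.0
```

With a discount of 0, the STILL turns get no credit at all; with 0.9, almost every turn is heavily rewarded regardless of how useful it was.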
Variation of the raw reward
Each reward is computed according to the production of the conquered square, and then "backtracked" to the actions that led to this reward.
But should this reward be proportional to the production? Wouldn't it be better to make it proportional to the square of the production? Or even to a higher power?
Indeed, we want to strongly encourage our bot to conquer highly productive squares, and an efficient way to enforce this is to give significantly greater rewards for the highly productive squares.
All the examples above had a reward proportional to the fourth power of the production. Let's now look at a power of 2 and a linear reward.
[GIFs, left to right: Power = 2 (Discount = 0.6), Power = 1 (Discount = 0.6)]
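A quick sketch of what this hyperparameter does (the function name and the example production values 10 and 5 are arbitrary illustrations, not values from the bot):

```python
def raw_reward(production, power=4):
    """Immediate reward for conquering a square of the given production."""
    return production ** power

# How much more a very productive square is worth than a mildly
# productive one, depending on the exponent:
for p in (1, 2, 4):
    print(p, raw_reward(10, p) / raw_reward(5, p))  # 2.0, 4.0, 16.0
```

Raising the power does not change which squares are worth more, only how strongly the bot is pushed towards the most productive ones.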
The ratio changes
Let's extract one frame from the above (see the frames below). Let's not focus on the absolute value of the rewards, but rather on the ratio between the rewards of different actions.
The two actions that we compare here are:
- The square on the top left that conquers its left neighbour (1)
- The square on the bottom right that conquers its above neighbour (2)
We would want action (1) to be better rewarded than action (2). Indeed, look at the background color of the conquered squares: the conquered square in (1) is brighter than the conquered square in (2), and therefore more productive.
In all cases, reward(1) > reward(2). But if we look at the ratio (see the frames below), we have, from left to right:
- 0.65/0.24 ≈ 2.7
- 0.93/0.49 ≈ 1.9
- 1.1/0.7 ≈ 1.6
This illustrates that the higher the exponent in the reward, the greater the difference between the rewards of good and very good actions.
[Frames, left to right: Power = 4, Power = 2, Power = 1 (all with Discount = 0.6)]
The performance
Depending on the choice of reward, the training can be much slower, or even converge to a worse equilibrium. We should keep this in mind as we explore new strategies in the future.
Scaling up
What about the results on a larger map?
Our trained Bot still wins all the games against the OpponentBot when we increase the map size.
However, we notice that:
- This solution takes too long to compute for each square individually
- Maybe we should only apply it to the squares on the border (and find another strategy for the squares in the center)
- We could save time by making only one call to the tensorflow session (see the sketch after this list). Besides, the extraction of the local game states would probably be faster on the tensorflow side.
- Squares in the middle have a suboptimal behaviour: they seem to systematically move to the left.
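As a rough illustration of the single-session-call idea, here is a hedged sketch of batching every square's local state into one TensorFlow call. The tensor and argument names (policy_logits, state_placeholder, local_states) are assumptions, not the bot's actual graph.

```python
import numpy as np

def batched_moves(sess, policy_logits, state_placeholder, local_states):
    """Run the policy network once for all squares instead of once per square.

    local_states: array of shape (num_squares, window, window, channels),
    one local view of the game per owned square (this layout is an assumption).
    """
    logits = sess.run(policy_logits,
                      feed_dict={state_placeholder: local_states})
    # One argmax per square, obtained from a single session call.
    return np.argmax(logits, axis=1)
```

Going further, the extraction of the local windows could itself be expressed as graph operations, which would remove most of the per-square Python overhead.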