Expansion at the border
As detailed in the previous blog articles, we jointly train the individual agents at the border of the map. As you can see below, we obtain an agent that performs well at small scale: it has learnt to prioritise conquering the highly productive squares (the bright ones).
To assess the confidence of our agent, we can look at the entropy of the learnt policy. For convenience, we implemented an interface that displays the softmax probabilities at time t when you click on an agent. It shows the five probabilities associated with the NORTH, EAST, SOUTH, WEST and STILL moves, and how the agent greedily selects the most likely one.
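For reference, here is a minimal sketch of the computation behind that display, assuming the policy head outputs five raw logits per square (the function and argument names are illustrative, not our exact code):

```python
import numpy as np

MOVES = ["NORTH", "EAST", "SOUTH", "WEST", "STILL"]

def policy_summary(logits):
    """Softmax probabilities, entropy, and greedy move for one square.

    logits : length-5 array of raw policy outputs (hypothetical name).
    """
    z = logits - np.max(logits)            # shift for numerical stability
    probs = np.exp(z) / np.exp(z).sum()    # softmax over the 5 moves
    entropy = -np.sum(probs * np.log(probs + 1e-12))  # low entropy = confident
    greedy = MOVES[int(np.argmax(probs))]  # the move the agent actually plays
    return probs, entropy, greedy
```

A near-uniform distribution (entropy close to log 5) flags a square where the agent is unsure, which is exactly what the interface lets us spot at a glance.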
The Dijkstra algorithm
The power of Dijkstra
So far, we have dealt with the border squares by learning a policy with a neural network.
The Dijkstra algorithm, which runs here in linear time, gives us the ability to handle the squares in the middle of the map:
Now, only the border squares' behaviour is determined by our trained policy. We adopt a deterministic strategy for the interior of the map.
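As an illustration, here is a minimal sketch of a multi-source Dijkstra that routes interior squares toward the border. It assumes each square has a traversal cost and that the map wraps around (drop the modulo if yours does not); the costs array and the cost model are assumptions for the example, not our exact implementation:

```python
import heapq

def dijkstra_from_border(costs, border):
    """Cheapest path cost from the border to every square of the map.

    costs  : 2D list of per-square traversal costs (illustrative cost model).
    border : iterable of (row, col) source squares on the territory border.
    """
    h, w = len(costs), len(costs[0])
    dist = [[float("inf")] * w for _ in range(h)]
    heap = []
    for r, c in border:                      # all border squares start at 0
        dist[r][c] = 0
        heapq.heappush(heap, (0, r, c))
    while heap:
        d, r, c = heapq.heappop(heap)
        if d > dist[r][c]:
            continue                         # stale heap entry, skip it
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = (r + dr) % h, (c + dc) % w   # assumed toroidal wrap
            nd = d + costs[nr][nc]
            if nd < dist[nr][nc]:
                dist[nr][nc] = nd
                heapq.heappush(heap, (nd, nr, nc))
    return dist
```

Each interior square can then deterministically move to the neighbour with the smallest distance. Note that with a priority queue the worst case is O(N log N); if all traversal costs are equal, the search degenerates into a breadth-first search, which is genuinely linear.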