Adding Random Exploration

brain_idea.jpg

Last week, we finally built a pipeline to use machine learning on our maze game. We made a Tensor Flow graph that could train a "brain" with weights so we could navigate the maze. This week, we'll see how our training works, or rather how it doesn't work. We'll consider how randomizing moves during training might help.

Our machine learning code lives in this repository. For this article, you'll want to look at the randomize-moves branch. Take a look here for the original game code. You'll want the q-learning branch in the main repo.

This part of the series uses Haskell and Tensor Flow. To learn more about using these together, download our Haskell Tensor Flow Guide!

Unsupervised Machine Learning

With a few tweaks, we can run our game using the new output weights. But what we'll find as we train the weights is that our bot never seems to win! It always seems to do the same thing! It might move up and then get stuck because it can't move up anymore. It might stand still the whole time and let the enemies come grab it. Why would this happen?

Remember that reinforcement learning depends on being able to reinforce good behaviors. Thus at some point, we have to hope our AI will win the game. Then it will get the good reward so that it can change its behavior to adapt and get good results more often. But if it never gets a good result in the whole training process, it will never learn good behaviors!

This is part of the challenge of unsupervised learning. In a supervised learning algorithm, we have specific good examples to learn from. One way to approach this would be to record our own moves of playing the game. Then the AI could learn directly from us! We'll probably try this approach in the future!

But q-learning is an unsupervised algorithm. We're forcing our AI to explore the world and learn for its own. But right now, it's only making moves that it thinks are "optimal." But with a random set of weights, the "optimal" moves aren't very optimal at all! Part of a good "exploration" plan means letting it choose moves from time to time that don't seem optimal.

Adding a Random Choice

As our first attempt to fix this, we'll add a "random move chance" to our training process. At each training step, our network chooses its "best" move, and we use that to update the world state. From now on, whenever we do this, we'll roll the dice. And if we get a number below our random chance, we'll pick a random move instead of our "best" move.

Over the course of training though, we want to decrease this random chance. In theory, our AI should be better as we train the network. So as we get closer to the end of training, we'll want to make fewer random decisions, and more "best" decisions. We'll aim to start this parameter as 1 in 5, and reduce it down to 1 in 50 as training continues. So how do we implement this?

First of all, we want to keep track of a value representing our chance of making a random move. Our runAllIterations function should be stateful in this parameter.

-- Third "Float" parameter is the random chance
runAllIterations :: Model -> World
  -> StateT ([Float, Int, Float) Session ()
...

trainGame :: World -> Session (Vector Float)
trainGame w = do
  model <- buildModel
  let initialRandomChance = 0.2
  (finalReward, finalWinCount, _) <- execStateT
    (runAllIterations model w)
    ([], 0, initialRandomChance)
  run (readValue $ weightsT model)

Then within runAllIterations, we'll make two changes. First, we'll make a new random generator for each training game. Then, we'll update the random chance, reducing it with the number of iterations:

runAllIterations :: Model -> World
  -> StateT ([Float, Int, Float) Session ()
runAllIterations model initialWorld = do
  let numIterations = 2000
  forM [1..numIterations] $ \i -> do
    gen <- liftIO getStdGen
    (wonGame, (_, finalReward, _)) <- runStateT
      (runWorldIteration model)
      (initialWorld, 0.0, gen)
    (prevRewards, prevWinCount, randomChance) <- get
    let modifiedRandomChance = 1.0 / ((fromIntegral i / 40.0) + 5)
    put (newRewards, newWinCount, modifiedRandomChance)
  return ()

Making Random Moves

We can see now that runWorldIteration must now be stateful in the random generator. We'll retrieve that as well as the random chance at the start of the operation:

runWorldIteration :: Model -> StateT (World, Float, StdGen)
  (StateT ([Float], Int, Float) Session) Bool
runWorldIteration model = do
  (prevWorld, prevReward, gen) <- get
  (_, _, randomChance) <- lift get
  ...

Now let's refactor our serialization code a bit. We want to be able to make a new move based on the index, without needing the weights:

moveFromIndex :: Int -> PlayerMove
moveFromIndex bestMoveIndex =
  PlayerMove moveDirection useStun moveDirection
  where
    moveDirection = case bestMoveIndex `mod` 5 of
      0 -> DirectionUp
      1 -> DirectionRight
      2 -> DirectionDown
      3 -> DirectionLeft
      4 -> DirectionNone

Now we can add a function that will run the random generator and give us a random move if it's low enough. Otherwise, it will keep the best move.

chooseMoveWithRandomChance ::
  PlayerMove -> StdGen -> Float -> (PlayerMove, StdGen)
chooseMoveWithRandomChance bestMove gen randomChance =
  let (randVal, gen') = randomR (0.0, 1.0) gen
      (randomIndex, gen'') = randomR (0, 1) gen'
      randomMove = moveFromIndex randomIndex
  in  if randVal < randomChance
        then (randomMove, gen'')
        else (bestMove, gen')

Now it's a simple matter of applying this function, and we're all set!

runWorldIteration :: Model -> StateT (World, Float StdGen)
  (StateT ([Float], Int, Float) Session) Bool
runWorldIteration model = do
  (prevWorld, prevReward, gen) <- get
  (_, _, randomChance) <- lift get
  ...
  let bestMove = ...
  let (newMove, newGen) = chooseMoveWithRandomChance
                            bestMove gen randomChance
  …
  put (nextWorld, prevReward + newReward, newGen)
  continuationAction

Conclusion

When we test our bot, it has a bit more variety in its moves now, but it's still not succeeding. So what do we want to do about this? It's possible that something is wrong with our network or the algorithm. But it's difficult to reveal this when the problem space is difficult. After all, we're expecting this agent to navigate a complex maze AND avoid/stun enemies.

It might help to break this process down a bit. Next week, we'll start looking at simpler examples of mazes. We'll see if our current approach can be effective at navigating an empty grid. Then we'll see if we can take some of the weights we learned and use them as a starting point for harder problems. We'll try to navigate a true maze, and see if we get better weights. Then we'll look at an empty grid with enemies. And so on. This approach will make it more obvious if there are flaws with our machine learning method.

If you've never programmed in Haskell before, it might be a little hard to jump into machine learning. Check out our Beginners Checklist and our Liftoff Series to get started!