Learning to Navigate the Maze!


Previously in this series, we explored how we could adapt our very basic "Breadth First Search" game to be an OpenAI Gym "Environment". This week, we'll take the final step and learn what it means to make our environment into a "Learning Environment". Instead of prescribing how our agent moves through the maze, we'll let it learn this for itself using reinforcement learning!

I won't go over every line of code in this particular article, but you can take a look at the full code by checking out this GitHub repository! The code we'll be looking at is focused on the LearningEnvironment module and MazeLearner. If you'd rather watch the code in action than read about it, take a look at this YouTube video!

A Learning API

Let's recall that we had this function that served as our "game loop". That is, it could take any environment and run through the game's iterations until it finished.

gameLoop ::
  (EnvironmentMonad m) => m (Action m) -> m (Observation m, Reward)
gameLoop chooseAction = do
  newAction <- chooseAction
  (newObs, reward, done) <- stepEnv newAction
  if done
    then return (newObs, reward)
    else gameLoop chooseAction

We also had a comparable function for a "Renderable" environment that would render the game state with each iteration.
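For reference, a sketch of that rendering loop might look something like this, assuming a RenderableEnvironment class providing a renderEnv action (the exact names may differ from the earlier article):

-- A sketch of the rendering loop, assuming 'RenderableEnvironment'
-- provides 'renderEnv :: m ()' (names are assumptions here).
gameRenderLoop ::
  (EnvironmentMonad m, RenderableEnvironment m) =>
  m (Action m) -> m (Observation m, Reward)
gameRenderLoop chooseAction = do
  renderEnv
  newAction <- chooseAction
  (newObs, reward, done) <- stepEnv newAction
  if done
    then renderEnv >> return (newObs, reward)
    else gameRenderLoop chooseAction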

What would it look like, at a high level, for us to make a "learning" loop? That is, what API functions would we want to be available to cause our game agent to learn and improve from iteration to iteration?

I propose we would want at least three elements. First, the "choose action" function should now be an explicit part of the state, rather than a function parameter. Second, we naturally need a "learn" function that takes the observations and rewards and adjusts whatever state we use for choosing the action.

Finally, we should be able to reduce our "exploration rate". For many learning algorithms, we'll want to allow the agent a chance to "explore" options at first rather than rely on its own brain. This prevents it from getting stuck in bad habits early on. But we want to reduce the probability of random moves over time so that it can apply the information it has learned.

We'll also want to add an extra layer to our loop. We want to run many iterations of the game over time, rather than a single iteration. After a certain number of iterations, we'll reduce the exploration rate.

Here's a first pass at what these functions might look like. Notice how they rely on our previous environment functions like currentObservation, stepEnv, and resetEnv.

gameLearningLoop = do
  oldObs <- currentObservation
  newAction <- chooseActionBrain
  (newObs, reward, done) <- stepEnv newAction
  learnEnv oldObs newAction newObs reward 
  if done
    then return reward
    else gameLearningLoop

gameLearningIterations = forM [1..numEpisodes] $ \i -> do
  resetEnv
  when (i `mod` 100 == 99) $ do
    reduceExploration decayRate minEpsilon
  reward <- gameLearningLoop
  return reward
  where
    numEpisodes = 1000
    decayRate = 0.9
    minEpsilon = 0.01

Those parameters at the bottom could be inputs to our function or constants. Either way, we can see that this function collects the final reward value from each run of our game.

Making a Class

This idealized function informs some of the pieces we'll need for a "Learning Environment" class. What's clear, though, is that this class should "wrap" the monad for our environment. In this way, we don't need to modify our existing game's monad just to make it learn a particular way. So the first thing we'll do with this class is use an associated type to assign our environment monad. We'll also want a lift function that will take actions in the environment/game and bring them into the learning monad.

class (Monad m) => LearningEnvironment m where
  type Env m :: * -> *
  liftEnv :: (Env m) a -> m a
  ...

Notice how the "kind" is * -> * because our environment is a monad!

Naturally, we'll also want a "learning state" that is separate from the environment's state. This will store our exploration rate, among other things. We'll include functions for getting and setting this state. This is also a good opportunity to include our exploration functions. We should be able to "get" the exploration rate and then reduce it.

class (Monad m) => LearningEnvironment m where
  type Env m :: * -> *
  liftEnv :: (Env m) a -> m a
  type LearningState m :: *
  getLearningState :: m (LearningState m)
  putLearningState :: (LearningState m) -> m ()
  explorationRate :: m Double
  reduceExploration :: Double -> Double -> m ()
  ...

Finally, we reach our two critical functions, choosing an action and learning. Choosing an action will involve selecting an action corresponding to our environment. This is simple in concept, but the type signature gets a little odd:

class (Monad m) => LearningEnvironment m where
  ...
  chooseActionBrain ::
    (EnvironmentMonad (Env m)) => m (Action (Env m))

We have Env m which is our environment type, and then the Action is associated with that environment, hence Action (Env m). Plus, our environment is constrained by an EnvironmentMonad.

Now, finally, the learn function. This takes four parameters:

1. The "starting" observation
2. The action we took based on that observation
3. The "new" observation resulting from that action
4. The reward from taking that action

Then it will update the learning state, though it will not provide a return value.

class (Monad m) => LearningEnvironment m where
  ...
  learnEnv ::
    (EnvironmentMonad (Env m)) =>
    (Observation (Env m)) ->
    (Action (Env m)) ->
    (Observation (Env m)) ->
    Reward ->
    m ()

These definitions complete our class!

A Basic Implementation

As with the maze game itself, this code only runs once we create an instance of the class! So let's start by defining our learning state type. What information do we need to store that will help us select our moves and learn appropriately?

For this example, we're going to use a basic form of Q-Learning. In Q-Learning, we have a function that takes an observation and action and produces a score value. So in any given situation, our "move" is to select the action with the highest score. The rewards then let us calibrate how this function operates, gradually assigning higher scores to actions with higher rewards.

In the most basic form of Q-Learning, our function is a table where every combination of observation and action corresponds to a score. This approach doesn't scale to harder games with more options, but it helps illustrate the approach. So our learning state needs an array to represent this "Q-table".

It will also need to store the current exploration rate and a random generator, which will tell us when to make random moves (and which random move to select).

data MazeLearnerState = MazeLearnerState
  { qTable :: A.Array (Word, Word) Double
  , explorationR :: Double
  , randomGenerator :: StdGen
  }

Now our monadic type will be a state over both this "Learner State" and the "Maze Game State".

newtype MazeLearnerM a = MazeLearnerM
  (StateT (MazeLearnerState, MazeGameState) IO a)
  deriving (Functor, Applicative, Monad, MonadIO)

instance MonadState (MazeLearnerState, MazeGameState) MazeLearnerM where
  ...

Why does it need both? This becomes clear when we start writing the instance. To implement liftEnv, we'll "get" the game state, "run" the environment action with it, and then "put" the updated state back.

instance LearningEnvironment MazeLearnerM where
  type (Env MazeLearnerM) = MazeGameM
  liftEnv (MazeGameM action) = do
    (ln, gs) <- get
    (result, gs') <- liftIO $ runStateT action gs 
    put (ln, gs')
    return result

Of course, we'll also assign our learner state and the getter/setter combination.

instance LearningEnvironment MazeLearnerM where
  type (LearningState MazeLearnerM) = MazeLearnerState
  getLearningState = fst <$> get
  putLearningState ln' = do
    (_, gs) <- get
    put (ln', gs)
  ...

The rest of this definition is pretty simple boilerplate, except for choosing the action and learning. So let's see how to implement the Q-Learning approach with these.
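For reference, the exploration-related boilerplate might look something like this. This is only a sketch; the real module may differ slightly in its details:

instance LearningEnvironment MazeLearnerM where
  ...
  explorationRate = explorationR <$> getLearningState
  reduceExploration decayRate minEpsilon = do
    lnSt <- getLearningState
    -- Multiply by the decay rate, but never drop below the minimum epsilon.
    let newRate = max minEpsilon (explorationR lnSt * decayRate)
    putLearningState $ lnSt { explorationR = newRate }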

Q-Learning

To start, let's assume we have some helper functions. I'll list the type signatures without getting bogged down in the definitions. We need to convert back and forth between an Observation (which is a Location) and its index within our Q-table (a Word).

locationToIndex :: Location -> Grid -> Word

indexToLocation :: Word -> Grid -> Location
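If our Grid is a two-dimensional array indexed by Location, these conversions are just row-major arithmetic. Here's one possible sketch (the use of A.bounds here is my own assumption about the Grid type):

-- Possible implementations, assuming Location is (Word, Word) and
-- Grid is an array indexed by Location.
locationToIndex :: Location -> Grid -> Word
locationToIndex (row, col) grid = row * numColumns + col
  where
    (_, (_, maxCol)) = A.bounds grid
    numColumns = maxCol + 1

indexToLocation :: Word -> Grid -> Location
indexToLocation index grid = index `divMod` numColumns
  where
    (_, (_, maxCol)) = A.bounds grid
    numColumns = maxCol + 1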

We also need a maxScore function. This will take a location/observation index (so a Word) as well as the Q-table, and produce the maximum score we get from that observation, considering all the possible moves.

maxScore ::
  Word -> A.Array (Word, Word) Double -> (Double, (Word, Word))
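One possible implementation simply scans the five action entries for that observation row and keeps the best one. This is a sketch that assumes action indices run from 0 to 4:

-- A sketch, assuming action indices 0 through 4.
maxScore ::
  Word -> A.Array (Word, Word) Double -> (Double, (Word, Word))
maxScore obsIndex table = maximum
  [ (table A.! (obsIndex, actionIndex), (obsIndex, actionIndex))
  | actionIndex <- [0..4]
  ]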

Now when it comes to selecting an action, we have two main branches. We have to start by "rolling the dice" and determining if this will be a random/exploratory move, or a "brain" move with our Q-table.

chooseActionQTable :: MazeLearnerM Direction
chooseActionQTable = do
  lnSt <- getLearningState
  let (exploreRoll, gen') = randomR (0.0, 1.0) (randomGenerator lnSt)
  if exploreRoll < explorationR lnSt
    then ... -- Explore randomly
    else ... -- Use our Q-table

The random move is a matter of taking a second roll over our 5 action possibilities, updating the learning state with the new generator, and then returning the enum corresponding to the selected number.

chooseActionQTable :: MazeLearnerM Direction
chooseActionQTable = do
  lnSt <- getLearningState
  let (exploreRoll, gen') = randomR (0.0, 1.0) (randomGenerator lnSt)
  if exploreRoll < explorationR lnSt
    then do
      let (actionRoll, gen'') = randomR (0, 4) gen'
      putLearningState $ lnSt { randomGenerator = gen'' }
      return (toEnum actionRoll)
    else ...

Now to use our Q-table, we retrieve our environment, convert our location into an index, get the max score for that index, and again convert that to an enum (replacing the random generator again).

chooseActionQTable :: MazeLearnerM Direction
chooseActionQTable = do
  lnSt <- getLearningState
  let (exploreRoll, gen') = randomR (0.0, 1.0) (randomGenerator lnSt)
  if exploreRoll < explorationR lnSt
    then ...
    else do
      env <- liftEnv get
      let obsIndex = locationToIndex (playerLoc env) (gameGrid env)
      let maxIndex = snd $ snd $ maxScore obsIndex (qTable lnSt)
      putLearningState $ lnSt { randomGenerator = gen' }
      return (toEnum (fromIntegral maxIndex))

To improve this, we could use the set of possible actions from our underlying environment, rather than hardcoding the five action indices.

The Learn Function

Most of the logic for our learning function is straightforward. We retrieve our learning state and the game grid. We determine indices for the input observations and action so we can index into our Q-table.

learnQTable ::
  Location -> Direction -> Location -> Reward -> MazeLearnerM ()
learnQTable loc1 direction loc2 (Reward reward) = do
  lnSt <- getLearningState
  let q = qTable lnSt
  grid <- gameGrid <$> liftEnv get
  let actionIndex = fromIntegral . fromEnum $ direction
      observationIndex1 = locationToIndex loc1 grid
      observationIndex2 = locationToIndex loc2 grid
      ...

Now for the core steps. First, we get the predicted score value from the Q-table. Then we determine the "target" score value. This is based on the actual reward we got and the best score we can get from our new location. This second piece allows us to "propagate" rewards from the end of the maze back to intermediate stages.

We determine a new value to place in the Q-table which comes from this difference, modified by the learning rate. And finally, we place this new value in our Q-table and update the learning state.

learnQTable ::
  Location -> Direction -> Location -> Reward -> MazeLearnerM ()
learnQTable loc1 direction loc2 (Reward reward) = do
  lnSt <- getLearningState
  let q = qTable lnSt
  grid <- gameGrid <$> liftEnv get
  let actionIndex = fromIntegral . fromEnum $ direction
      observationIndex1 = locationToIndex loc1 grid
      observationIndex2 = locationToIndex loc2 grid
      prediction = q A.! (observationIndex1, actionIndex)
      target = reward + gamma * (fst $ maxScore observationIndex2 q)
      newValue = prediction + learningRate * (target - prediction)
      newQ = q A.// [((observationIndex1, actionIndex), newValue)]
  putLearningState $ lnSt { qTable = newQ }
  where
    gamma = 0.96
    learningRate = 0.81

As an improvement, we could also make "gamma" and the learning rate part of our state and change them over time.

Evaluating Our Game

So what does it look like to run this? Well, our game loop functions from above will work, but it will help us to also keep track of how many moves are needed to win AND what the cumulative reward is (rather than just the final reward).

We can now also include the (rather complicated) type signatures and other modifications we need to work with our class.

gameLearningLoop ::
  (LearningEnvironment m, EnvironmentMonad (Env m)) =>
  (Int, Reward) -> m (Int, Reward)
gameLearningLoop (i, oldReward) = do
  oldObs <- liftEnv currentObservation
  newAction <- chooseActionBrain
  (newObs, reward, done) <- liftEnv $ stepEnv newAction
  learnEnv oldObs newAction newObs reward
  let newReward = oldReward + reward
  if done
    then return (i, newReward)
    else gameLearningLoop (i + 1, newReward)

gameLearningIterations ::
  (LearningEnvironment m, EnvironmentMonad (Env m)) =>
  m [(Int, Reward)]
gameLearningIterations = forM [1..numEpisodes] $ \i -> do
  liftEnv resetEnv
  when (i `mod` 100 == 99) $ do
    reduceExploration decayRate minEpsilon
  (count, reward) <- gameLearningLoop (0, Reward 0.0)
  return (count, reward)
  where
    numEpisodes = 1000
    decayRate = 0.9
    minEpsilon = 0.01

And last but not least, a bit of code to run this loop with a starting environment. We'll return the rewards and results from the first 10 runs, as well as the last 10 runs.

runLearningWithBase :: IO ([(Int, Reward)], [(Int, Reward)])
runLearningWithBase = do
  gen <- getStdGen
  let lnSt = MazeLearnerState
               (A.listArray ((0, 0), (15, 4)) (repeat 0.0))
               0.9
               gen
  results <- evalStateT
    (runMazeLearner gameLearningIterations)
     (lnSt, baseEnvironment)
  return (take 10 results, drop (length results - 10) results)

runMazeLearner ::
  MazeLearnerM a -> StateT (MazeLearnerState, MazeGameState) IO a
runMazeLearner (MazeLearnerM action) = action
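To try this out, all we need is a small entry point. This main is hypothetical, not part of the repository, but it shows how the results could be printed:

-- Hypothetical entry point for inspecting the results.
main :: IO ()
main = do
  (firstRuns, lastRuns) <- runLearningWithBase
  putStrLn "First 10 episodes (moves, reward):"
  mapM_ print firstRuns
  putStrLn "Last 10 episodes (moves, reward):"
  mapM_ print lastRuns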

Results!

With a few tweaks to our reward system, we can get some good results. First, we'll have a score of 50.0 for reaching the goal. Then a score of -1.0 for making an illegal move, as well as a score of -0.1 for making normal moves, to encourage faster progress.
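In the environment code, that tweak amounts to picking the right Reward inside stepEnv. A sketch of the logic might look like this (moveReward, isLegalMove, and reachedGoal are hypothetical names, not the actual helpers in the repository):

-- Hypothetical helper capturing the reward scheme described above.
moveReward :: Bool -> Bool -> Reward
moveReward isLegalMove reachedGoal
  | reachedGoal     = Reward 50.0    -- reaching the goal
  | not isLegalMove = Reward (-1.0)  -- penalty for an illegal move
  | otherwise       = Reward (-0.1)  -- small cost per step to encourage speed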

In our first set of runs, the agent takes much longer to finish, often requiring 30-50 moves to reach the goal. One example takes 175 moves!

[
  (46,Reward 30.1),
  (26,Reward 40.2),
  (39,Reward 37.1),
  (45,Reward 31.1),
  (51,Reward 29.6),
  (45,Reward 30.2),
  (175,Reward (-17.0)),
  (59,Reward 26.1),
  (56,Reward 26.4),
  (30,Reward 34.4)
]

Then in the latter set, we can see that single-digit move counts are common (5 is optimal). Scores are much closer to 50, with fewer illegal moves made. Some will still occur, though, since the exploration rate is always non-zero.

[
  (6,Reward 48.5),
  (11,Reward 48.0),
  (7,Reward 47.5),
  (5,Reward 49.5),
  (6,Reward 49.4),
  (8,Reward 49.2),
  (13,Reward 46.0),
  (13,Reward 46.0),
  (7,Reward 48.4),
  (10,Reward 48.1)
]

Haskell Brain

A lot of more precise learning algorithms for harder problems will require you to use more advanced tools like TensorFlow. Lucky for you, our Haskell Brain course is now open for enrollment! This course will teach you how to use the Haskell TensorFlow bindings to write simple machine learning programs in Haskell! So if you've always wanted to do this kind of AI-related work in Haskell, but didn't think the language had the tools, now is your chance to learn how to use one of the most important libraries in this field. So sign up today!
