# Tweaks, Fixes, and Some Results

In last week's episode of this AI series, we added random exploration to our algorithm. This helped us escape certain "traps" and local minima in the model that could keep us rooted in bad spots. But it still didn't improve results much.

This week we'll explore a couple more ways we can fix and improve our algorithm. For the first time, we'll see some positive outcomes. Still, we'll find our approach isn't great yet.

To get started with Tensor Flow and Haskell, download our guide! It's a complex process so you'll want some help! You should also check out our Haskell AI Series to learn more about why Haskell is a good choice as an AI language!

## Improvements

To start out, there are a few improvements we can make to how we do q-learning. Let's recall the basic outline of running a world iteration. There are three steps. We get our "new" move from the "input" world. Then we apply that move, and get our "next" move against the "next" world. Then we use the possible reward to create our target actions, and use that to train our model.

``````runWorldIteration model = do
  (prevWorld, _, _) <- get

  -- Get the next move on the current world (with random chance)
  let inputWorldVector = … -- vectorize prevWorld
  currentMoveWeights <- lift $ lift $
    (iterateWorldStep model) inputWorldVector
  let bestMove = moveFromOutput currentMoveWeights
  let newMove = chooseRandomMoveWithChance …

  -- Get the next world using this move, and produce our next move
  let nextWorld = stepWorld newMove prevWorld
  let nextWorldVector = vectorizeWorld nextWorld
  nextMoveVector <- lift $ lift $
    (iterateWorldStep model) nextWorldVector

  -- Use these to get "target action values" and use them to train!
  let (bestNextMoveIndex, maxScore) =
        (V.maxIndex nextMoveVector, V.maximum nextMoveVector)
  let targetActionData = encodeTensorData (Shape [10, 1]) $
        nextMoveVector V.//
        [(bestNextMoveIndex, newReward + maxScore)]
  lift $ lift $ (trainStep model) nextWorldVector targetActionData``````

There are a couple issues here. First, we want to substitute based on the first new move, not the later move. We want to learn from the move we are taking now, since we assess its result now. Thus we want to substitute for that index. We'll re-write our randomizer to account for this and return the index it chooses.
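
We won't show the full randomizer here, but as a rough sketch, an epsilon-greedy chooser that returns the index might look like the following. The name, signature, and the way randomness is threaded are illustrative assumptions, not the exact code from the repository.

``````import qualified Data.Vector as V
import System.Random (StdGen, randomR)

-- Illustrative only: with probability epsilon pick a random move index,
-- otherwise keep the index of the highest-scoring move. Returning the
-- index lets the caller substitute the target value at the right spot.
chooseMoveIndexSketch :: StdGen -> Float -> V.Vector Float -> (Int, StdGen)
chooseMoveIndexSketch gen epsilon moveWeights =
  let (roll, gen') = randomR (0.0, 1.0) gen
  in  if roll < epsilon
        then randomR (0, V.length moveWeights - 1) gen'
        else (V.maxIndex moveWeights, gen')``````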

Next, when training our model, we should use the original world instead of the next world. That is, we want `inputWorldVector` instead of `nextWorldVector`. Our logic is this: we get our "future" action, which accounts for the game's reward. We want our current action on this world to become more like that future action. Here's what the changes look like:

``````runWorldIteration model = do
  (prevWorld, _, _) <- get

  -- Get the next move on the current world (with random chance)
  let inputWorldVector = … -- vectorize prevWorld
  currentMoveWeights <- lift $ lift $
    (iterateWorldStep model) inputWorldVector
  let bestMove = moveFromOutput currentMoveWeights
  let (newMove, newMoveIndex) = chooseRandomMoveWithChance …

  -- Get the next world using this move, and produce our next move
  let nextWorld = stepWorld newMove prevWorld
  let nextWorldVector = vectorizeWorld nextWorld
  nextMoveVector <- lift $ lift $
    (iterateWorldStep model) nextWorldVector

  -- Use these to get "target action values" and use them to train!
  let maxScore = V.maximum nextMoveVector
  let targetActionData = encodeTensorData (Shape [10, 1]) $
        nextMoveVector V.//
        [(newMoveIndex, newReward + maxScore)]
  lift $ lift $ (trainStep model) inputWorldVector targetActionData``````

Another change we can make is to provide some rewards based on whether the selected move was legal or not. To do this, we'll need to update the `stepWorld` game API to return this boolean value:

``stepWorld :: PlayerMove -> World -> (World, Bool)``

Then we can add a small amount (0.01) to our reward value if we get a legal move, and subtract this otherwise.

As a last flourish, we should also add a timeout condition. Our next step will be to test on simple mazes that have no enemies. This means we'll never get eaten, so we need some loss condition if we get stuck. This timeout condition should have the same negative reward as losing.
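
Here's a rough sketch of how the reward for a single step could combine these pieces. The game result constructors come from our game code; the 0.01 bonus and the timeout threshold are just the illustrative values mentioned above, not the exact implementation.

``````-- Illustrative sketch: fold the legality bonus and a timeout check into
-- the reward for one step. GameWon/GameLost/GameInProgress are the game's
-- result constructors; maxSteps is a hypothetical cutoff value.
stepReward result wasLegalMove stepsTaken = case result of
  GameWon -> 1.0 + legalityBonus
  GameLost -> (-1.0) + legalityBonus
  GameInProgress -> if stepsTaken > maxSteps
    then (-1.0) + legalityBonus -- timing out is as bad as losing
    else legalityBonus
  where
    legalityBonus = if wasLegalMove then 0.01 else (-0.01)
    maxSteps = 100 :: Int``````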

## Results

Now that we've made some improvements, we'll train on a very basic maze that's only 5x5 and has no walls and no enemies. Whereas we used to struggle to even finish this maze, we now achieve the goal a fair amount of the time. One of our training iterations achieved the goal around 2/3 of the time.

However, our bot is still useless against enemies! It loses every time if we try to train from scratch on a map with a single enemy. One attempt to circumvent this is to first train our weights to solve the empty maze. Then we can start with these weights as we attempt to avoid the enemy. That way, we have some pre-existing knowledge, and we don't have to learn everything at once. Still though, it doesn't result in much improvement. Typical runs only succeeded 40-50 times out of 2000 iterations.

## Limiting Features

One conclusion we can draw is that we actually have too many features! Our intuition is that a larger feature set would take more iterations to learn. If the features aren't chosen carefully, they'll introduce noise.

So instead of tracking 8 features for each possible direction of movement, let's stick with 3. We'll see if the enemy is on the location, check the distance to the end, and count the number of nearby enemies. When we do this, we get comparable results on the empty maze. But when it comes to avoiding enemies, we do a little better, surviving 150-250 iterations out of 2000. These statistics are all very rough, of course. If we wanted a more thorough analysis, we'd use multiple maze configurations and a lot more runs using the finalized weights.
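
As a concrete (if simplified) picture of the cut, here's what a reduced vectorizer could look like if we reuse three of the accessors from our original `vectorizeWorld`. The real cut-down feature set may differ in detail, so treat this as a sketch.

``````-- A minimal sketch: the same idea as vectorizeWorld, but keeping only
-- three of the feature accessors we already defined.
vectorizeWorldReduced :: World -> Vector Float
vectorizeWorldReduced w = fromList (fromIntegral <$>
  [ wfOnActiveEnemy features
  , wfShortestPathLength features
  , wfNumNearbyEnemies features
  ])
  where
    features = produceWorldFeatures w``````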

## Conclusions

We can't draw too many conclusions from this yet. Our model is still failing to solve simple versions of our problem. It's quite possible that our model is too simplistic. After all, all we're doing is a simple matrix multiplication on our features. In theory, this should be able to solve the problem, but it may take a lot more iterations. The stream of results we see also suggests local minima are a big problem. Logging reveals that we often die in the same spot in the maze many times in a row. The negative rewards aren't enough to draw us out, and we often rely on random moves to find better outcomes.

So next week we're going to start changing our approach. We'll explore a way to introduce supervised learning into our process. This depends on "correct" data. We'll try a couple different ways to get that data. We'll use our own "human" input, as well as the good AI we've written in the past to solve this problem. All we need is a way to record the moves we make! So stay tuned!

# Running Training Iterations

In our last article we built a simple Tensor Flow model to perform Q-Learning on our brain. This week, we'll build out the rest of the code we need to run iterations on this model. This will train it to perform better and make more intelligent decisions.

The machine learning code for this project is in a separate repository from the game code. Check out MazeLearner to follow along. Everything for this article is on the `basic-trainer` branch.

## Iterating on the Model

First let's recall what our Tensor Flow model looks like:

``````data Model = Model
  { weightsT :: Variable Float
  , iterateWorldStep :: TensorData Float -> Session (Vector Float)
  , trainStep :: TensorData Float -> TensorData Float -> Session ()
  }``````

We need to think about how we're going to use the last two functions of it. We want to iterate on and make updates to the weights. Across the different iterations, there's certain information we need to track. The first value we'll track is the list of "rewards" from each iteration (this will be more clear in the next section). Then we'll also track the number of wins we get in the iteration.

To track these, we'll use the `State` monad, run on top of the `Session`.

``runAllIterations :: Model -> World -> StateT ([Float], Int) Session ()``

We'll also want a function to run a single iteration. This, in turn, will have its own state information. It will track the `World` state of the game it's playing. It will also track the sum of the accumulated reward values from the moves in that game. Since we'll run it from our function above, it will have a nested `StateT` type. It will ultimately return a boolean value indicating if we have won the game. We'll define the details in the next section:

``````runWorldIteration :: Model ->
StateT (World, Float) (StateT ([Float], Int) Session) Bool``````

We can now start by filling out our function for running all the iterations. Supposing we'll perform 2000 iterations, we'll make a loop for each iteration. We can start each loop by running the world iteration function on the current model.

``````runAllIterations :: Model -> World -> StateT ([Float], Int) Session ()
runAllIterations model initialWorld = do
  let numIterations = 2000
  void $ forM [1..numIterations] $ \i -> do
    (wonGame, (_, finalReward)) <-
      runStateT (runWorldIteration model) (initialWorld, 0.0)
    ...``````

And now the rest is a simple matter of using the results to update the existing state:

``````runAllIterations :: Model -> World -> StateT ([Float], Int) Session ()
runAllIterations model initialWorld = do
  let numIterations = 2000
  void $ forM [1..numIterations] $ \i -> do
    (wonGame, (_, finalReward)) <-
      runStateT (runWorldIteration model) (initialWorld, 0.0)
    (prevRewards, prevWinCount) <- get
    let newRewards = finalReward : prevRewards
    let newWinCount = if wonGame
          then prevWinCount + 1
          else prevWinCount
    put (newRewards, newWinCount)``````

## Running a Single Iteration

Now let's delve into the process of a single iteration. Broadly speaking, we have four goals.

1. Take the current world and serialize it. Pass it through the `iterateStep` to get the move our model would make in this world.
2. Apply this move, getting the "next" world state.
3. Determine the scores for our moves in this next world. Apply the given reward as the score for the best of these moves.
4. Use this result to compare against our original moves. Feed it into the training step and update our weights.

Let's start with steps 1 and 2. We'll get the vector representation of the current world. Then we need to encode it as `TensorData` so we can pass it to an input feed. Next we run our model's iterate step and get our output move. Then we can use that to advance the world state using `stepWorld` and `updateEnvironment`.

``````runWorldIteration
  :: Model
  -> StateT (World, Float) (StateT ([Float], Int) Session) Bool
runWorldIteration model = do
  -- Serialize the world
  (prevWorld :: World, prevReward) <- get
  let (inputWorldVector :: TensorData Float) =
        encodeTensorData (Shape [1, 8]) (vectorizeWorld prevWorld)

  -- Run our model to get the output vector and de-serialize it
  -- Lift twice to get into the Session monad
  (currentMove :: Vector Float) <- lift $ lift $
    (iterateWorldStep model) inputWorldVector
  let newMove = moveFromOutput currentMove

  -- Get the next world state
  let nextWorld = updateEnvironment (stepWorld newMove prevWorld)``````

Now we need to perform the Q-Learning step. We'll start by repeating the process in our new world state and getting the next vector of move scores:

``````runWorldIteration model = do
  ...
  let nextWorld = updateEnvironment (stepWorld newMove prevWorld)

  let nextWorldVector =
        encodeTensorData (Shape [1, 8]) (vectorizeWorld nextWorld)

  (nextMoveVector :: Vector Float) <- lift $ lift $
    (iterateWorldStep model) nextWorldVector
  ...``````

Now it gets a little tricky. We want to examine if the game is over after our last move. If we won, we'll get a reward of 1.0. If we lost, we'll get a reward of -1.0. Otherwise, there's no reward. While we figure out this reward value, we can also determine our final monadic action. We'll either return a boolean value if the game is over, or recursively iterate again:

``````runWorldIteration model = do
  ...
  let nextWorld = ...
  (nextMoveVector :: Vector Float) <- ...
  let (newReward, continuationAction) = case worldResult nextWorld of
        GameInProgress -> (0.0, runWorldIteration model)
        GameWon -> (1.0, return True)
        GameLost -> (-1.0, return False)
  ...``````

Now we'll look at the vector for our next move and replace one of its values. We'll find the maximum score, and replace it with a value that factors in the actual reward we get from the game. This is how we insert "truth" into our training process and how we'll actually learn good reward values.

``````import qualified Data.Vector as V

runWorldIteration model = do
  ...
  let nextWorld = ...
  (nextMoveVector :: Vector Float) <- ...
  let (newReward, continuationAction) = ...
  let (bestNextMoveIndex, maxScore) =
        (V.maxIndex nextMoveVector, V.maximum nextMoveVector)
  let (targetActionValues :: Vector Float) = nextMoveVector V.//
        [(bestNextMoveIndex, newReward + (0.99 * maxScore))]
  let targetActionData =
        encodeTensorData (Shape [10, 1]) targetActionValues
  ...``````

Then we'll encode this new vector as the second input to our training step. We'll still use the `nextWorldVector` as the first input. We conclude by updating our state variables to have their new values. Then we run the continuation action we got earlier.

``````runWorldIteration model = do
  ...
  let nextWorld = ...
  (nextMoveVector :: Vector Float) <- ...
  let targetActionData = ...

  -- Run training to alter the weights
  lift $ lift $ (trainStep model) nextWorldVector targetActionData
  put (nextWorld, prevReward + newReward)
  continuationAction``````

## Tying It Together

Now to make this code run, we need a little bit of code to tie it together. We'll make a `Session` action to train our game. It will output the final weights of our model.

``````trainGame :: World -> Session (Vector Float)
trainGame w = do
  model <- buildModel
  (finalReward, finalWinCount) <-
    execStateT (runAllIterations model w) ([], 0)
  run (readValue $ weightsT model)``````

Then we can run this from `IO` using `runSession`.

``````playGameTraining :: World -> IO (Vector Float)
playGameTraining w = runSession (trainGame w)``````

Last of all, we can run this on any `World` we like by first loading it from a file. For our first examples, we'll use a smaller 10x10 grid with 2 enemies and 1 drill powerup.

``````main :: IO ()
main = do
  world <- loadWorldFromFile "training_games/maze_grid_10_10_2_1.game"
  finalWeights <- playGameTraining world
  print finalWeights``````

## Conclusion

We've now got the basics down for making our Tensor Flow program work. Come back next week where we'll take a more careful look at how it's performing. We'll see if the AI from this process is actually any good or if there are tweaks we need to make to the learning process.

And make sure to download our Haskell Tensor Flow Guide! This library is difficult to use. There are a lot of secondary dependencies for it. So don't go in trying to use it blind!

# Making a Learning Model

Last week we took a few more steps towards using machine learning to improve the player AI for our maze game. We saw how to vectorize the input and output data for our world state and moves. This week, we'll finally start seeing how to use these in the larger context of a Tensor Flow program. We'll make a model for a super basic neural network that will apply the technique of Q-Learning.

Our machine learning code will live in a separate repository from the primary game code. Be sure to check that out here! The first couple weeks of this part of the series will use the `basic-trainer` branch.

## Model Basics

This week's order of business will be to build a Tensor Flow graph that can make decisions in our maze game. The graph should take a serialized world state as an input, and then produce a distribution of scores. These scores correspond to the different moves we can make.

Recalling from last week, the input to our model will be a 1x8 vector, and the output will be a 10x1 vector. For now then, we'll represent our model with a single variable tensor that will be a matrix of size 8x10. We'll get the output by multiplying the inputs by the weights.

Ultimately, there are three things we need to access from this model.

1. The final weights
2. A step to iterate the world
3. A step to train our model and adjust the weights.

Here's what the model looks like, using Tensor Flow types:

``````data Model = Model
  { weightsT :: Variable Float
  , iterateWorldStep :: TensorData Float -> Session (Vector Float)
  , trainStep :: TensorData Float -> TensorData Float -> Session ()
  }``````

The first element is the variable tensor for our weights. We need to expose this so we can output them at the end. The second element is a function that will take in a serialized world state and produce the output move. Then the third element will take both a serialized world state AND some expected values. It will update the variable tensor as part of the Q-Learning process. Next week, we'll write iteration functions in the `Session` monad. They'll use these two elements.

## Building the Iterate Step

To make these Tensor Flow items, we'll also need to use the `Session` monad. Let's start a basic function to build up our model:

``````buildModel :: Session Model
buildModel = do
...``````

To start, let's make a variable for our weights. At the start, we'll randomize them with `truncatedNormal` and then make that into a `Variable`:

``````buildModel :: Session Model
buildModel = do
(initialWeights :: Tensor Value Float) <-
truncatedNormal (vector [8, 10])
(weights :: Variable Float) <- initializedVariable initialWeights``````

Now let's build the items for running our iterate step. This first involves taking the inputs as a placeholder. Remember, the inputs come from the vectorization of the world state.

Then to produce our output, we'll multiply the inputs by our weights. The result is a `Build` tensor, so we need to `render` it to use it in the next part. As an extra note, we need `readValue` to turn our `Variable` into a `Tensor` we can use in operations.

``````buildModel :: Session Model
buildModel = do
(initialWeights :: Tensor Value Float) <-
truncatedNormal (vector [8, 10])
(weights :: Variable Float) <- initializedVariable initialWeights
(inputs :: Tensor Value Float) <- placeholder (Shape [1,8])
let (allOutputs :: Tensor Build Float) =
inputs `matMul` (readValue weights)
returnedOutputs <- render allOutputs
...``````

The next part is to create a step to "run" the outputs. Since the outputs depend on a placeholder, we need to create a feed for the input. Then we can create a runnable `Session` action with `runWithFeeds`. This gives us the second element of our `Model`, the `iterateStep`.

``````buildModel :: Session Model
buildModel = do
...
let iterateStep = \inputFeed ->
runWithFeeds [feed inputs inputFeed] returnedOutputs
...``````

## Using Q-Learning in the Model

This gives us what we need to run our basic AI and make moves in the game. But we still need to apply some learning mechanism to update the weights!

We want to use Q-Learning. This means we'll compare the output of our model with the next output from continuing to step through the world. So first let's introduce another placeholder for these new outputs:

``````buildModel :: Session Model
buildModel = do
initialWeights <- ...
weights <- ...
inputs <- ...
returnedOutputs <- ...

let iterateStep = ...

-- Next set of outputs
(nextOutputs :: Tensor Value Float) <- placeholder (Shape [10, 1])
...``````

Now we'll define our "loss" function. That is, we'll find the squared difference between our real output and the "next" output. Next week we'll see that the "next" output uses extra information about the game. This will allow us to bring an element of "truth" that we can learn from.

``````buildModel :: Session Model
buildModel = do
...
returnedOutputs <- ...

-- Q-Learning Section
(nextOutputs :: Tensor Value Float) <- placeholder (Shape [10, 1])
let (diff :: Tensor Build Float) = nextOutputs `sub` allOutputs
let (loss :: Tensor Build Float) = reduceSum (diff `mul` diff)
...``````

Now, we'll make a final `ControlNode` using `minimizeWith`. This will minimize the loss function using the `adam` optimizer. We'll pass `weights` in the list of variables, since this is the variable we're trying to update.

``````buildModel :: Session Model
buildModel = do
...
returnedOutputs <- ...

-- Q-Learning Section
(nextOutputs :: Tensor Value Float) <- placeholder (Shape [10, 1])
let (diff :: Tensor Build Float) = nextOutputs `sub` allOutputs
let (loss :: Tensor Build Float) = reduceSum (diff `mul` diff)
(trainer_ :: ControlNode) <- minimizeWith adam loss [weights]``````

Finally, we'll make our training step, which will run the training node on two input feeds: one for the world input, and one for the expected output. Then we can return our completed model.

``````buildModel :: Session Model
buildModel = do
  ...
  inputs <- ...
  weights <- ...
  returnedOutputs <- ...
  let iterateStep = ...

  -- Q-Learning Section
  (nextOutputs :: Tensor Value Float) <- placeholder (Shape [10, 1])
  let diff = ...
  let loss = ...
  (trainer_ :: ControlNode) <- minimizeWith adam loss [weights]
  let trainingStep = \inputFeed nextOutputFeed -> runWithFeeds
        [ feed inputs inputFeed
        , feed nextOutputs nextOutputFeed
        ]
        trainer_
  return $ Model
    weights
    iterateStep
    trainingStep``````

## Conclusion

Now we've got our machine learning model. We have different functions that can iterate on our world state as well as train the outputs of our graph. Next week, we'll see how to combine these steps within the `Session` monad. Then we can start running training iterations and produce results.

If you want to follow along with these code examples, make sure to download our Haskell Tensor Flow Guide! This library is quite tricky to use. There are a lot of secondary dependencies for it. So you won't want to go in trying to use it blind!

# Q-Learning Primer

This week, we're going to take the machine learning process in a different direction than I expected. In the last couple weeks, we've built a simple evaluation function for our world state. We could learn this function using an approach called Temporal Difference learning. We might come back to this approach at some point. But for now, we're actually going to try something a little different.

Instead, we're going to focus on a technique called Q-Learning. Instead of an evaluation function for the world, we're going to learn `makePlayerMove`. We'll keep most of the same function structure. We're still going to take our world and turn it into a feature space that we can represent as a numeric vector. But instead of producing a single output, we'll give a score for every move from that position. This week, we'll take the basic steps to ready our game for this approach.

As always, check out the Github repository for this project. This week's code is on the `q-learning` branch!

Next week, we'll finally get into some Tensor Flow code. Make sure you're ready for it by reading up on our Tensor Flow Guide!

## Vectorizing Inputs

To learn a function, we need to be able to represent both the inputs and the outputs of our system as numeric vectors. We've already done most of the work here. Let's recall our `evaluateWorld` function. We'll keep the same feature values. But now we'll wrap them, instead of applying scores immediately:

``````data WorldFeatures = WorldFeatures
{ onActiveEnemy :: Int
, shortestPathLength :: Int
, manhattanDistance :: Int
, enemiesOnPath :: Int
, nearestEnemyDistance :: Int
, numNearbyEnemies :: Int
, stunAvailable :: Int
, drillsRemaining :: Int
}

produceWorldFeatures :: World -> WorldFeatures
produceWorldFeatures w = WorldFeatures
(if onActiveEnemy then 1 else 0)
shortestPathLength
manhattanDistance
enemiesOnPath
nearestEnemyDistance
numNearbyEnemies
(if stunAvailable then 1 else 0)
(fromIntegral drillsRemaining)
where
-- Calculated as before
onActiveEnemy = ...
enemiesOnPath = ...
shortestPathLength = ...
nearestEnemyDistance = ...
manhattanDistance = ...
stunAvailable = ...
numNearbyEnemies = ...
drillsRemaining = ...``````

Now, in our ML code, we'll want to convert this into a vector. Using a vector will enable us to encode this information as a tensor.

``````vectorizeWorld :: World -> Vector Float
vectorizeWorld w = fromList (fromIntegral <$>
  [ wfOnActiveEnemy features
  , wfShortestPathLength features
  , wfManhattanDistance features
  , wfEnemiesOnPath features
  , wfNearestEnemyDistance features
  , wfNumNearbyEnemies features
  , wfStunAvailable features
  , wfDrillsRemaining features
  ])
  where
    features = produceWorldFeatures w``````

## Vectorizing Outputs

Now we have the inputs to our tensor system. We'll ultimately get a vector of outputs as a result. We want this vector to provide a score for every move. We have 10 potential moves in general. There are five "movement" directions: moving up, right, down, left, and standing still. Then for each direction, we can either use our stun or not. We'll use a drill when the movement direction sends us against a wall. Certain moves won't be available in certain situations. But our output size should account for them all.

Our function will often propose invalid moves. For example, it might suggest using the stun while on cooldown, or drilling when we don't have one. In these cases, our game logic should dictate that our player doesn't move. Hopefully this trains our network to make correct moves. If we wanted to, we could even apply a slight negative reward for these.

What we need is the ability to convert a vector of outputs into a move. Once we fix the vector size, this is not difficult. As a slight hack, this function will always give the same direction for moving and drilling. We'll let the game logic determine if the drill needs to apply.

``````moveFromOutput :: Vector Int -> PlayerMove
moveFromOutput vals = PlayerMove moveDirection useStun moveDirection
where
bestMoveIndex = maxIndex vals
moveDirection = case bestMoveIndex `mod` 5 of
0 -> DirectionUp
1 -> DirectionRight
2 -> DirectionDown
3 -> DirectionLeft
4 -> DirectionNone
useStun = bestMoveIndex > 4``````
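
To make the indexing concrete, here's a small usage example (the scores are made up):

``````-- Index 7 has the highest score: 7 `mod` 5 = 2 gives DirectionDown,
-- and 7 > 4 means we also activate the stun.
exampleMove :: PlayerMove
exampleMove = moveFromOutput (fromList [0, 0, 0, 0, 0, 0, 0, 9, 0, 0])
-- exampleMove == PlayerMove DirectionDown True DirectionDown``````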

Now that we can get numeric vectors for everything, we need to be able to step through the world, one player move at a time. We currently have a generic `update` function. Depending on the time, it might step the player forward, or it might not. We want to change this so there are two steps. First we receive a player's move, and then we step the world forward until it is time for the next player move:

``stepWorld :: PlayerMove -> World -> World``

This isn't too difficult; it just requires a little shuffling of our existing code. First, we'll add a new `applyPlayerMove'` function. This will take the existing `applyPlayerMove` and add a little bit of validation to it:

``````applyPlayerMove' :: PlayerMove -> World -> World
applyPlayerMove' move w = if isValidMove
then worldAfterMove
else w
where
player = worldPlayer w
currentLoc = playerLocation player

worldAfterDrill = modifyWorldForPlayerDrill w
(drillDirection move)

worldAfterStun = if activateStun move
then modifyWorldForStun worldAfterDrill
else worldAfterDrill

newLocation = nextLocationForMove
(worldBoundaries worldAfterDrill Array.! currentLoc)
currentLoc
(playerMoveDirection move)

isValidStunUse = if activateStun move
then playerCurrentStunDelay player == 0
else True
isValidMovement = playerMoveDirection move == DirectionNone ||
newLocation /= currentLoc
isValidMove = isValidStunUse && isValidMovement

worldAfterMove =
modifyWorldForPlayerMove worldAfterStun newLocation``````

Now we'll add an `updateEnvironment` function. This will perform all the work of our `updateFunc` except for moving the player.

``````updateEnvironment :: World -> World
updateEnvironment w
  | playerLocation player == endLocation w =
      w { worldResult = GameWon }
  | playerLocation player `elem` activeEnemyLocations =
      w { worldResult = GameLost }
  | otherwise =
      updateWorldForEnemyTicks .
      updateWorldForPlayerTick .
      updateWorldForEnemyMoves .
      clearStunCells .
      incrementWorldTime $ w
  where
    player = worldPlayer w
    activeEnemyLocations = enemyLocation <$>
      filter (\e -> enemyCurrentStunTimer e == 0) (worldEnemies w)``````

Now we combine these. First we'll make the player's move. Then we'll update the environment once for each tick of the player's "lag" time.

``````stepWorld :: PlayerMove -> World -> World
stepWorld move w = execState (sequence updateActions) worldAfterMove
  where
    worldAfterMove = applyPlayerMove' move w
    updateActions = replicate
      ( fromIntegral .
        lagTime .
        playerGameParameters .
        worldParameters $ w)
      (modify updateEnvironment)``````

And these are all the modifications we'll need to get going!

## Q Learning Teaser

Now we can start thinking about the actual machine learning process. We'll get into a lot more detail next week. But for now, let's think about a particular training iteration. We'll want to use our existing network to step forward into the game. This will produce a certain "reward", and leave the game state in a new position. Then we'll get more values for our next moves out of that position. We'll use the updated move scores and the reward to learn better values for our function weights.
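
In other words, the value we'll train toward for the move we just took is the standard Q-Learning target. Here it is as a small sketch (this isn't our Tensor Flow code yet, just the idea):

``````-- The score we train toward: the immediate reward plus a discounted
-- estimate of the best score available from the resulting position
-- (gamma is the discount factor).
qTarget :: Float -> Float -> [Float] -> Float
qTarget gamma reward nextMoveScores = reward + gamma * maximum nextMoveScores``````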

Of course, the immediate "reward" values for most moves will be 0. The only moves that will carry a reward will be those where we either win the game or lose the game. So it could take a while for our program to learn good behaviors. It will take time for the "end" behaviors of the game to affect normal moves. For this reason, we'll start our training on much smaller mazes than the primary game. This should help speed up the training process.

## Conclusion

Next week, we'll take our general framework for Q-Learning and apply it within Tensor Flow. We'll get the basics of Q-Learning down with a couple different types of models. For a wider perspective on Haskell and AI problems, make sure to check out our Haskell AI Series!

# Adding Features for Better Behavior

Last week we started exploring the idea of an AI built on an evaluation function. This has the potential to allow us to avoid a lot of the hand-crafting that comes with AI design. The old way specified all the rules for the AI to follow. In the new approach, we create a mathematical function to evaluate a game position. Then we can look at all our possible moves and select the one with the best result. We could, if we wanted to, turn the input to our evaluation function into a vector of numbers. And its output is also a number. This property will help us when we eventually try to machine learn this function.

We made a rudimentary version of this function last week. Even before turning to machine learning, there are a couple ways to improve our function. We can try tweaking the weights we applied to each feature. But we can also try coming up with new features, or try different combinations of features. This week, we'll try the latter approach.

In the coming weeks as we start exploring machine learning, we'll use Tensor Flow with Haskell! To get prepared, download our Haskell Tensor Flow guide!

## Existing Features

Last week, we came up with a few different features that could help us navigate this maze. These features included:

1. Maze distance to goal
2. Manhattan distance to goal
3. Whether or not an enemy is on our location
4. Whether or not our stun is available
5. The number of drills we have available
6. The number of enemies that are nearby (using manhattan distance)

But there were some clear sub-optimal behaviors with our bot. We tend to get "zoned out" by enemies, even when they aren't near us by maze distance. Obviously, it would suit us to use maze distance instead of manhattan distance. But we also want to be willing to approach enemies aggressively when we have our stun, and retreat intelligently without it. To that end, let's add a couple more features:

1. The number of enemies on the shortest path to the goal.
2. The shortest distance to an enemy from a particular square (only up to 5)

We'll impose a penalty for close enemies if we don't have our stun. Otherwise we'll ignore this new feature. Then we'll also impose a penalty for having more enemies on our shortest path. This will make us more willing to use the stun, rather than waiting.

## Enemies In The Way

Our first order of business will be to determine how many enemies lie on our shortest path. We'll filter the path itself based on membership in the active enemies set:

``````evaluateWorld :: World -> Float
evaluateWorld w = ...
  where
    activeEnemyLocations = …

    shortestPath =
      getShortestPath (worldBoundaries w) playerLoc goalLoc

    enemiesOnPath = length $ filter
      (\l -> Set.member l (Set.fromList activeEnemyLocations))
      shortestPath``````

Then we'll assign each enemy on this path a penalty greater than the value of using the stun. We'll add this score to our other scores.

``````evaluateWorld :: World -> Float
evaluateWorld w =
enemiesOnPathScore +
...
where
enemiesOnPath = ...
enemiesOnPathScore = -85.0 * (fromIntegral enemiesOnPath)``````

## Maze Distance

Next let's get the shortest maze distance to a nearby enemy. We'll actually want to generalize the behavior of our existing BFS function for this. We want to find the shortest path to any one of the enemy locations. So instead of supplying a single target location, we'll supply a set of target locations. Then we'll cap the distance to search so we aren't doing a full BFS of the maze every time. This means adding an optional range parameter. Let's use these ideas to make an expanded API that our original function will use.

``````getShortestPathToTargetsWithLimit
:: Maze
-> Location
-> Set.Set Location
-> Maybe Int
-> [Location]
getShortestPathToTargetsWithLimit
maze initialLocation targetLocations maxRange = ...

-- Original function call!
getShortestPath maze initialLocation targetLocation =
getShortestPathToTargetsWithLimit maze initialLocation
(Set.singleton targetLocation) Nothing

bfs
:: Maze
-> Location
-> Set.Set Location -- Now a set of targets
-> Maybe Int -- Added range parameter
-> [Location]
bfs = ...``````

We'll have to make a few tweaks to our algorithm now. Each search state element will have a "distance" associated with it.

``````data BFSState = BFSState
  { bfsSearchQueue :: Seq.Seq (Location, Int)
  ...

-- Our initial state has a distance of 0
getShortestPathToTargetsWithLimit
  maze initialLocation targetLocations maxRange =
    evalState
      (bfs maze initialLocation targetLocations maxRange)
      (BFSState
        (Seq.singleton (initialLocation, 0))
        (Set.singleton initialLocation)
        Map.empty)``````

Now we need a couple modifications to the core `bfs` function. When extracting the next element in the queue, we have to consider its distance. All new items we create will increment that distance. And if we're at the max distance, we won't add anything to the queue. Finally, when evaluating if we're done, we'll check against the set of targets, rather than a single target. Here's our `bfs` code, with differences noted.

``````bfs
  :: Maze
  -> Location
  -> Set.Set Location
  -> Maybe Int
  -> State BFSState [Location]
bfs maze initialLocation targetLocations maxRange = do
  BFSState searchQueue visitedSet parentsMap <- get
  if Seq.null searchQueue
    then return []
    else do

      -- ! Unwrap distance as well
      let (nextLoc, distance) = Seq.index searchQueue 0

      -- ! Check set membership, not equality
      if Set.member nextLoc targetLocations
        then return (unwindPath parentsMap [nextLoc])
        else do

          -- ! Add the new distance to each adjacent cell
          let adjacentCells = (, distance + 1) <$> ...

          -- ! Account for the distance with a new helper function
          let unvisitedNextCells = filter
                (shouldAddNextCell visitedSet) adjacentCells

          let newSearchQueue = foldr
                (flip (Seq.|>))
                (Seq.drop 1 searchQueue)
                unvisitedNextCells
              newVisitedSet = Set.insert nextLoc visitedSet
              newParentsMap = foldr
                (\(l, _) -> Map.insert l nextLoc)
                parentsMap unvisitedNextCells
          put (BFSState newSearchQueue newVisitedSet newParentsMap)
          bfs maze initialLocation targetLocations maxRange
  where
    -- ! Helper function to account for distance when adding to queue
    shouldAddNextCell visitedSet (loc, distance) = case maxRange of
      Nothing -> not (Set.member loc visitedSet)
      Just x -> distance <= x && not (Set.member loc visitedSet)

    unwindPath parentsMap currentPath = ...``````

Now to use this feature, we'll call our new shortest path function. If the distance is 0, this means we have no enemies near us, and there's no penalty. We also won't apply a penalty if our stun is available. Otherwise, we'll provide a stiffer penalty the shorter the path is. Then we mix it in with the other scores.

``````evaluateWorld :: World -> Float
evaluateWorld w =
  ...
  nearestEnemyDistanceScore +
  ...
  where
    ...
    nearestEnemyDistance = length $ getShortestPathToTargetsWithLimit
      (worldBoundaries w)
      playerLoc
      (Set.fromList activeEnemyLocations)
      (Just 4)
    nearestEnemyDistanceScore =
      if nearestEnemyDistance == 0 || stunAvailable then 0.0
        else -100.0 * (fromIntegral (5 - nearestEnemyDistance))``````

We'll also drop the enemy manhattan distance weight to -5.0.

## Results

From this change, our player suddenly appears much more intelligent! It will back away from enemies when it is missing its stun. It will use the stun and go past the enemy when appropriate.

There are still ways we could improve the AI. It doesn't account for future space to retreat when running away. It sometimes uses the stun too early, when it might be better to wait for more enemies to come into range. But it's not clear how we could improve it by tweaking the weights. This means it's time to consider machine learning as an option to get better weights!

## Conclusion

Next week, we'll re-acquaint ourselves with the basics of machine learning and Tensor Flow. This will set us up to write a program that will determine our AI weights.

We're going to start working with Tensor Flow next week! To make sure you can keep up, download our Haskell Tensor Flow Guide. It'll help you with the basics of making this complex Haskell library work.

# Moving Towards ML: Evaluation Functions

Before we get started, here's a reminder that today (August 5th) is the last day of enrollment for our Haskell From Scratch course! Sign-ups close at midnight Pacific time! Don't miss out!

This week, we're going to start taking our AI in a somewhat new direction. Right now, we're hard-coding specific decisions for our player to make. But this week, we'll make a more general function for evaluating different positions. Our initial results will be inferior to the AI we've hand-coded. But we'll set ourselves up to have a much better AI in the future by applying machine learning.

For more details on the code for this article, take a look at the `evaluation-game-function` branch on our Github Repository! This article also starts our move towards machine learning related concepts. So now would be a good time to review our Haskell AI Series. You can download our Tensor Flow Guide to learn more about using Haskell and Tensor Flow!

## Evaluation as a Strategy

Currently, our AI follows a strict set of rules. It performs pretty well for the current problem space. But suppose circumstances changed. Suppose we use different maze structures. Or we could add a completely new feature to the game. In these cases, we might need a completely different set of ideas to build a competent AI.

Our new strategy will be much more general. We'll supply our AI with a function that can evaluate a particular board position. That is, it will look at the world, and create a numeric output scoring it. Then our brain will look at all possible moves, score each position, and choose the move with the best result.

If game rules change, we'll need to rethink the evaluation function. But, by making the problem one of numbers to numbers, it'll be easier to use machine learning (instead of our own logic) to devise this function. This way, we can radically change the nature of the game, and we won't need to do too much manual work to change the AI. We might need to add new features (as we'll discuss later). But otherwise we would just need to re-train the evaluation function.

## Top Down Development

To implement this approach, we'll put the "function" in functional programming. We'll start by outlining our decision making process with a series of type signatures. Let's remember that first, our overarching goal is a function that takes a `World` and gives us a `PlayerMove`:

``makePlayerMove :: World -> PlayerMove``

We should first determine the set of possible moves:

``possibleMoves :: World -> [PlayerMove]``

Then we'll need to calculate the new `World` from each of those moves. (We won't go over this function in this article. It mainly consists of refactoring code we already have for manipulating the game).

``applyPlayerMove :: World -> PlayerMove -> World``

Then we'll score each of those resulting worlds. This is where the real "brain" is going to live now:

``evaluateWorld :: World -> Float``

Now that we know the functions we're writing, we can already implement `makePlayerMove`. We'll assume our helpers already exist and then we apply the process outlined above:

``````makePlayerMove :: World -> PlayerMove
makePlayerMove w = bestMove
  where
    -- 1. Get our Moves
    allMoves = possibleMoves w

    -- 2. See what the results of each move are
    possibleWorlds = applyPlayerMove w <$> allMoves

    -- 3. Score each resulting world
    scores = evaluateWorld <$> possibleWorlds

    -- 4. Combine the world with its move and choose the best one
    movesWithScores = zip allMoves scores
    bestMove = fst $ maximumBy (\(_, score1) (_, score2) ->
      compare score1 score2) movesWithScores``````

This will compile, and we can now move on to the individual components.

## Getting Possible Moves

Let's start with getting all the possible moves. When it comes to movement, we generally have five options: stand still, or move in one of four directions. But if we're out of drills, or near the boundary of the world, this can restrict our options. But we always have the sure option of standing still, so let's start with that:

``````possibleMoves :: World -> [PlayerMove]
possibleMoves w = …
where
standStillMove = PlayerMove DirectionNone False DirectionNone
...``````

Now in every direction, we'll have a `Maybe` move possibility. If it's a `WorldBoundary`, we'll get `Nothing`. Otherwise if it's a wall, then we'll have a possible move as long as a drill is available. Otherwise the move is possible, and we won't need a drill. We'll wrap these behaviors in a helper function, and then it's easy to use that in each direction:

``````possibleMoves :: World -> [PlayerMove]
possibleMoves w = baseMoves
  where
    standStillMove = PlayerMove DirectionNone False DirectionNone
    player = worldPlayer w
    bounds = (worldBoundaries w) Array.! (playerLocation player)

    possibleMove :: (CellBoundaries -> BoundaryType) ->
      MoveDirection -> Maybe PlayerMove
    possibleMove boundaryFunc direction =
      case boundaryFunc bounds of
        WorldBoundary -> Nothing
        Wall _ -> if playerDrillsRemaining player > 0
          then Just $ PlayerMove direction False direction
          else Nothing
        AdjacentCell _ -> Just $
          PlayerMove direction False DirectionNone

    upMove = possibleMove upBoundary DirectionUp
    rightMove = possibleMove rightBoundary DirectionRight
    downMove = possibleMove downBoundary DirectionDown
    leftMove = possibleMove leftBoundary DirectionLeft

    baseMoves = standStillMove : (catMaybes [upMove, rightMove, downMove, leftMove])``````

Now we have to factor in that each move can also apply the stun if it's available.

``````possibleMoves :: World -> [PlayerMove]
possibleMoves w = baseMoves ++ stunMoves
where
...
baseMoves = standStillMove : (catMaybes [upMove, rightMove, downMove, leftMove])

stunMoves = if playerCurrentStunDelay player /= 0 then []
else [ m { activateStun = True } | m <- baseMoves ]``````

And now we've got our moves!

## Evaluating the Game Position

Now let's start tackling the problem of evaluating a particular game situation. Any manual solution we come up with here is likely to have problems. This is where machine learning will come in. But here's the general approach we want.

First, we'll select particular "features" of the world. For instance, how far away are we from the end of the maze? How many enemies are within our stun radius? We'll consider all these elements, and then come up with a "weight" for each feature. A weight is a measurement of whether that feature makes the position "good" or "bad". Then, we'll add together the weighted feature values to get a score. So here's a list of the features we're going to use:

1. How close are we (in maze search terms) from the target location? This will use pure BFS and it will not account for using drills.
2. How close are we in manhattan distance terms from the target location?
3. Is there an active enemy on the same square as the player (this will receive a heavy negative weight!)
4. How many enemies are within our stun radius?
5. Is our stun available?
6. How many drills do we have left?

Let's start by getting all these features:

``````evaluateWorld :: World -> Float
evaluateWorld w = ...
  where
    player = worldPlayer w
    playerLoc@(px, py) = playerLocation player
    radius = stunRadius . playerGameParameters . worldParameters $ w
    goalLoc@(gx, gy) = endLocation w
    activeEnemyLocations = enemyLocation <$>
      (filter (\e -> enemyCurrentStunTimer e == 0) (worldEnemies w))

    onActiveEnemy = playerLocation player `elem` activeEnemyLocations

    shortestPathLength = length $
      getShortestPath (worldBoundaries w) playerLoc goalLoc

    manhattanDistance = abs (gx - px) + abs (gy - py)

    stunAvailable = playerCurrentStunDelay player == 0

    numNearbyEnemies = length
      [ el | el@(elx, ely) <- activeEnemyLocations,
        abs (elx - px) <= radius && abs (ely - py) <= radius ]

    drillsRemaining = playerDrillsRemaining player``````

Now let's move on to assigning scores. If our player is on the same square as an active enemy, we lose. So let's give this a weight of -1000. Conversely, the closer we get to the target, the closer we are to winning. So let's devise a function where if that distance is 0, the score is 1000. Then the farther away we get, the more points we lose. Let's say, 20 points per square. For manhattan distance, we'll use a strict penalty, rather than reward:

``````evaluateWorld :: World -> Float
evaluateWorld w = ...
where
...
onActiveEnemyScore = if onActiveEnemy then -1000.0 else 0.0
shortestPathScore = 1000.0 - (20.0 * (fromIntegral shortestPathLength))
manhattanDistanceScore = (-5.0) * (fromIntegral manhattanDistance)``````

Now we want to generally reward having our power ups available to us. This will stop the bot from needlessly using them and also reward it for picking up new drills. We'll also penalize having enemies too close to us.

``````evaluateWorld :: World -> Float
evaluateWorld w = ...
where
...
stunAvailableScore = if stunAvailable then 80.0 else 0.0
numNearbyEnemiesScore = -100.0 * (fromIntegral numNearbyEnemies)
drillsRemainingScore = 30.0 * (fromIntegral drillsRemaining)``````

And to complete the function, we'll just add these together:

``````evaluateWorld :: World -> Float
evaluateWorld w =
onActiveEnemyScore +
shortestPathScore +
manhattanDistanceScore +
stunAvailableScore +
numNearbyEnemiesScore +
drillsRemainingScore``````

## How Well Does it Work?

When we run the game now with the AI active, we see some interesting behaviors. Our bot will generally navigate the maze well. Its path isn't optimal, as it is with `drillBFS`. But it makes decent choices about drilling. Its behavior around enemies is a bit strange. It tends to stay away from them, even if they're not actually close in maze distance. This makes it take longer than it needs to.

We still don't have good retreating behavior in certain cases. It will often stand still and let an enemy grab it instead of running away.

At this point, we have a couple options for improving the AI. First, we could try tweaking the weights. This will be tedious for us to do manually. This is why we want to apply machine learning techniques to come up with optimal weights.

But the other option is to update the feature space. If we can come up with more intelligent features, we won't need as precise weights.

## Conclusion

Next week, we'll try to fix our behavior around enemies. We'll use true maze distance in more places as opposed to manhattan distance. This should give us some big improvements. Then we'll start looking into how we can learn better weights.

We'll be coming up pretty soon on using Tensor Flow for this program! Download our Haskell Tensor Flow Guide to learn more!

And if you're still a Haskell beginner, there's never been a better time to learn! Register for our Haskell From Scratch course to jump-start your Haskell journey! Enrollment ends at midnight TODAY! (August 5th).

# Grenade! Dependently Typed Neural Networks

In the last couple weeks we explored one of the most complex topics I’ve presented on this blog. We examined potential runtime failures that can occur when using Tensor Flow. These included mismatched dimensions and missing placeholders. In an ideal world, we would catch these issues at compile time instead. At its current stage, the Haskell Tensor Flow library doesn’t support that. But we demonstrated that it is possible to add a layer to do this by using dependent types.

Now, I’m still very much of a novice at dependent types, so the solutions I presented were rather clunky. This week I'll show a better example of this concept from a different library. The Grenade library uses dependent types everywhere. It allows us to build verifiably-valid neural networks with extreme concision. So let’s dive in and see what it’s all about!

## Shapes and Layers

The first thing to learn with this library is the two concepts of Shapes and Layers. Shapes are best compared to tensors from Tensor Flow, except that they exist at the type level. In Tensor Flow we could build tensors with arbitrary dimensions. Grenade currently only supports up to three dimensions. So the different shape types start with `D1`, `D2`, or `D3`, depending on the dimensionality of the shape. Then each of these type constructors takes a set of natural number parameters. So the following are all valid “Shape” types within Grenade:

``````D1 5
D2 4 12
D3 8 10 2``````

The first represents a vector with 5 elements. The second represents a matrix with 4 rows and 12 columns. And the third represents an 8x10x2 matrix (or tensor, if you like). The different numbers represent those values at the type level, not the term level. If this seems confusing, here’s a good tutorial that goes into more depth about the basics of dependent types. The most important idea is that something of type `D1 5` can only have 5 elements. A vector of 4 or 6 elements will not type-check.

So now that we know about shapes, let’s examine layers. Layers describe relationships between our shapes. They encapsulate the transformations that happen on our data. The following are all valid layer types:

``````Relu
FullyConnected 10 20
Convolution 1 10 5 5 1 1``````

The layer `Relu` describes a layer that takes in data of any kind of shape and outputs the same shape. In between, it applies the `relu` activation function to the input data. Since it doesn’t change the shape, it doesn’t need any parameters.

A `FullyConnected` layer represents the canonical layer of a neural network. It has two parameters, one for the number of input neurons and one for the number of output neurons. In this case, the layer will take 10 inputs and produce 20 outputs.

A `Convolution` layer represents a 2D convolution like we saw with our MNIST network. This particular example has 1 input feature, 10 output features, uses a 5x5 patch size, and a 1x1 patch offset.

## Describing a Network

Now that we have a basic grasp on shapes and layers, we can see how they fit together to create a full network. A network type has two type parameters. The second parameter is a list of the shapes that our data takes at any given point throughout the network. The first parameter is a list of the layers representing the transformations on the data. So let’s say we wanted to describe a very simple network. It will take 4 inputs and produce 10 outputs using a fully connected layer. Then it will perform an `Relu` activation. This network looks like this:

``````type SimpleNetwork = Network
  '[FullyConnected 4 10, Relu]
  '[ 'D1 4, 'D1 10, 'D1 10]``````

The apostrophes in front of the lists and `D1` terms indicate that these are promoted constructors. So they are types instead of terms. To “read” this type, we start with the first data format. We go to each successive data format by applying the transformation layer. So for instance we start with a 4-vector, and transform it into a 10-vector with a fully-connected layer. Then we transform that 10-vector into another 10-vector by applying `relu`. That’s all there is to it! We could apply another `FullyConnected` layer onto this that will have 3 outputs like so:

``````type SimpleNetwork = Network
  '[FullyConnected 4 10, Relu, FullyConnected 10 3]
  '[ 'D1 4, 'D1 10, 'D1 10, 'D1 3]``````

Let's look at MNIST to see a more complicated example. We'll start with a 28x28 image of data. Then we’ll perform the convolution layer I mentioned above. This gives us a 3-dimensional tensor of size 24x24x10. Then we can perform 2x2 max pooling on this, resulting in a 12x12x10 tensor. Finally, we can apply an `Relu` layer, which keeps it at the same size:

``````type MNISTStart = Network
  '[Convolution 1 10 5 5 1 1, Pooling 2 2 2 2, Relu]
  '[ 'D2 28 28, 'D3 24 24 10, 'D3 12 12 10, 'D3 12 12 10]``````

Here’s what a full MNIST example might look like (per the README on the library’s Github page):

``````type MNIST = Network
'[ Convolution 1 10 5 5 1 1, Pooling 2 2 2 2, Relu
, Convolution 10 16 5 5 1 1, Pooling 2 2 2 2, FlattenLayer, Relu
, FullyConnected 256 80, Logit, FullyConnected 80 10, Logit]
'[ 'D2 28 28, 'D3 24 24 10, 'D3 12 12 10, 'D3 12 12 10
, 'D3 8 8 16, 'D3 4 4 16, 'D1 256, 'D1 256
, 'D1 80, 'D1 80, 'D1 10, 'D1 10]``````

This is a much simpler and more concise description of our network than we can get in Tensor Flow! Let’s examine the ways in which the library uses dependent types to its advantage.

## The Magic of Dependent Types

Describing our network as a type seems like a strange idea if you’ve never used dependent types before. But it gives us a couple great perks!

The first major win we get is that it is very easy to generate the starting values of our network. Since it has a specific type, we can let type inference guide us! We don’t need any term level code that is specific to the shape of our network. All we need to do is attach the type signature and call `randomNetwork`!

``````randomSimple :: MonadRandom m => m SimpleNetwork
randomSimple = randomNetwork``````

This will give us all the initial values we need, so we can get going!

The second (and more important) win is that we can’t build an invalid network! Suppose we try to take our simple network and somehow format it incorrectly. For instance, we could say that instead of the input shape being of size 4, it’s of size 7:

``````type SimpleNetwork = Network
  '[FullyConnected 4 10, Relu, FullyConnected 10 3]
  '[ 'D1 7, 'D1 10, 'D1 10, 'D1 3]
  -- ^^ Notice this 7``````

This will result in a compile error, since there is a mismatch between the layers. The first layer expects an input of 4, but the first data format is of length 7!

``````Could not deduce (Layer (FullyConnected 4 10) ('D1 7) ('D1 10))
    arising from a use of ‘randomNetwork’
  from the context: MonadRandom m
    bound by the type signature for:
      randomSimple :: MonadRandom m => m SimpleNetwork``````

In other words, it notices that the chain from `D1 7` to `D1 10` using a `FullyConnected 4 10` layer is invalid. So it doesn’t let us make this network. The same thing would happen if we made the layers themselves invalid. For instance, we could make the output and input of the two fully-connected layers not match up:

``````-- We changed the second to take 20 as the number of input elements.
type SimpleNetwork = Network
  '[FullyConnected 4 10, Relu, FullyConnected 20 3]
  '[ 'D1 4, 'D1 10, 'D1 20, 'D1 3]

…

• Could not deduce (Layer (FullyConnected 20 3) ('D1 10) ('D1 3))
    arising from a use of ‘randomNetwork’
  from the context: MonadRandom m
    bound by the type signature for:
      randomSimple :: MonadRandom m => m SimpleNetwork``````

So Grenade makes our program much safer by providing compile time guarantees about our network's validity. Runtime errors due to dimensionality are impossible!

## Training the Network on Iris

Now let’s do a quick run-through of how we actually train this neural network. Readers with a keen eye may have noticed that the `SimpleNetwork` we’ve built is the same network we used to train the Iris data set. So we’ll do a training run there, using the following steps:

1. Write the network type and generate a random network from it
2. Read our input data into a format that Grenade uses
3. Write a function to run a training iteration.
4. Run it!

### 1. Write the Network type and Generate Network

So we've already done this first step for the most part. We’ll adjust the names a little bit though. Note that I’ll include the imports list as an appendix to the post. Also, the code is on the `grenade` branch of my Haskell Tensor Flow repository in `IrisGrenade.hs`!

``````type IrisNetwork = Network
  '[FullyConnected 4 10, Relu, FullyConnected 10 3]
  '[ 'D1 4, 'D1 10, 'D1 10, 'D1 3]

randomIris :: MonadRandom m => m IrisNetwork
randomIris = randomNetwork

runIris :: FilePath -> FilePath -> IO ()
runIris trainingFile testingFile = do
  initialNetwork <- randomIris
  ...``````

### 2. Take in our Input Data

We’ll make use of the `readIrisFromFile` function we used back when we first worked with the Iris data set. Then we'll define an `IrisRow` type using Grenade's `S` type. This `S` type is a container for a shape. We want our input data to use `D1 4` for the 4 input features. Then our output data should use `D1 3` for the three possible categories.

``````-- Dependent type on the dimensions of the row
type IrisRow = (S ('D1 4), S ('D1 3))``````

If the data is malformed, the types won't match up, so our parsing function returns a `Maybe` value to handle failure. Note that we normalize the data by dividing by 8. This puts all the data between 0 and 1 and makes for better training results. Here's how we parse the data:

``````parseRecord :: IrisRecord -> Maybe IrisRow
parseRecord record = case (input, output) of
  (Just i, Just o) -> Just (i, o)
  _ -> Nothing
  where
    input = fromStorable $ VS.fromList $ float2Double <$>
      [ field1 record / 8.0, field2 record / 8.0, field3 record / 8.0, field4 record / 8.0]
    output = oneHot (fromIntegral $ label record)``````

Then we incorporate these into our main function:

``````runIris :: FilePath -> FilePath -> IO ()
runIris trainingFile testingFile = do
  initialNetwork <- randomIris
  trainingRecords <- readIrisFromFile trainingFile
  testRecords <- readIrisFromFile testingFile

  let trainingData = mapMaybe parseRecord (V.toList trainingRecords)
  let testData = mapMaybe parseRecord (V.toList testRecords)

  -- Catch if any were parsed as Nothing
  if length trainingData /= length trainingRecords || length testData /= length testRecords
    then putStrLn "Hmmm there were some problems parsing the data"
    else …``````

### 3. Write a Function to Train the Input Data

This is a multi-step process. First we’ll establish our learning parameters. We'll also write a function that will allow us to call the `train` function on a particular row element:

``````learningParams :: LearningParameters
learningParams = LearningParameters 0.01 0.9 0.0005 -- rate, momentum, regularizer

-- Train the network!
trainRow :: LearningParameters -> IrisNetwork -> IrisRow -> IrisNetwork
trainRow lp network (input, output) = train lp network input output``````

Next we’ll write two more helper functions that will help us test our results. The first will take the network and a test row, and produce the network's predicted output alongside the true output. The second function will take these outputs and reverse the `oneHot` process to get the labels out (0, 1, or 2).

``````-- Takes a test row; returns the network's prediction and the true output.
testRow :: IrisNetwork -> IrisRow -> (S ('D1 3), S ('D1 3))
testRow net (rowInput, trueOutput) = (runNet net rowInput, trueOutput)

-- Goes from probability output vectors to labels
getLabels :: (S ('D1 3), S ('D1 3)) -> (Int, Int)
getLabels (S1D predictedOutput, S1D trueOutput) =
  (maxIndex (extract predictedOutput), maxIndex (extract trueOutput))``````

Finally we’ll write a function that will take our training data, test data, the network, and an iteration number. It will return the newly trained network and log some results about how we’re doing. We’ll first take a random sample of our training data and adjust our learning rate so that it decays as the iterations progress. Then we'll train the network by folding over the sampled data.

``````run :: [IrisRow] -> [IrisRow] -> IrisNetwork -> Int -> IO IrisNetwork
run trainData testData network iterationNum = do
  sampledRecords <- V.toList <$> chooseRandomRecords (V.fromList trainData)
  -- Slowly drop the learning rate
  let revisedParams = learningParams
        { learningRate = learningRate learningParams * 0.99 ^ iterationNum }
  let newNetwork = foldl' (trainRow revisedParams) network sampledRecords
  ...``````

Then we’ll wrap up the function by looking at our test data, and seeing how much we got right!

``````run :: [IrisRow] -> [IrisRow] -> IrisNetwork -> Int -> IO IrisNetwork
run trainData testData network iterationNum = do
  sampledRecords <- V.toList <$> chooseRandomRecords (V.fromList trainData)
  -- Slowly drop the learning rate
  let revisedParams = learningParams
        { learningRate = learningRate learningParams * 0.99 ^ iterationNum }
  let newNetwork = foldl' (trainRow revisedParams) network sampledRecords
  let labelVectors = fmap (testRow newNetwork) testData
  let labelValues = fmap getLabels labelVectors
  let total = length labelValues
  let correctEntries = length $ filter ((==) <$> fst <*> snd) labelValues
  putStrLn $ "Iteration: " ++ show iterationNum
  putStrLn $ show correctEntries ++ " correct out of: " ++ show total
  return newNetwork``````

### 4. Run it!

We’ll call this now from our main function, iterating 100 times, and we’re done!

``````runIris :: FilePath -> FilePath -> IO ()
runIris trainingFile testingFile = do
  ...
  if length trainingData /= length trainingRecords || length testData /= length testRecords
    then putStrLn "Hmmm there were some problems parsing the data"
    else foldM_ (run trainingData testData) initialNetwork [1..100]``````

## Comparing to Tensor Flow

So now that we’ve looked at a different library, we can consider how it stacks up against Tensor Flow. So first, the advantages. Grenade's main advantage is that it provides dependent type facilities. This means it is more difficult to write incorrect programs. The basic networks you build are guaranteed to have the correct dimensionality. Additionally, it does not use a “placeholders” system, so you can avoid those kinds of errors too. This means you're likely to have fewer runtime bugs using Grenade.

Concision is another major strong point. The training code got a bit involved when translating our data into Grenade's format. But it’s no more complicated than Tensor Flow. When it comes down to the exact definition of the network itself, we do this in only a few lines with Grenade. It’s complicated to understand what those lines mean if you are new to dependent types. But after seeing a few simple examples you should be able to follow the general pattern.

Of course, none of this means that Tensor Flow is without its advantages. As we saw a couple weeks ago, it is not too difficult to add very thorough logging to your Tensor Flow program. The Tensor Board application will then give you excellent visualizations of this data. It is somewhat more difficult to get intermediate log results with Grenade. There is not too much transparency (that I have found at least) into the inner values of the network. The network types are composable though. So it is possible to get intermediate steps of your operation. But if you break your network into different types and stitch them together, you will remove some of the concision of the network.

Tensor Flow also has a much richer ecosystem of machine learning tools to access. Grenade is still limited to a subset of the most common machine learning layers, like convolution and max pooling. Tensor Flow’s API supports approaches like support vector machines and linear models. So Tensor Flow offers you more options.

One question I may explore in a future article is how the two libraries compare in performance. My suspicion is that Tensor Flow is faster, since it pushes all its math down to the C level. But I’m not too familiar yet with HMatrix (which Grenade depends on for its math) and its efficiency. So I could definitely be wrong.

## Conclusion

Grenade provides some truly awesome facilities for building a concise neural network. A Grenade program can demonstrate at compile time that the network is well formed. It also allows an incredibly concise way to define what layers your neural network has. It doesn’t have the Google-level support that Tensor Flow does, so it lacks cool features like logging and visualizations. But it is quite a neat library for its scope. One thing I haven’t mentioned is its mechanics for Generative/Adversarial networks. I’d definitely like to try that out soon!

Grenade is a simpler library to incorporate into Stack compared to Tensor Flow. If you want to compare the two, you should check out our Haskell Tensor Flow guide so you can install TF and get started!

If you’ve never written a line of Haskell before, never fear! Download our Getting Started Checklist for some free resources to start your Haskell education!

## Appendix: Compiler Extensions and Imports

``````{-# LANGUAGE DataKinds #-}
{-# LANGUAGE BangPatterns #-}
{-# LANGUAGE TupleSections #-}
{-# LANGUAGE GADTs #-}

import           Data.Foldable (foldl')
import           Data.Maybe (mapMaybe)
import qualified Data.Vector.Storable as VS
import qualified Data.Vector as V
import           GHC.Float (float2Double)
import           Numeric.LinearAlgebra (maxIndex)
import           Numeric.LinearAlgebra.Static (extract)

import           Processing (IrisRecord(..), readIrisFromFile, chooseRandomRecords)``````

# Checking it's all in Place: Placeholders and Dependent Types

Last week we dove into the world of dependent types. We linked tensors with their shapes at the type level. This gave our program some extra type safety and allowed us to avoid certain runtime errors.

This week, we’re going to solve another runtime conundrum: missing placeholders. We’ll add some more dependent type machinery to ensure we've plugged in all the necessary placeholders! But we’ll see this is not as straightforward as shapes.

Now to start, let’s remind ourselves what placeholders are in Tensor Flow and how we use them.

## Placeholder Review

Placeholders represent tensors that can have different values on different application runs. This is often the case when we’re training on different samples of data. Here’s our very simple example in Python. We’ll create a couple placeholder tensors by providing their shapes and no values. Then when we actually run the session, we’ll provide a value for each of those tensors.

``````node1 = tf.placeholder(tf.float32)
node2 = tf.placeholder(tf.float32)
adderNode = tf.add(node1, node2)
sess = tf.Session()
result1 = sess.run(adderNode, {node1: 3, node2: 4.5 })``````

The weakness here is that there’s nothing forcing us to provide values for those tensors! We could try running our program without them and we’ll get a runtime crash:

``````...
sess = tf.Session()
print(result1)
…

Terminal Output:

InvalidArgumentError (see above for traceback): You must feed a value for placeholder tensor 'Placeholder' with dtype float
[[Node: Placeholder = Placeholder[dtype=DT_FLOAT, shape=[], _device="/job:localhost/replica:0/task:0/cpu:0"]()]]``````

Unfortunately, the Haskell Tensor Flow library doesn’t actually do any better here. When we want to fill in placeholders, we provide a list of “feeds”. But our program will still compile even if we pass an empty list! We’ll encounter similar runtime errors:

``````(node1 :: Tensor Value Float) <- placeholder [1]
(node2 :: Tensor Value Float) <- placeholder [1]
let adderNode = node1 `add` node2
let runStep = \node1Feed node2Feed -> runWithFeeds [] adderNode
runStep (encodeTensorData [1] input1) (encodeTensorData [1] input2)
…

Terminal Output:

TensorFlowException TF_INVALID_ARGUMENT "You must feed a value for placeholder tensor 'Placeholder_1' with dtype float and shape \n\t [[Node: Placeholder_1 = Placeholder[dtype=DT_FLOAT, shape=, _device=\"/job:localhost/replica:0/task:0/cpu:0\"]()]]"``````

In the Iris and MNIST examples, we bury the call to `runWithFeeds` within our neural network API. We expose only a `Model` object. This model object forces us to provide the expected input and output tensors. So anyone using our model never makes a manual `runWithFeeds` call.

``````data Model = Model
  { train :: TensorData Float
          -> TensorData Int64
          -> Session ()
  , errorRate :: TensorData Float
              -> TensorData Int64
              -> SummaryTensor
              -> Session (Float, ByteString)
  }``````

This isn’t a bad solution! But it’s interesting to see how we can push the envelope with dependent types, so let’s try that!

## Adding More “Safe” Types

The first step we’ll take is to augment Tensor Flow’s `TensorData` type. We’ll want it to have shape information like `SafeTensor` and `SafeShape`. But we’ll also attach a name to each piece of data. This will allow us to identify which tensor to substitute the data in for. At the type level, we refer to this name as a `Symbol`.

``````data SafeTensorData a (n :: Symbol) (s :: [Nat]) where
  SafeTensorData :: (TensorType a) => TensorData a -> SafeTensorData a n s``````

Next, we’ll need to make some changes to our `SafeTensor` type. First, each `SafeTensor` will get a new type parameter. This parameter refers to a mapping of names (symbols) to shapes (which are still lists of naturals). We'll call this a placeholder list. So each tensor will have type-level information for the placeholders it depends on. Each different placeholder has a name and a shape.

``````data SafeTensor v a (s :: [Nat]) (p :: [(Symbol, [Nat])]) where
  SafeTensor :: (TensorType a) => Tensor v a -> SafeTensor v a s p``````

Now, recall that when we substituted for placeholders, we used a list of feeds. But this list had no information about the names or dimensions of its feeds. Let's create a new type containing the different elements we need for our feeds. It should also contain the correct type information about the placeholder list. The first step is to define the type so that it carries the list of placeholders it contains, like the `SafeTensor`.

``data FeedList (pl :: [(Symbol, [Nat])]) where``

This structure will look like a linked list, like our `SafeShape`. Thus we’ll start by defining an “empty” constructor:

``````data FeedList (pl :: [(Symbol, [Nat])]) where
  EmptyFeedList :: FeedList '[]``````

Now we’ll add a “Cons”-like constructor by creating yet another type operator `:--:`. Each “piece” of our linked list will contain two different items. First, the tensor we are substituting for. Next, it will have the data we’ll be using for the substitution. We can use type parameters to force their shapes and data types to match. Then we need the resulting placeholder type. We have to append the type-tuple containing the symbol and shape to the previous list. This completes our definition.

``````data FeedList (pl :: [(Symbol, [Nat])]) where
  EmptyFeedList :: FeedList '[]
  (:--:) :: (KnownSymbol n)
    => (SafeTensor Value a s p, SafeTensorData a n s)
    -> FeedList pl
    -> FeedList ( '(n, s) ': pl)

infixr 5 :--:``````

Note that we force the tensor to be a `Value` tensor. We can only substitute data for rendered tensors, hence this restriction. Let's add a quick `safeRender` so we can render our `SafeTensor` items.

``````safeRender :: (MonadBuild m) => SafeTensor Build a s pl -> m (SafeTensor Value a s pl)
safeRender (SafeTensor t1) = do
  t2 <- render t1
  return $ SafeTensor t2``````

## Making a Placeholder

Now we can write our `safePlaceholder` function. We’ll add a `KnownSymbol` as a type constraint. Then we’ll take a `SafeShape` to give ourselves the type information for the shape. The result is a new tensor that maps the symbol and the shape in the placeholder list.

``````safePlaceholder :: (MonadBuild m, TensorType a, KnownSymbol sym) =>
  SafeShape s -> m (SafeTensor Value a s '[ '(sym, s)])
safePlaceholder shp = do
  pl <- placeholder (toShape shp)
  return $ SafeTensor pl``````

This looks a little crazy, and it kind of is! But we’ve now created a tensor that stores its own placeholder information at the type level!
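For example, here’s a small sketch (the `"inputs"` label and the shape are arbitrary choices for illustration). The type annotation is what pins down the symbol:

``````-- The annotation chooses the "inputs" label; without it, GHC
-- has no way to infer which Symbol we mean.
inputPlaceholder :: Build (SafeTensor Value Float '[4] '[ '("inputs", '[4])])
inputPlaceholder = safePlaceholder (fromJust $ fromShape (Shape [4]))``````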

## Updating Old Code

Now that we’ve done this, we’re also going to have to update some of our older code. The first part of this is pretty straightforward. We’ll need to change `safeConstant` so that it has the type information. It will have an empty list for the placeholders.

``````safeConstant :: (TensorType a, ShapeProduct s ~ n) =>
  Vector n a -> SafeShape s -> SafeTensor Build a s '[]
safeConstant elems shp = SafeTensor (constant (toShape shp) (toList elems))``````

Our mathematical operations will be a bit more tricky though. Consider adding two arbitrary tensors. They may share placeholder dependencies, or they may not. What should the placeholder type of the resulting tensor be? The union of the two input placeholder maps, of course! Luckily for us, we can use `Union` from the `type-list` library to represent this concept.

``````safeAdd :: (TensorType a, a /= Bool, TensorKind v)
  => SafeTensor v a s p1
  -> SafeTensor v a s p2
  -> SafeTensor Build a s (Union p1 p2)
safeAdd (SafeTensor t1) (SafeTensor t2) = SafeTensor (t1 `add` t2)``````

We’ll make the same update with matrix multiplication:

``````safeMatMul :: (TensorType a, a /= Bool, a /= Int8, a /= Int16,
  a /= Int64, a /= Word8, a /= ByteString, TensorKind v)
  => SafeTensor v a '[i,n] p1 -> SafeTensor v a '[n,o] p2 -> SafeTensor Build a '[i,o] (Union p1 p2)
safeMatMul (SafeTensor t1) (SafeTensor t2) = SafeTensor (t1 `matMul` t2)``````

## Running with Placeholders

Now we have all the information we need to write our `safeRun` function. This will take a `SafeTensor`, and it will also take a `FeedList` with the same placeholder type. Remember, a `FeedList` contains a series of `SafeTensorData` items. They must match up symbol-for-symbol and shape-for-shape with the placeholders within the `SafeTensor`. Let’s look at the type signature:

``````safeRun :: (TensorType a, Fetchable (Tensor v a) r) =>
  FeedList pl -> SafeTensor v a s pl -> Session r``````

The `Fetchable` constraint enforces that we can actually get the “result” `r` out of our tensor. For instance, we can "fetch" a vector of floats out of a tensor that uses `Float` as its underlying value.
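As a small sketch of that idea (`fetchFloats` is just an illustrative name, using the imports from the appendix):

``````-- Running a Float tensor "fetches" a plain vector of Floats.
fetchFloats :: Tensor Build Float -> Session (VN.Vector Float)
fetchFloats t = run t``````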

We’ll next define a tail-recursive helper function to build the vanilla “list of feeds” out of our `FeedList`. Through pattern matching, we can pick out the tensor to substitute for and the data we’re using. We can combine these into a feed and append to the growing list:

``````safeRun = ...
  where
    buildFeedList :: FeedList ss -> [Feed] -> [Feed]
    buildFeedList EmptyFeedList accum = accum
    buildFeedList ((SafeTensor tensor_, SafeTensorData data_) :--: rest) accum =
      buildFeedList rest ((feed tensor_ data_) : accum)``````

Now all we have to do to finish up is call the normal `runWithFeeds` function with the list we’ve created!

``````safeRun :: (TensorType a, Fetchable (Tensor v a) r) =>
  FeedList pl -> SafeTensor v a s pl -> Session r
safeRun feeds (SafeTensor finalTensor) = runWithFeeds (buildFeedList feeds []) finalTensor
  where
    ...``````

And here’s what it looks like to use this in practice with our simple example. Notice the type signatures do get a little cumbersome. The signatures we place on the initial placeholder tensors are necessary. Otherwise the compiler wouldn't know what label we're giving them! The signature containing the union of the types is unnecessary. We can remove it if we want and let type inference do its work.

``````main3 :: IO (VN.Vector Float)
main3 = runSession $ do
  let (shape1 :: SafeShape '[2,2]) = fromJust $ fromShape (Shape [2,2])
  (a :: SafeTensor Value Float '[2,2] '[ '("a", '[2,2])]) <- safePlaceholder shape1
  (b :: SafeTensor Value Float '[2,2] '[ '("b", '[2,2])]) <- safePlaceholder shape1
  let result = a `safeAdd` b
  (result_ :: SafeTensor Value Float '[2,2] '[ '("b", '[2,2]), '("a", '[2,2])]) <- safeRender result
  let (feedA :: Vector 4 Float) = fromJust $ fromList [1,2,3,4]
  let (feedB :: Vector 4 Float) = fromJust $ fromList [5,6,7,8]
  let fullFeedList = (b, safeEncodeTensorData shape1 feedB) :--:
                     (a, safeEncodeTensorData shape1 feedA) :--:
                     EmptyFeedList
  safeRun fullFeedList result_

{- It runs!
[6.0,8.0,10.0,12.0]
-}``````

Now suppose we make some mistakes with our types. Here we’ll take out the “A” feed from our feed list:

``````-- Let’s take out Feed A!
main = …
  let fullFeedList = (b, safeEncodeTensorData shape1 feedB) :--:
                     EmptyFeedList
  safeRun fullFeedList result_

{- Compiler Error!
• Couldn't match type ‘'['("a", '[2, 2])]’ with ‘'[]’
  Expected type: SafeTensor Value Float '[2, 2] '['("b", '[2, 2])]
    Actual type: SafeTensor
                   Value Float '[2, 2] '['("b", '[2, 2]), '("a", '[2, 2])]
-}``````

Here’s what happens when we try to substitute a vector with the wrong size. It will identify that we have the wrong number of elements!

``````main = …
  -- Wrong Size!
  let (feedA :: Vector 8 Float) = fromJust $ fromList [1,2,3,4,5,6,7,8]
  let (feedB :: Vector 4 Float) = fromJust $ fromList [5,6,7,8]
  let fullFeedList = (b, safeEncodeTensorData shape1 feedB) :--:
                     (a, safeEncodeTensorData shape1 feedA) :--:
                     EmptyFeedList
  safeRun fullFeedList result_

{- Compiler Error!
Couldn't match type ‘4’ with ‘8’
  arising from a use of ‘safeEncodeTensorData’
-}``````

## Conclusion: Pros and Cons

So let’s take a step back and look at what we’ve constructed here. We’ve managed to provide ourselves with some pretty cool compile-time guarantees. We’ve also added de facto documentation to our code. Anyone familiar with the codebase can tell at a glance what placeholders we need for each tensor. It’s a lot harder now to write incorrect code. There are still error conditions of course. But if we’re smart we can handle these all upfront. That way we can fail gracefully instead of throwing a random runtime crash somewhere.

But there are drawbacks. Imagine being a Haskell novice and walking into this codebase. You’ll have no real clue what’s going on (I wouldn’t have 2 months ago). The types get very cumbersome after a while, so continuing to write them out gets tedious. As I mentioned, type inference can deal with a lot of that. But if you don’t track the types yourself, the type union can be finicky about the ordering of your placeholders. We could fix this with another type family though.

All these factors could present a real drag on development. But then again, tracking down run-time errors can also do this. Tensor Flow’s error messages can still be a little cryptic. This can make it hard to find root causes.

Since I’m still a novice with dependent types, this code was a little messy. Next week we’ll take a look at a more polished library that uses dependent types for neural networks. We’ll see how the Grenade library allows us to specify a learning system in just a few lines of code!

If you’re new to Haskell, I hope none of this dependent type madness scared you! The language is much easier than these last couple posts make it seem! Try it out, and download our Getting Started Checklist. It'll give you some instructions and tools to help you learn!

If you’re an experienced Haskeller and want to try out Tensor Flow, download our Tensor Flow Guide! It will walk you through incorporating the library into a Stack project!

## Appendix: Compiler Extensions and Imports

``````{-# LANGUAGE GADTs                #-}
{-# LANGUAGE DataKinds            #-}
{-# LANGUAGE KindSignatures       #-}
{-# LANGUAGE TypeOperators        #-}
{-# LANGUAGE ScopedTypeVariables  #-}
{-# LANGUAGE TypeFamilies         #-}
{-# LANGUAGE FlexibleContexts     #-}
{-# LANGUAGE UndecidableInstances #-}

import           Data.ByteString (ByteString)
import           Data.Int (Int64, Int8, Int16)
import           Data.Maybe (fromJust)
import           Data.Proxy (Proxy(..))
import           Data.Type.List (Union)
import qualified Data.Vector as VN
import           Data.Vector.Sized (Vector, toList, fromList)
import           Data.Word (Word8)
import           GHC.TypeLits

import           TensorFlow.Core
import           TensorFlow.Ops (constant, add, matMul, placeholder)
import           TensorFlow.Session (runSession, run)
import           TensorFlow.Tensor (TensorKind)``````

# Deep Learning and Deep Types: Tensor Flow and Dependent Types

In the introduction to this series, one primary point I made was that Haskell is a safe language. There are a lot of errors we will catch at compile time, rather than runtime. Runtime errors can often be catastrophic to a system, so being able to reduce these is paramount. This is especially true when programming an autonomous car or drone. These objects will be out in the real world where they can hurt people if they malfunction.

So let’s take a look back at some of the code we’ve written over the last 3 or 4 weeks. Is it actually any safer? We’ll find the answer is, well, not so much. It's hard to verify certain properties about code. But the facilities for making this code safer do exist in Haskell! In the next couple articles we'll do some serious hacking with dependent types. We'll be able to prove some of these difficult properties of AI programs at compile time!

The next three articles will focus on dependent type programming. This is a difficult topic, so don’t worry if you can’t follow all the code examples completely. The main idea of making our machine learning code safer is what’s important! So without further ado, let’s rewind to the beginning to see where runtime issues can appear.

If you want to play with this code yourself, check out the dependent shapes branch on my Github repository! All the code for this article is in DepShape.hs. If you want to get the code to run, you'll probably also need to get Haskell Tensor Flow working. Download our Haskell Tensor Flow Guide for instructions on that!

## Issues with Python

Python, as an interpreted language, is definitely subject to runtime bugs. As I was first learning Tensor Flow, I came across a lot of these. The two that stood out to me most were placeholder failures and dimension mismatches. For instance, let’s think back to one of the first examples. Our code will have a couple of placeholders, and we submit values for those when we run the session:

``````node1 = tf.placeholder(tf.float32)
node2 = tf.placeholder(tf.float32)
adderNode = tf.add(node1, node2)
sess = tf.Session()
result1 = sess.run(adderNode, {node1: 3, node2: 4.5 })``````

But there’s nothing stopping us from trying to run the session without submitting values. This will result in a runtime crash:

``````...
sess = tf.Session()
print(result1)
…

Terminal Output:

InvalidArgumentError (see above for traceback): You must feed a value for placeholder tensor 'Placeholder' with dtype float
[[Node: Placeholder = Placeholder[dtype=DT_FLOAT, shape=[], _device="/job:localhost/replica:0/task:0/cpu:0"]()]]``````

Another issue that came up from time to time was dimension mismatches. Certain operations need certain relationships between the dimensions of the tensors. For instance, you can’t add two vectors with different lengths:

``````node1 = tf.constant([3.0, 4.0, 5.0], dtype=tf.float32)
node2 = tf.constant([4.0, 16.0], dtype=tf.float32)
result = tf.add(node1, node2)

sess = tf.Session()
print(result)

…

Terminal Output:

ValueError: Dimensions must be equal, but are 3 and 2 for 'Add' (op: 'Add') with input shapes: [3], [2].``````

Again, we get a runtime crash. These seem like the kinds of problems we can solve at compile time.

## Does Haskell Solve these Issues?

But anyone who takes a close look at the Haskell code I’ve written so far can see that it doesn’t solve these issues! Here’s a review of our basic placeholder example:

``````runPlaceholder :: Vector Float -> Vector Float -> IO (Vector Float)
runPlaceholder input1 input2 = runSession $ do
  (node1 :: Tensor Value Float) <- placeholder [1]
  (node2 :: Tensor Value Float) <- placeholder [1]
  let adderNode = node1 `add` node2
  let runStep = \node1Feed node2Feed -> runWithFeeds
        [ feed node1 node1Feed
        , feed node2 node2Feed
        ]
        adderNode
  runStep (encodeTensorData [1] input1) (encodeTensorData [1] input2)``````

Notice how the `runWithFeeds` function takes a list of `Feed` objects. The code would still compile fine if we supplied the empty list. Then it would face a fate no better than our Python code:

``````…
let runStep = \node1Feed node2Feed -> runWithFeeds [] adderNode
…

Terminal Output:

TensorFlowException TF_INVALID_ARGUMENT "You must feed a value for placeholder tensor 'Placeholder_1' with dtype float and shape \n\t [[Node: Placeholder_1 = Placeholder[dtype=DT_FLOAT, shape=, _device=\"/job:localhost/replica:0/task:0/cpu:0\"]()]]"``````

For the second example of dimensionality, we can also make this mistake in Haskell. The following code compiles and will crash at runtime:

``````runSimple :: IO (Vector Float)
runSimple = runSession $ do
  let node1 = constant [3] [3 :: Float, 4, 5]
  let node2 = constant [2] [4 :: Float, 5]
  …

Terminal Output:
TensorFlowException TF_INVALID_ARGUMENT "Incompatible shapes: [3] vs. [2] \n\t [[Node: Add_2 = Add[T=DT_FLOAT, _device=\"/job:localhost/replica:0/task:0/cpu:0\"](Const_0, Const_1)]]"``````

At an even more basic level, we don’t even have to tell the truth about the shape of our vectors! We can give a bogus shape value and it will still compile!

``````let node1 = constant [3, 2, 3] [3 :: Float, 4, 5]
…

Terminal Output:
invalid tensor length: expected 18 got 3
CallStack (from HasCallStack):
error, called at src/TensorFlow/Ops.hs:299:23 in tensorflow-ops-0.1.0.0-EWsy8DQdciaL8o6yb2fUKR:TensorFlow.Ops``````

## Can we do better?

Now, we did do some things right. Let's think back to our `Model` type when we made neural networks.

``````data Model = Model
  { train :: TensorData Float
          -> TensorData Int64
          -> Session ()
  , errorRate :: TensorData Float
              -> TensorData Int64
              -> SummaryTensor
              -> Session (Float, ByteString)
  }``````

We exposed our training step as a function. This function forced the user to supply both of the tensors for the placeholders. This is good, but doesn't protect us from dimension issues.

When trying to solve these, we could write wrappers around every operation. Functions like `add` and `matMul` could return `Maybe` values. But this would be clunky. We could take this same step in Python. Granted, monads would allow the Haskell version to compose better. But it would be nicer if we could check our errors all at once, up front.

If we’re willing to dig quite a bit deeper, we can solve these problems! In the rest of this post, we’ll explore using dependent types to ensure dimensions are always correct. Getting placeholders right turns out to be a little more complicated though! So we’ll save that for next week’s post.

## Checking Dimensions

Currently, the tensor types we’ve been dealing with have no type safety on the dimensions. Tensor Flow doesn't provide this information when interacting with the C library, so it’s impossible to enforce at a low level. But this doesn’t stop us from writing wrappers that allow us to solve this.

To write these wrappers, we’re going to need to dive into dependent types. I’ll give a high-level overview of what’s going on. But for some details on the basics, you should check out this tutorial. I’ll also give a shout-out to Renzo Carbonara, author of the Exinst library and other great Haskell things. He helped me a lot in crossing a couple big knowledge gaps for implementing dependent types.

## Intro to Dependent Types: Sized Vectors

The simplest example for introducing dependent types is the idea of sized vectors. If you read the tutorial above, you'll see how they're implemented from scratch. A normal vector has a single type parameter, referring to what type of item the vector contains. A sized vector has an extra type parameter, and this type refers to the size of the vector. For instance, the following are valid sized vector types:

``````import Data.Vector.Sized (Vector, fromList)

vectorWith2 :: Vector 2 Int64
...
vectorWith6 :: Vector 6 Float
...``````

In the first type signature, `2` does not refer to the term 2. It refers to the type 2. That is, we’ve taken the term and promoted it to a type which has only a single value. The mechanics of how this works are confusing, but here’s the result. We can try to convert normal vectors to sized vectors. But the operation will fail if we don’t match up the size.

``````import Data.Vector.Sized (Vector, fromList)
import GHC.TypeLits (KnownNat)

-- fromList :: (KnownNat n) => [a] -> Maybe (Vector n a)

-- This results in a “Just” value!
success :: Maybe (Vector 2 Int64)
success = fromList [5,6]

-- The sizes don’t match, so we’ll get “Nothing”!
failure :: Maybe (Vector 2 Int64)
failure = fromList [3,1,5]``````

The `KnownNat` constraint allows us to specify that the type `n` refers to a single natural number. So now we can assign a type signature that encapsulates the size of the list.
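As a quick illustration of what the constraint buys us, here’s a sketch (`vectorLength` is a made-up helper) that recovers the type-level size as an ordinary value:

``````{-# LANGUAGE ScopedTypeVariables #-}

import Data.Proxy (Proxy(..))
import Data.Vector.Sized (Vector)
import GHC.TypeLits (KnownNat, natVal)

-- KnownNat lets us reflect the type-level length back to a term.
vectorLength :: forall n a. KnownNat n => Vector n a -> Integer
vectorLength _ = natVal (Proxy :: Proxy n)``````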

## A “Safe” Shape type

Now that we have a very basic understanding of dependent types, let's come up with a game plan for Tensor Flow. The first step will be to make a new type that puts the shape into the type signature. We'll make a `SafeShape` type that mimics the sized vector type. Instead of storing a single number as the type, it will store the full list of dimensions. We want to create an API something like this:

``````-- fromShape :: Shape -> Maybe (SafeShape s)

-- Results in a “Just” value
goodShape :: Maybe (SafeShape '[2, 2])
goodShape = fromShape (Shape [2,2])

-- Results in Nothing
badShape :: Maybe (SafeShape '[2,2])
badShape = fromShape (Shape [3,3,2])``````

So to do this, we first define the `SafeShape` type. This follows the example of sized vectors. See the appendix below for compiler extensions and imports used throughout this article. In particular, you want GADTs and DataKinds.

``````data SafeShape (s :: [Nat]) where
  NilShape :: SafeShape '[]
  (:--) :: KnownNat m => Proxy m -> SafeShape s -> SafeShape (m ': s)

infixr 5 :--``````

Now we can define the `toShape` function. This will take our `SafeShape` and turn it into a normal `Shape` using proxies.

``````toShape :: SafeShape s -> Shape
toShape NilShape = Shape []
toShape ((pm :: Proxy m) :-- s) = Shape (fromInteger (natVal pm) : s')
  where
    (Shape s') = toShape s``````

Now for the reverse direction, we first have to make a class `MkSafeShape`. This class encapsulates all the types that we can turn into the `SafeShape` type. We’ll define instances of this class for all lists of naturals.

``````class MkSafeShape (s :: [Nat]) where
  mkSafeShape :: SafeShape s

instance MkSafeShape '[] where
  mkSafeShape = NilShape

instance (MkSafeShape s, KnownNat m) => MkSafeShape (m ': s) where
  mkSafeShape = Proxy :-- mkSafeShape``````

Now we can define our `fromShape` function using the `MkSafeShape` class. To check if it works, we’ll compare the resulting shape to the input shape and make sure they’re equal. Note this requires us to define a simple instance of `Eq Shape`.

``````instance Eq Shape where
  (==) (Shape s) (Shape r) = s == r

fromShape :: forall s. MkSafeShape s => Shape -> Maybe (SafeShape s)
fromShape shape = if toShape myShape == shape
  then Just myShape
  else Nothing
  where
    myShape = mkSafeShape :: SafeShape s``````

Now that we’ve done this for Shape, we can create a similar type for `Tensor` that will store the shape as a type parameter.

``````data SafeTensor v a (s :: [Nat]) where
  SafeTensor :: (TensorType a) => Tensor v a -> SafeTensor v a s``````

## Using our Safe Types

So what has all this gotten us? Our next goal is to create a `safeConstant` function. This will let us create a `SafeTensor` wrapping a constant tensor and storing the shape. Remember, `constant` takes a shape and a vector without ensuring correlation between them. We want something like this:

``````safeConstant :: (TensorType a) => Vector n a -> SafeShape s -> SafeTensor Build a s
safeConstant elems shp = SafeTensor $ constant (toShape shp) (toList elems)``````

This will attach the given shape to the tensor. But there’s one piece missing. We also want to create a connection between the number of input elements and the shape. So something with shape `[3,3,2]` should force you to input a vector of length 18. And right now, there is no constraint between `n` and `s`.

We’ll add this with a type family called `ShapeProduct`. The instances will state that the correct natural type for a given list of naturals is the product of them. We define the second instance with recursion, so we'll need `UndecidableInstances`.

``````type family ShapeProduct (s :: [Nat]) :: Nat
type instance ShapeProduct '[] = 1
type instance ShapeProduct (m ': s) = m * ShapeProduct s``````
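As a sanity check, here’s a quick sketch (`checkProduct` is a throwaway name): `ShapeProduct '[3,3,2]` unfolds to `3 * (3 * (2 * 1))`, and this definition only compiles because that reduces to `18`:

``````-- A type error here would mean the family doesn't reduce to 18.
checkProduct :: Proxy (ShapeProduct '[3,3,2]) -> Proxy 18
checkProduct = id``````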

Now we’re almost done with this part! We can fix our `safeConstant` function by adding a constraint on the `ShapeProduct` between `s` and `n`.

``````safeConstant :: (TensorType a, ShapeProduct s ~ n) => Vector n a -> SafeShape s -> SafeTensor Build a s
safeConstant elems shp = SafeTensor $ constant (toShape shp) (toList elems)``````

Now we can write out a simple use of our `safeConstant` function as follows:

``````main :: IO (VN.Vector Int64)
main = runSession $ do
  let (shape1 :: SafeShape '[2,2]) = fromJust $ fromShape (Shape [2,2])
  let (elems1 :: Vector 4 Int64) = fromJust $ fromList [1,2,3,4]
  let (constant1 :: SafeTensor Build Int64 '[2,2]) = safeConstant elems1 shape1
  let (SafeTensor t) = constant1
  run t``````

We’re using `fromJust` as a shortcut here. But in a real program you would read your initial tensors in and check them as `Maybe` values. There's still the possibility for runtime failures. But this system has a couple advantages. First, it won't crash. We'll have the opportunity to handle it gracefully. Second, we do all the error checking up front. Once we've assigned types to everything, all the failure cases should be covered.
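For instance, the graceful path might look something like this sketch (the structure here is an assumption, not code from the repository):

``````-- Check both Maybe values up front; no fromJust, no crash.
safeMain :: IO ()
safeMain = do
  let maybeShape = fromShape (Shape [2,2]) :: Maybe (SafeShape '[2,2])
  let maybeElems = fromList [1,2,3,4] :: Maybe (Vector 4 Int64)
  case (maybeShape, maybeElems) of
    (Just shp, Just elems) -> do
      let (SafeTensor t) = safeConstant elems shp
      result <- runSession (run t)
      print (result :: VN.Vector Int64)
    _ -> putStrLn "Could not build the tensor: bad shape or element count!"``````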

Going back to the last example, let's change something. For instance, we could make our vector have length 3 instead of 4. We’ll now get a compile error!

``````main :: IO (VN.Vector Int64)
main = runSession $ do
  let (shape1 :: SafeShape '[2,2]) = fromJust $ fromShape (Shape [2,2])
  let (elems1 :: Vector 3 Int64) = fromJust $ fromList [1,2,3]
  let (constant1 :: SafeTensor Build Int64 '[2,2]) = safeConstant elems1 shape1
  let (SafeTensor t) = constant1
  run t

…

• Couldn't match type ‘4’ with ‘3’
    arising from a use of ‘safeConstant’
• In the expression: safeConstant elems1 shape1
  In a pattern binding:
    (constant1 :: SafeTensor Build Int64 '[2, 2])
      = safeConstant elems1 shape1``````

## Adding Type Safe Operations

Now that we’ve attached shape information to our tensors, we can define safer math operations. It's easy to write a safe addition function that ensures that the tensors have the same shape:

``````safeAdd :: (TensorType a, a /= Bool)
  => SafeTensor Build a s -> SafeTensor Build a s -> SafeTensor Build a s
safeAdd (SafeTensor t1) (SafeTensor t2) = SafeTensor (t1 `add` t2)``````

Here’s a similar matrix multiplication function. It ensures we have 2-dimensional shapes and that the dimensions work out. Notice the two tensors share the `n` dimension. It must be the column dimension of the first tensor and the row dimension of the second tensor:

``````safeMatMul :: (TensorType a, a /= Bool, a /= Int8, a /= Int16, a /= Int64, a /= Word8, a /= ByteString)
  => SafeTensor Build a '[i,n] -> SafeTensor Build a '[n,o] -> SafeTensor Build a '[i,o]
safeMatMul (SafeTensor t1) (SafeTensor t2) = SafeTensor (t1 `matMul` t2)``````

Here are these functions in action:

``````main2 :: IO (VN.Vector Float)
main2 = runSession $ do
  let (shape1 :: SafeShape '[4,3]) = fromJust $ fromShape (Shape [4,3])
  let (shape2 :: SafeShape '[3,2]) = fromJust $ fromShape (Shape [3,2])
  let (shape3 :: SafeShape '[4,2]) = fromJust $ fromShape (Shape [4,2])
  let (elems1 :: Vector 12 Float) = fromJust $ fromList [1,2,3,4,1,2,3,4,1,2,3,4]
  let (elems2 :: Vector 6 Float) = fromJust $ fromList [5,6,7,8,9,10]
  let (elems3 :: Vector 8 Float) = fromJust $ fromList [11,12,13,14,15,16,17,18]
  let (constant1 :: SafeTensor Build Float '[4,3]) = safeConstant elems1 shape1
  let (constant2 :: SafeTensor Build Float '[3,2]) = safeConstant elems2 shape2
  let (constant3 :: SafeTensor Build Float '[4,2]) = safeConstant elems3 shape3
  let (multTensor :: SafeTensor Build Float '[4,2]) = constant1 `safeMatMul` constant2
  let (addTensor :: SafeTensor Build Float '[4,2]) = multTensor `safeAdd` constant3
  let (SafeTensor finalTensor) = addTensor
  run finalTensor``````

And of course we’ll get compile errors if we use incorrect dimensions anywhere. Let’s say we change `multTensor` to use `[4,3]` as its type:

``````• Couldn't match type ‘2’ with ‘3’
Expected type: SafeTensor Build Float '[4, 3]
Actual type: SafeTensor Build Float '[4, 2]
• In the expression: constant1 `safeMatMul` constant2
…
• Couldn't match type ‘3’ with ‘2’
Expected type: SafeTensor Build Float '[4, 2]
Actual type: SafeTensor Build Float '[4, 3]
• In the expression: multTensor `safeAdd` constant3
…
• Couldn't match type ‘2’ with ‘3’
Expected type: SafeTensor Build Float '[4, 3]
Actual type: SafeTensor Build Float '[4, 2]
• In the second argument of ‘safeAdd’, namely ‘constant3’``````

## Conclusion

In this exercise we got deep into the weeds of one of the most difficult topics to learn about in Haskell. Dependent types will make your head spin at first. But we saw a concrete example of how they can allow us to detect problematic code at compile time. They are a form of documentation that also enables us to verify that our code is correct in certain ways.

Types do not replace tests (especially behavioral tests). But in this instance there are at least a few different test cases we don’t need to worry about too much. Next week, we’ll see how we can apply these principles to verifying placeholders.

If you want to learn more about the nuts and bolts of using Haskell Tensor Flow, you should check out our Tensor Flow Guide. It will guide you through the basics of adding Tensor Flow to a simple Stack project.

Maybe you’ve never used Haskell before but I’ve convinced you that dependent types are the future. If you want to try it out, download our Getting Started Checklist. You can also learn how to create and organize Haskell projects using Stack! Check out our Stack mini-course!

## Appendix: Extensions and Imports

``````{-# LANGUAGE GADTs                #-}
{-# LANGUAGE DataKinds            #-}
{-# LANGUAGE KindSignatures       #-}
{-# LANGUAGE TypeOperators        #-}
{-# LANGUAGE ScopedTypeVariables  #-}
{-# LANGUAGE TypeFamilies         #-}
{-# LANGUAGE UndecidableInstances #-}

import           Data.ByteString (ByteString)
import           Data.Constraint (Constraint)
import           Data.Int (Int64, Int8, Int16)
import           Data.Maybe (fromJust)
import           Data.Proxy (Proxy(..))
import qualified Data.Vector as VN
import           Data.Vector.Sized (Vector(..), toList, fromList)
import           Data.Word (Word8)
import           GHC.TypeLits

import           TensorFlow.Core
import           TensorFlow.Ops (constant, add, matMul)
import           TensorFlow.Session (runSession, run)``````

# Deeper Still: Convolutional Neural Networks

Two weeks ago, we began our machine learning study in earnest by constructing a full neural network. But this network was still quite simple by deep learning standards. In this article, we’re going to tackle a much more difficult problem: image recognition. Of course, we’ll still be using a well-known data set with well-known results, so this is only the tip of the iceberg. We'll be using the MNIST data set. This set classifies images of handwritten digits as the numbers 0-9. This problem is so well known that the folks at Tensor Flow refer to it as the “Hello World” of machine learning.

We’ll start this problem by using a very similar approach to what we used with the Iris data set. We’ll make a fully-connected neural network with two layers, and then use the “Adam” optimizer. This will give us some decent results by our beginner standards. But MNIST is a well known problem with a very large data set. So we’re going to hold ourselves to a higher standard of accuracy this time. This will force us to use some more advanced techniques. But to start with, let’s examine what we need to change to adapt our Iris model to work for the MNIST problem. As with the last couple weeks, the code for all this is on Github if you want to follow along.

## Re-use and Recycle!

Generally, we can re-use most of the code we had with Iris, which is good news! We still have to make a few adjustments here and there though. First, we’ll use some different constants. We’ll use `mnistFeatures` in place of `irisFeatures`, and `mnistLabels` instead of `irisLabels`. We’ll also bump up the size of our hidden layer and the number of samples we’ll draw on each iteration:

``````mnistFeatures :: Int64
mnistFeatures = 784

mnistLabels :: Int64
mnistLabels = 10

numHiddenUnits :: Int64
numHiddenUnits = 1024

sampleSize :: Int
sampleSize = 1000``````

We’ll also change our model to use `Word8` as the result type instead of `Int64`.

``````data Model = Model
  { train :: TensorData Float
          -> TensorData Word8 -- Used to be Int64
          -> Session ()
  , errorRate :: TensorData Float
              -> TensorData Word8 -- Used to be Int64
              -> SummaryTensor
              -> Session (Float, ByteString)
  }``````

Now we have to change how we get our input data. Our data isn’t in CSV format this time. We’ll use helper functions from the Tensor Flow library to extract the images and labels:

``````import TensorFlow.Examples.MNIST.Parse (readMNISTSamples, readMNISTLabels)
…
runDigits :: FilePath -> FilePath -> FilePath -> FilePath -> IO ()
runDigits trainImageFile trainLabelFile testImageFile testLabelFile =
  withEventWriter eventsDir $ \eventWriter -> runSession $ do

    -- trainingImages, testImages :: [Vector Word8]
    trainingImages <- liftIO $ readMNISTSamples trainImageFile
    testImages <- liftIO $ readMNISTSamples testImageFile

    -- trainingLabels, testLabels :: [Word8]
    trainingLabels <- liftIO $ readMNISTLabels trainLabelFile
    testLabels <- liftIO $ readMNISTLabels testLabelFile

    -- trainingRecords, testRecords :: Vector (Vector Word8, Word8)
    let trainingRecords = fromList $ zip trainingImages trainingLabels
    let testRecords = fromList $ zip testImages testLabels
    ...``````

Our “input” type consists of vectors of `Word8` elements. These represent the intensity of various pixels. Our “output” type is `Word8`, referring to the actual labels (0-9). We read the images and labels from separate files. Then we zip them together to pass to our processing functions. We’ll have to make a few changes to these processing functions for this data set. First, we have to generalize the type of our randomization function:

``````-- Used to be IrisRecord Specific
chooseRandomRecords :: Vector a -> IO (Vector a)``````
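For reference, here’s a rough sketch of what an implementation could look like (the real version lives in the `Processing` module, so this body is an assumption):

``````import Control.Monad (replicateM)
import qualified Data.Vector as V
import System.Random (randomRIO)

-- Draw sampleSize random indices (with replacement) and keep those records.
chooseRandomRecords :: V.Vector a -> IO (V.Vector a)
chooseRandomRecords records = do
  let numRecords = V.length records
  indices <- replicateM sampleSize (randomRIO (0, numRecords - 1))
  return $ V.fromList $ map (records V.!) indices``````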

Next we have to write a new encoding function that will put our data into the `TensorData` format. This looks like our old version, except dealing with the new tuple type instead of the `IrisRecord`.

``````convertDigitRecordsToTensorData
  :: Vector (Vector Word8, Word8)
  -> (TensorData Float, TensorData Word8)
convertDigitRecordsToTensorData records = (input, output)
  where
    numRecords = Data.Vector.length records
    input = encodeTensorData [fromIntegral numRecords, mnistFeatures]
      (fromList $ concatMap recordToInputs records)
    output = encodeTensorData [fromIntegral numRecords] (snd <$> records)
    recordToInputs :: (Vector Word8, Word8) -> [Float]
    recordToInputs rec = fromIntegral <$> (toList . fst) rec``````

And then we just have to substitute our new functions and parameters in, and we’ll be able to run our digit trainer!

``````Current training error 89.8
Current training error 19.300001
Current training error 13.300001
Current training error 11.199999
Current training error 8.700001
Current training error 6.5999985
Current training error 6.999999
Current training error 5.199999
Current training error 4.400003
Current training error 5.000001
Current training error 2.3000002

test error 6.830001``````

So our accuracy is 93.2%. This seems like an alright number. But imagine being a post office and having 6.8% of your mail sorted into the wrong ZIP code! (This was the original use case of this data set.) So let’s see if we can do better.

## Convolution and Max Pooling

Now we could train our model longer. This will tend to improve our error rate. But we can also help ourselves by making our model more complex. The fundamental flaw with what we’ve got so far is that it doesn’t account for the 2D nature of the images. This means we're losing a ton of useful information. So the first thing we'll do is treat our images as being 28x28 tensors instead of 1x784. This way, our model can pick out specific areas that are significant for the identification of the digit.

One thing we want to account for is that our image might not be in the center of the frame. To account for this, we're going to apply convolution. When using convolution, we break the image into many different overlapping tiles. In our case, we’ll make our strides size “1” in every direction, and we’ll use a patch size of 5x5. So this means we’ll center a 5x5 tile around each different pixel in our image, and then come up with a score for it. That score tells us if this part of the image contains any important information. We can represent this score as a vector with many features.

So with 2D convolution, we'll be dealing with 4-dimensional tensors. The first dimension is the sample size. The next two dimensions are the shape of the image. The final dimension is the number of features of the "score" for each part of the image. So each original image starts out with a single feature for the "score" of each pixel. This score is the actual intensity of that pixel! Then each layer of convolution will act as a mini neural network per pixel, making as many features as we want.

The different sliding windows correspond to scores we store in the next layer. This example uses 3x3 patches; we'll use 5x5.

Max pooling is a form of down-sampling. After our first convolution step, we’ll have scores on the 28x28 image. We’ll use 2x2 max-pooling, meaning we divide each image into 2x2 squares. Then we’ll make a new layer that is 14x14, using only the “best” score from each 2x2 box. This makes our model more efficient while keeping the most important information.

Simple demonstration of max-pooling
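To keep the size bookkeeping straight, here’s a small sketch of the arithmetic (assuming the "SAME" padding we’ll use below, so convolution preserves the spatial size):

``````-- With "SAME" padding, a stride-1 convolution keeps the spatial size.
convOutputDim :: Int -> Int -> Int
convOutputDim inputDim stride = (inputDim + stride - 1) `div` stride

-- 2x2 max pooling halves each spatial dimension.
poolOutputDim :: Int -> Int
poolOutputDim inputDim = inputDim `div` 2

-- convOutputDim 28 1 == 28, poolOutputDim 28 == 14,
-- and a second conv/pool round takes us down to 7x7.``````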

## Implementing Convolutional Layers

We’ll do two rounds of convolution and max pooling. So we’ll make a function that creates a layer that performs these two steps. This will look a lot like our other neural network layer. We’ll take parameters for the size of the input and output channels of the layer, as well as the tensor itself. So our first step will be to create the weights and bias tensors using these parameters:

``````patchSize :: Int64
patchSize = 5

buildConvPoolLayer :: Int64 -> Int64 -> Tensor v Float -> Text
  -> Build (Variable Float, Variable Float, Tensor Build Float)
buildConvPoolLayer inputChannels outputChannels input layerName = withNameScope layerName $ do
  weights <- truncatedNormal (vector weightsShape)
    >>= initializedVariable
  bias <- truncatedNormal (vector [outputChannels]) >>= initializedVariable
  ...
  where
    weightsShape :: [Int64]
    weightsShape = [patchSize, patchSize, inputChannels, outputChannels]``````

Now we’ll want to call our convolution and max pooling functions. These are still a little rough around the edges (the Haskell library is still quite young). The C versions of these functions have many optional, named attributes. For the moment there don’t seem to be any functions that use normal Haskell values for these arguments. Instead, we’ll be using `OpAttr` values, which assign bytestring names to values.

``````where
  ...
  convStridesAttr = opAttr "strides" .~ ([1,1,1,1] :: [Int64])
  poolStridesAttr = opAttr "strides" .~ ([1,2,2,1] :: [Int64])
  poolKSizeAttr = opAttr "ksize" .~ ([1,2,2,1] :: [Int64])
  paddingAttr = opAttr "padding" .~ ("SAME" :: ByteString)
  dataFormatAttr = opAttr "data_format" .~ ("NHWC" :: ByteString)
  convAttrs = convStridesAttr . paddingAttr . dataFormatAttr
  poolAttrs = poolKSizeAttr . poolStridesAttr . paddingAttr . dataFormatAttr``````

The `strides` argument for convolution refers to how much we shift the window each time. For pooling, the `ksize` argument sets the size of the windows we pool over (2x2 here), while its `strides` argument shifts the window by that same amount so the windows don't overlap. Now that we have our attributes, we can call the library functions `conv2D'` and `maxPool'`. This gives our resulting tensor. We also throw in a call to `relu` between these steps.

``````buildConvPoolLayer :: Int64 -> Int64 -> Tensor v Float -> Text
  -> Build (Variable Float, Variable Float, Tensor Build Float)
buildConvPoolLayer inputChannels outputChannels input layerName = withNameScope layerName $ do
  weights <- truncatedNormal (vector weightsShape)
    >>= initializedVariable
  bias <- truncatedNormal (vector [outputChannels]) >>= initializedVariable
  let conv = conv2D' convAttrs input (readValue weights)
  let results = maxPool' poolAttrs (relu conv)
  return (weights, bias, results)
  where
    ...``````

## Modifying our Model

Now we’ll make a few updates to our model and we’ll be in good shape. First, we need to reshape our input data to be 4-dimensional. Then, we’ll apply the two convolution/pooling layers:

``````imageDimen :: Int32
imageDimen = 28

createModel :: Build Model
createModel = do
  let batchSize = -1 -- Allows variable sized batches
  let conv1OutputChannels = 32
  let conv2OutputChannels = 64
  let denseInputSize = 7 * 7 * 64 :: Int32 -- 3136
  let numHiddenUnits = 1024

  inputs <- placeholder [batchSize, mnistFeatures]
  outputs <- placeholder [batchSize]

  let inputImage = reshape inputs (vector [batchSize, imageDimen, imageDimen, 1])

  (convWeights1, convBiases1, convResults1) <-
    buildConvPoolLayer 1 conv1OutputChannels inputImage "convLayer1"
  (convWeights2, convBiases2, convResults2) <-
    buildConvPoolLayer conv1OutputChannels conv2OutputChannels convResults1 "convLayer2"``````

Once we’re done with that, we’ll apply two fully-connected (dense) layers as we did before. Note we'll reshape our result from four dimensions back down to two:

``````let denseInput = reshape convResults2 (vector [batchSize, denseInputSize])
(denseWeights1, denseBiases1, denseResults1) <-
  buildNNLayer (fromIntegral denseInputSize) numHiddenUnits denseInput "denseLayer1"
let rectifiedDenseResults = relu denseResults1
(denseWeights2, denseBiases2, denseResults2) <-
  buildNNLayer numHiddenUnits mnistLabels rectifiedDenseResults "denseLayer2"``````

And after that we can treat the rest of the model the same. We'll update the parameter names and add the new weights and biases to the `params` that the model can change.
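As a rough sketch (assuming a `loss` tensor built from `denseResults2` and `outputs` as in the Iris model, and the `adam` minimizer from `TensorFlow.Minimize`), that step might look like:

``````-- Gather every trainable variable so the optimizer updates them all.
let params = [ convWeights1, convBiases1, convWeights2, convBiases2
             , denseWeights1, denseBiases1, denseWeights2, denseBiases2 ]
trainStep <- minimizeWith adam loss params``````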

As a review, let’s look at the dimensions of each of the intermediate tensors here. Then we can see the restrictions on the dimensions of the different operations. Each convolution step takes two four-dimensional tensors. The final dimension of argument 1 must match the third dimension of argument 2. Then the result will swap in the final dimension of argument 2. Meanwhile, pooling with a 2x2 stride size will take this resulting 4-dimensional tensor and halve each of the inner dimensions.

``````input: n x 784
inputImage: n x 28 x 28 x 1
convWeights1: 5 x 5 x 1 x 32
convBias1: 32
conv (first layer): n x 28 x 28 x 32
convResults1: n x 14 x 14 x 32
convWeights2:  5 x 5 x 32 x 64
conv (second layer): n x 14 x 14 x 64
convResults2: n x 7 x 7 x 64
denseInput: n x 3136
denseWeights1: 3136 x 1024
denseBias1: 1024
denseResults1: n x 1024
denseWeights2: 1024 x 10
denseBias2: 10
denseResults2: n x 10``````

So for each input, we’ll have a probability for each of the 10 possible digits. We pick the greatest of these as the chosen label.
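As a quick sketch of that final pick (`chooseLabel` is an illustrative name):

``````import qualified Data.Vector as V

-- The predicted digit is the index of the highest probability.
chooseLabel :: V.Vector Float -> Int
chooseLabel = V.maxIndex``````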

## Results

We’ll run our model again, only this time we’ll use a smaller sample size (100 per training iteration). This allows us to train for more iterations (20000). This takes quite a while to train, but we get these results (printed every 1000 iterations).

``````Current training error 91.0
Current training error 6.0
Current training error 2.9999971
Current training error 2.9999971
Current training error 0.0
Current training error 0.0
Current training error 0.99999905
Current training error 0.0
Current training error 0.0
Current training error 0.0
Current training error 0.0
Current training error 0.0
Current training error 0.0
Current training error 0.0
Current training error 0.0
Current training error 0.0
Current training error 0.0
Current training error 0.0
Current training error 0.0
Current training error 0.0
Current training error 0.0

test error 1.1799991``````

Not too bad! Once it got going, we saw very few training errors, though we still ended up with a model that was a tad overfit. The Tensor Flow MNIST expert tutorial suggests using a dropout factor, which helps diminish the effects of overfitting. But this option isn’t available to us yet in Haskell. Still, a test error of about 1.18% means we got close to 99% accuracy, which is a success for us!

And here’s what our final graph looks like. Notice the extra layers we added for convolution:

## Conclusion

So that’s it for convolutional neural networks! Our goal was to adapt our previous neural network model to recognize digits. Not only did we pick a harder problem, but we also wanted higher accuracy. We achieved this by using more advanced machine learning techniques. Convolution allowed us to use the full 2D nature of the image. It also checked for the digit no matter where it was in the image. Max pooling enabled us to make our algorithm more efficient while keeping the most important information.

If you’re itching to see what else Haskell can do with Tensor Flow, check out our Tensor Flow Guide. It’ll walk you through some of the trickier parts of getting the library working on your local machine. It will also go through the most important information you need to know about the types in this library.

If you’re new to Haskell, here are a couple more resources you can dig into before you try your hand at Tensor Flow. First, there’s our Getting Started Checklist. This will point you toward some helpful resources for learning the language. Next, you can check out our Stack mini-course so you can learn how to organize a project! If you want to use Tensor Flow in Haskell, Stack is a must!

# Putting the Flow in Tensor Flow!

Last week we built our first neural network and used it on a real machine learning problem. We took the Iris data set and built a classifier that took in various flower measurements. It determined, with decent accuracy, the type of flower the measurements referred to.

But we’ve still only seen half the story of Tensor Flow! We’ve constructed many tensors and combined them in interesting ways. We can imagine what is going on with the “flow”, but we haven’t seen a visual representation of that yet.

We’re in luck though, thanks to the Tensor Board application. With it, we can visualize the computation graph we've created. We can also track certain values throughout our program run. In this article, we’ll take our Iris example and show how we can add Tensor Board features to it. Here's the Github repo with all the code so you can follow along!

## Add an Event Writer

The first thing to understand about Tensor Board is that it gets its data from a source directory. While we’re running our system, we have to direct it to write events to that directory. This will allow Tensor Board to know what happened in our training run.

``````eventsDir :: FilePath
eventsDir = "/tmp/tensorflow/iris/logs/"

runIris :: FilePath -> FilePath -> IO ()
runIris trainingFile testingFile = withEventWriter eventsDir $ \eventWriter -> runSession $ do
  ...``````

By itself, this doesn’t write anything to that directory though! To understand the consequences of this, let’s boot up Tensor Board.

## Running Tensor Board

Running our executable again doesn't bring up Tensor Board. It merely logs the information that Tensor Board uses. To actually see that information, we’ll run the `tensorboard` command.

``````>> tensorboard --logdir='/tmp/tensorflow/iris/logs'
Starting TensorBoard 47 at http://0.0.0.0:6006``````

Then we can point our web browser at the correct port. Since we haven't written anything to the log directory yet, there won’t be much for us to see other than some pretty graphics. So let’s start by logging our graph. This is actually quite easy! Remember our model? We can use the `logGraph` function combined with our event writer so we can see it.

``````model <- build createModel
logGraph eventWriter createModel``````

Now when we refresh Tensor Board, we’ll see our system’s graph.

What the heck is going on here?

But it’s very large and very confusing. The node names don't tell us much, and it’s not clear what data is going where. Plus, we have no idea what’s going on with our error rate or anything like that. Let’s make a couple adjustments to fix this.

So the first step is to actually specify some measurements that we’ll have Tensor Board plot for us. One node we can use is a “scalar summary”. This provides us with a summary of a particular value over the course of our training run. Let’s do this with our `errorRate` node. We can use the simple `scalarSummary` function.

``````errorRate_ <- render $ 1 - (reduceMean (cast correctPredictions))
scalarSummary "Error" errorRate_``````

The second type of summary is a `histogram` summary. We use this on a particular tensor to see the distribution of its values over the course of the run. Let’s do this with our second set of weights. We need to use `readValue` to go from a `Variable` to a `Tensor`.

``````(finalWeights, finalBiases, finalResults) <-
  buildNNLayer numHiddenUnits irisLabels rectifiedHiddenResults
histogramSummary "Weights" (readValue finalWeights)``````

So let’s run our program and open Tensor Board again. We would expect to see these new values show up under the `Scalars` and `Histograms` tabs. But they don’t. This is because we still need to write these results to our event writer. And this turns out to be a little complicated. First, before we start training, we have to create a tensor representing all our summaries.

``````logGraph eventWriter createModel
summaryTensor <- build mergeAllSummaries``````

Now if we had no placeholders, we could run this tensor whenever we wanted, and it would output the values. But our summary tensors depend on the input placeholders, which complicates the matter. So here’s what we’ll do. We’ll only write out the summaries when we check our error rate (every 100 steps). To do this, we have to change our model's error rate function to take the summary tensor as an extra argument. We’ll also add a `ByteString` to its return value, alongside the original `Float`.

``````data Model = Model
  { train :: TensorData Float
          -> TensorData Int64
          -> Session ()
  , errorRate :: TensorData Float
              -> TensorData Int64
              -> SummaryTensor
              -> Session (Float, ByteString)
  }``````

Within our model definition, we’ll use this extra parameter. It will run both the `errorRate_` tensor AND the summary tensor together with the feeds:

``````return $ Model
  { train = ...
  , errorRate = \inputFeed outputFeed summaryTensor -> do
      (errorTensorResult, summaryTensorResult) <- runWithFeeds
        [ feed inputs inputFeed
        , feed outputs outputFeed
        ]
        (errorRate_, summaryTensor)
      return (unScalar errorTensorResult, unScalar summaryTensorResult)
  }``````

Now we need to modify our calls to `errorRate` below. We’ll pass the summary tensor as an argument, and get the bytes as output. We’ll write it to our event writer (after decoding), and then we’ll be done!

``````-- Training
forM_ ([0..1000] :: [Int]) $ \i -> do
  trainingSample <- liftIO $ chooseRandomRecords trainingRecords
  let (trainingInputs, trainingOutputs) = convertRecordsToTensorData trainingSample
  (train model) trainingInputs trainingOutputs
  when (i `mod` 100 == 0) $ do
    (err, summaryBytes) <- (errorRate model) trainingInputs trainingOutputs summaryTensor
    let summary = decodeMessageOrDie summaryBytes
    liftIO $ putStrLn $ "Current training error " ++ show (err * 100)
    logSummary eventWriter (fromIntegral i) summary

liftIO $ putStrLn ""

-- Testing
let (testingInputs, testingOutputs) = convertRecordsToTensorData testRecords
(testingError, _) <- (errorRate model) testingInputs testingOutputs summaryTensor
liftIO $ putStrLn $ "test error " ++ show (testingError * 100)``````

Now we can see what our summaries look like by running Tensor Board again!

Scalar Summary of our Error Rate

Histogram summary of our final weights.

## Annotating our Graph

Now let’s look back to our graph. It’s still a bit confusing. We can clean it up a lot by creating “name scopes”. A name scope is part of the graph that we set aside under a single name. When Tensor Board generates our graph, it will create one big block for the scope. We can then zoom in and examine the individual nodes if we want.

We’ll make three different scopes. First, we’ll make a scope for each of the hidden layers of our neural network. This is quite easy, since we already have a function for creating these. All we have to do is make the function take an extra parameter for the name of the scope we want. Then we wrap the whole function within the `withNameScope` function.

``````buildNNLayer :: Int64 -> Int64 -> Tensor v Float -> Text
             -> Build (Variable Float, Variable Float, Tensor Build Float)
buildNNLayer inputSize outputSize input layerName = withNameScope layerName $ do
  weights <- truncatedNormal (vector [inputSize, outputSize]) >>= initializedVariable
  bias <- truncatedNormal (vector [outputSize]) >>= initializedVariable
  let results = (input `matMul` readValue weights) `add` readValue bias
  return (weights, bias, results)``````

We supply our name further down in the code:

``````(hiddenWeights, hiddenBiases, hiddenResults) <-
  buildNNLayer irisFeatures numHiddenUnits inputs "layer1"
let rectifiedHiddenResults = relu hiddenResults
(finalWeights, finalBiases, finalResults) <-
  buildNNLayer numHiddenUnits irisLabels rectifiedHiddenResults "layer2"``````

Now we’ll add a scope around all our error calculations. First, we combine these into an action wrapped in `withNameScope`. Then, observing that we need the `errorRate_` and `train_` steps, we return those from the block. That’s it!

``````(errorRate_, train_) <- withNameScope "error_calculation" $ do
  actualOutput <- render $ cast $ argMax finalResults (scalar (1 :: Int64))
  let correctPredictions = equal actualOutput outputs
  er <- render $ 1 - (reduceMean (cast correctPredictions))
  scalarSummary "Error" er

  let outputVectors = oneHot outputs (fromIntegral irisLabels) 1 0
  let loss = reduceMean $ fst $ softmaxCrossEntropyWithLogits finalResults outputVectors
  let params = [hiddenWeights, hiddenBiases, finalWeights, finalBiases]
  tr <- minimizeWith adam loss params
  return (er, tr)``````

Now when we look at our graph, we see that it’s divided into three parts: our two layers, and our error calculation. All the information flows among these three parts (as well as the "Adam" optimizer portion).

Much Better

## Conclusion

By default, Tensor Board graphs can look a little messy. But by adding a little more information to the nodes and using scopes, you can paint a much clearer picture. You can see how the data flows from one end of the application to the other. We can also use summaries to track important information about our graph. We’ll use this most often for the loss function or error rate. Hopefully, we'll see it decline over time.

Next week we’ll add some more complexity to our neural networks. We'll see new tensors for convolution and max pooling. This will allow us to solve the more difficult MNIST digit recognition problem. Stay tuned!

If you’re itching to try out some Tensor Board functionality for yourself, check out our in-depth Tensor Flow guide. It goes into more detail about the practical aspects of using this library. If you want to get the Haskell Tensor Flow library running on your local machine, check it out! Trust me, it's a little complicated, unless you're a Stack wizard already!

And if this is your first exposure to Haskell, try it out! Take a look at our guide to getting started with the language!

# Digging in Deep: Solving a Real Problem with Haskell Tensor Flow

Last week we got acquainted with the core concepts of Tensor Flow. We learned about the differences between constants, placeholders, and variable tensors. Both the Haskell and Python bindings have functions to represent these. The Python version was a bit simpler though. Once we had our tensors, we wrote a program that “learned” a simple linear equation.

This week, we’re going to solve an actual machine learning problem. We’re going to use the Iris data set, which contains measurements of different Iris flowers. Each flower belongs to one of three species. Our program will "learn" a function that chooses the species based on the measurements. This function will involve a fully-connected neural network.

## Formatting our Input

The first step in pretty much any machine learning problem is data processing. After all, our data doesn’t magically get resolved into Haskell data types. Luckily, Cassava is a great library to help us out. The Iris data set consists of data in .csv files that each have a header line and then a series of records. They look a bit like this:

``````120,4,setosa,versicolor,virginica
6.4,2.8,5.6,2.2,2
5.0,2.3,3.3,1.0,1
4.9,2.5,4.5,1.7,2
4.9,3.1,1.5,0.1,0
...``````

Each line contains one record. A record has four flower measurements, and a final label. In this case, we have three types of flowers we are trying to classify: Iris Setosa, Iris Versicolor, and Iris Virginica. So the last column contains the numbers 0, 1, and 2, corresponding to these respective classes.

Let's create a data type representing each record. Then we can parse the file line-by-line. Our `IrisRecord` type will contain the feature data and the resulting label. This type will act as a bridge between our raw data and the tensor format we’ll need to run our learning algorithm. We’ll derive the “Generic” typeclass for our record type, and use this to get `FromRecord`. Once our type has an instance for `FromRecord`, we can parse it with ease. As a note, throughout this article, I’ll be omitting the imports section from the code samples. I’ve included a full list of imports from these files as an appendix at the bottom. We'll also be using the `OverloadedLists` extension throughout.

``````{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE OverloadedLists #-}

...

data IrisRecord = IrisRecord
  { field1 :: Float
  , field2 :: Float
  , field3 :: Float
  , field4 :: Float
  , label  :: Int64
  }
  deriving (Generic)

instance FromRecord IrisRecord``````

Now that we have our type, we’ll write a function, `readIrisFromFile`, that will read our data in from a CSV file.

``````readIrisFromFile :: FilePath -> IO (Vector IrisRecord)
readIrisFromFile fp = do
  contents <- readFile fp
  let contentsAsBs = pack contents
  let results = decode HasHeader contentsAsBs :: Either String (Vector IrisRecord)
  case results of
    Left err -> error err
    Right records -> return records``````

We won’t want to always feed our entire data set into our training system. So given a whole slew of these items, we should be able to pick out a random sample.

``````sampleSize :: Int
sampleSize = 10

chooseRandomRecords :: Vector IrisRecord -> IO (Vector IrisRecord)
chooseRandomRecords records = do
  let numRecords = Data.Vector.length records
  chosenIndices <- take sampleSize <$> shuffleM [0..(numRecords - 1)]
  return $ fromList $ map (records !) chosenIndices``````

Once we’ve selected our vector of records to use for each run, we’re still not done. We need to take these records and transform them into the `TensorData` that we’ll feed into our algorithm. We create items of `TensorData` by feeding in a shape and then a 1-dimensional vector of values. First, we need to know the shapes of our input and output. Both of these depend on the number of rows in the sample. The “input” will have a column for each of the four features in our set. The output meanwhile will have a single column for the label values.

``````irisFeatures :: Int64
irisFeatures = 4

irisLabels :: Int64
irisLabels = 3

convertRecordsToTensorData :: Vector IrisRecord -> (TensorData Float, TensorData Int64)
convertRecordsToTensorData records = (input, output)
  where
    numRecords = Data.Vector.length records
    input = encodeTensorData [fromIntegral numRecords, irisFeatures] (undefined)
    output = encodeTensorData [fromIntegral numRecords] (undefined)``````

Now all we need to do is take the various records and turn them into one dimensional vectors to encode. Here’s the final function:

``````convertRecordsToTensorData :: Vector IrisRecord -> (TensorData Float, TensorData Int64)
convertRecordsToTensorData records = (input, output)
  where
    numRecords = Data.Vector.length records
    input = encodeTensorData [fromIntegral numRecords, irisFeatures]
      (fromList $ concatMap recordToInputs records)
    output = encodeTensorData [fromIntegral numRecords] (label <$> records)
    recordToInputs :: IrisRecord -> [Float]
    recordToInputs rec = [field1 rec, field2 rec, field3 rec, field4 rec]``````

## Neural Network Basics

Now that we’ve got that out of the way, we can start writing our model. Remember, we want to perform two different actions with our model. First, we want to be able to take our training input and train the weights. Second, we want to be able to pass a test data set and determine the error rate. We can represent these two different functions as a single `Model` object. Remember the `Session` monad, where we run all our Tensor Flow activities. The training will run an action that alters the variables but returns nothing. The error rate calculation will return us a float value.

``````data Model = Model
  { train :: TensorData Float -- Training input
          -> TensorData Int64 -- Training output
          -> Session ()
  , errorRate :: TensorData Float -- Test input
              -> TensorData Int64 -- Test output
              -> Session Float
  }``````

Now we’re going to build a fully-connected neural network. We’ll have 4 input units (1 for each of the different features), and then we’ll have 3 output units (1 for each of the classes we’re trying to represent). In the middle, we’ll use a hidden layer consisting of 10 units. This means we’ll need two sets of weights and biases. We’ll write a function that, when given dimensions, will give us the variable tensors for each layer. We want the weight and bias tensors, plus the result tensor of the layer.

``````buildNNLayer :: Int64 -> Int64 -> Tensor v Float
             -> Build (Variable Float, Variable Float, Tensor Build Float)
buildNNLayer inputSize outputSize input = do
  weights <- truncatedNormal (vector [inputSize, outputSize]) >>= initializedVariable
  bias <- truncatedNormal (vector [outputSize]) >>= initializedVariable
  let results = (input `matMul` readValue weights) `add` readValue bias
  return (weights, bias, results)``````

We do this in the `Build` monad, which allows us to construct variables, among other things. We’ll use a `truncatedNormal` distribution for all our variables to keep things simple. We specify the size of each variable in a `vector` tensor, and then initialize them. Then we’ll create the resulting tensor by multiplying the input by our weights and adding the bias.

## Constructing our Model

Now we’ll start building our `Model` object, again within the `Build` monad. We begin by specifying our input and output placeholders, as well as the number of hidden units. We’ll also use a `batchSize` of -1 to account for the fact that we want a variable number of input samples.

``````irisFeatures :: Int64
irisFeatures = 4

irisLabels :: Int64
irisLabels = 3
-- ^^ From above

createModel :: Build Model
createModel = do
  let batchSize = -1 -- Allows variable sized batches
  let numHiddenUnits = 10
  inputs <- placeholder [batchSize, irisFeatures]
  outputs <- placeholder [batchSize]``````

Then we’ll get the nodes for the two layers of variables, as well as their results. Between the layers, we’ll add a “rectifier” activation function `relu`:

``````(hiddenWeights, hiddenBiases, hiddenResults) <-
  buildNNLayer irisFeatures numHiddenUnits inputs
let rectifiedHiddenResults = relu hiddenResults
(finalWeights, finalBiases, finalResults) <-
  buildNNLayer numHiddenUnits irisLabels rectifiedHiddenResults``````

Now we have to get the inferred classes of each output. This means calling `argMax` to take the class with the highest probability. We’ll also `cast` the vector and then `render` it. These are some Haskell-Tensor-Flow specific terms for getting tensors to the right type. Next, we compare that against our output placeholders to see how many we got correct. Then we’ll make a node for calculating the error rate for this run.

``````actualOutput <- render $ cast $ argMax finalResults (scalar (1 :: Int64))
let correctPredictions = equal actualOutput outputs
errorRate_ <- render $ 1 - (reduceMean (cast correctPredictions))``````

Now we have to actually do the work of training. First, we’ll make `oneHot` vectors for our expected outputs. This means converting the label `0` into the vector `[1,0,0]`, and so on. We’ll compare these values against our results (before we took the max), and this gives us our loss function. Then we will make a list of the parameters we want to train. The `adam` optimizer will minimize our loss function while modifying the params.

``````let outputVectors = oneHot outputs (fromIntegral irisLabels) 1 0
let loss = reduceMean $ fst $ softmaxCrossEntropyWithLogits finalResults outputVectors
let params = [hiddenWeights, hiddenBiases, finalWeights, finalBiases]
train_ <- minimizeWith adam loss params``````

Now we’ve got our `errorRate_` and `train_` nodes ready. There's one last step here. We have to plug in values for the placeholders and create functions that will take in the tensor data. Remember the `feed` pattern from last week? We use it again here. Finally, our model is complete!

``````return $ Model
  { train = \inputFeed outputFeed ->
      runWithFeeds
        [ feed inputs inputFeed
        , feed outputs outputFeed
        ]
        train_
  , errorRate = \inputFeed outputFeed -> unScalar <$>
      runWithFeeds
        [ feed inputs inputFeed
        , feed outputs outputFeed
        ]
        errorRate_
  }``````

## Tying it all together

Now we’ll write our main function that will run the session. It will have three stages. In the preparation stage, we’ll load our data, and use the `build` function to get our model. Then we’ll train our model for 1000 steps by choosing samples and converting our records to data. Every 100 steps, we'll print the output. Finally, we’ll determine the resulting error ratio by using the test data.

``````runIris :: FilePath -> FilePath -> IO ()
runIris trainingFile testingFile = runSession $ do
  -- Preparation
  trainingRecords <- liftIO $ readIrisFromFile trainingFile
  testRecords <- liftIO $ readIrisFromFile testingFile
  model <- build createModel

  -- Training
  forM_ ([0..1000] :: [Int]) $ \i -> do
    trainingSample <- liftIO $ chooseRandomRecords trainingRecords
    let (trainingInputs, trainingOutputs) = convertRecordsToTensorData trainingSample
    (train model) trainingInputs trainingOutputs
    when (i `mod` 100 == 0) $ do
      err <- (errorRate model) trainingInputs trainingOutputs
      liftIO $ putStrLn $ "Current training error " ++ show (err * 100)

  liftIO $ putStrLn ""

  -- Testing
  let (testingInputs, testingOutputs) = convertRecordsToTensorData testRecords
  testingError <- (errorRate model) testingInputs testingOutputs
  liftIO $ putStrLn $ "test error " ++ show (testingError * 100)

  return ()``````

## Results

So when we actually run all this, we’ll get the following results, ending with the error on our test set.

``````Current training error 60.000004
Current training error 30.000002
Current training error 39.999996
Current training error 19.999998
Current training error 10.000002
Current training error 10.000002
Current training error 19.999998
Current training error 19.999998
Current training error 10.000002
Current training error 10.000002
Current training error 0.0

test error 3.333336``````

Our test sample size was 30, so this means we got 29/30 this time around. Results change from run to run though (I obviously used the best results I found). Since our sample size is so small, there's a lot of variance here (sometimes the error rate is more like 40%). Generally we’ll want to train longer on a larger data set so that we get more consistent results, but this is a good start.

## Conclusion

In this article we went over the basics of making a neural network using the Haskell Tensor Flow library. We made a fully-connected neural network and fed in real data we parsed using the `Cassava` library. This network was able to learn a function to classify flowers from the Iris data set. Considering the small amount of data, we got some good results.

Come back next week, where we’ll see how we can add some more summary information to our tensor flow graph. We’ll use the tensor board application to view our graph in a visual format.

For more details on installing the Haskell Tensor Flow system, check out our In-Depth Tensor Flow Tutorial. It should walk you through the important steps in running the code on your own machine.

Perhaps you’ve never tried Haskell before at all, and want to see what it’s like. Maybe I’ve convinced you that Haskell is in fact the future of AI. In that case, you should check out our Getting Started Checklist for some tools on starting with the language.

## Appendix: All Imports

Documentation for Haskell Tensor Flow is still a major work in progress. So I want to make sure I explicitly list the modules you need to import for all the different functions we used here.

``````import Control.Monad (forM_, when)
import Data.ByteString.Lazy.Char8 (pack)
import Data.Csv (FromRecord, decode, HasHeader(..))
import Data.Int (Int64)
import Data.Vector (Vector, length, fromList, (!))
import GHC.Generics (Generic)
import System.Random.Shuffle (shuffleM)

import TensorFlow.Core (TensorData, Session, Build, render, runWithFeeds, feed, unScalar, build,
  Tensor, encodeTensorData)
import TensorFlow.Minimize (minimizeWith, adam)
import TensorFlow.Ops (placeholder, truncatedNormal, add, matMul, relu,
  argMax, scalar, cast, oneHot, reduceMean, softmaxCrossEntropyWithLogits,
  equal, vector)
import TensorFlow.Session (runSession)
import TensorFlow.Variable (readValue, initializedVariable, Variable)``````

# Starting out with Haskell Tensor Flow

Last week we discussed the burgeoning growth of AI systems. We saw several examples of how those systems are impacting our lives more and more. I made the case that we ought to focus more on reliability when making architecture choices. After all, people’s lives might be at stake in the code we write now. Naturally, I suggested Haskell as a prime candidate for developing reliable AI systems.

So now we’ll actually write some Haskell machine learning code. We'll focus on the Tensor Flow bindings library. I first got familiar with this library back at BayHac in April. I’ve spent the last couple months learning both Tensor Flow as a whole and the Haskell library. In this first article, we’ll go over the basic concepts of Tensor Flow. We'll see how they’re implemented in Python (the most common language for TF). We'll then translate these concepts to Haskell.

Note this series will not be a general introduction to the concept of machine learning. There is a fantastic series on Medium about that called Machine Learning is Fun! If you’re interested in learning the basic concepts, I highly recommend you check out part 1 of that series. Many of the ideas in my own article series will be a lot clearer with that background.

## Tensors

Tensor Flow is a great name because it breaks the library down into the two essential concepts. First up are tensors. These are the primary vehicle of data representation in Tensor Flow. Low-dimensional tensors are actually quite intuitive. But there comes a point when you can’t really visualize what’s going on, so you have to let the theoretical idea guide you.

In the world of big data, we represent everything numerically. And when you have a group of numbers, a programmer’s natural instinct is to put those in an array.

``[1.0, 2.0, 3.0, 6.7]``

Now what do you do if you have a lot of different arrays of the same size and you want to associate them together? You make a 2-dimensional array (an array of arrays), which we also refer to as a matrix.

``````[[1.0, 2.0, 3.0, 6.7],
[5.0, 10.0, 3.0, 12.9],
[6.0, 12.0, 15.0, 13.6],
[7.0, 22.0, 8.0, 5.3]]``````

Most programmers are pretty familiar with these concepts. Tensors take this idea and keep extending it. What happens when you have a lot of matrices of the same size? You can group them together as an array of matrices. We could call this a three-dimensional matrix. But “tensor” is the term we’ll use for this data representation in all dimensions.

Every tensor has a degree. We could start with a single number. This is a tensor of degree 0. Then a normal array is a tensor of degree 1. Then a matrix is a tensor of degree 2. Our last example would be a tensor of degree 3. And you can keep adding these on to each other, ad infinitum.

Every tensor has a shape. The shape is an array representing the dimensions of the tensor. The length of this array will be the degree of the tensor. So a number will have the empty list as its shape. An array will have a list of length 1 containing the length of the array. A matrix will have a list of length 2 containing its number of rows and columns. And so on. There are a few different ways we can represent tensors in code, but we'll get to that in a bit.
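As a small preview of the code, here's how a few shapes look using the `Shape` type from the Haskell library (the values here are just examples):

``````import TensorFlow.Core (Shape(..))

-- The length of the shape list is the degree of the tensor
scalarShape, arrayShape, matrixShape, cubeShape :: Shape
scalarShape = Shape []        -- degree 0: a single number
arrayShape  = Shape [4]       -- degree 1: an array of 4 values
matrixShape = Shape [4, 4]    -- degree 2: a 4x4 matrix
cubeShape   = Shape [3, 4, 4] -- degree 3: three 4x4 matrices``````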

## Go with the Flow

The second important concept to understand is how Tensor Flow performs computations. Machine learning generally involves simple math operations. A lot of simple math operations. Since the scale is so large, we need to perform these operations as fast as possible. We need to use software and hardware that is optimized for this specific task. This necessitates having a low-level code representation of what’s going on. This is easier to achieve in a language like C, instead of Haskell or Python.

We could have the bulk of our code in Haskell, but perform the math in C using a Foreign Function Interface. But these interfaces have a large overhead, so this is likely to negate most of the gains we get from using C.

Tensor Flow’s solution to this problem is that we first build up a graph describing all our computations. Then once we have described that, we “run” our graph using a “session”. Thus it performs the entire language conversion process at once, so the overhead is lower.

If this sounds familiar, it's because this is how actions tend to work in Haskell (in some sense). We can, for instance, describe an IO action. And this action isn’t a series of commands that we execute the moment they show up in the code. Rather, the action is a description of the operations that our program will perform at some point. It’s also similar to the concept of Effectful programming. We’ll explore that topic in the future on this blog.
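To illustrate that "description versus execution" idea in plain Haskell (nothing Tensor Flow specific here):

``````-- 'greet' is only a description of an action; defining it prints nothing
greet :: IO ()
greet = putStrLn "Hello!"

-- The action only happens when the program actually runs the description
main :: IO ()
main = greet >> greet -- prints "Hello!" twice``````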

So what does our computation graph look like? Well, each tensor is a node. Then we can make other nodes for “operations” that take tensors as input. For instance, we can “add” two tensors together, and this is another node. We’ll see in our example how we build up the computational graph, and then run it.

One of the awesome features of Tensor Flow is the Tensor Board application. It allows you to visualize your graph of computations. We’ll see how to do this later in the series.

## Coding Tensors

So at this point we should start examining how we actually create tensors in our code. We’ll start by looking at how we do this in Python, since the concepts are a little easier to understand that way. There are three types of tensors we’ll consider. The first are “constants”. These represent a set of values that do not change. We can use these values throughout our model training process, and they'll be the same each time. Since we define the values for the tensor up front, there’s no need to give any size arguments. But we will specify the datatype that we’ll use for them.

``````import tensorflow as tf

node1 = tf.constant(3.0, dtype=tf.float32)
node2 = tf.constant(4.0, dtype=tf.float32)``````

Now what can we actually do with these tensors? Well for a quick sample, let’s try adding them. This creates a new node in our graph that represents the addition of these two tensors. Then we can “run” that addition node to see the result. To encapsulate all our information, we’ll create a “Session”:

``````import tensorflow as tf

node1 = tf.constant(3.0, dtype=tf.float32)
node2 = tf.constant(4.0, dtype=tf.float32)
adderNode = tf.add(node1, node2)

sess = tf.Session()
result = sess.run(adderNode)
print(result)

"""
Output:
7.0
"""``````

Next up are placeholders. These are values that we change each run. Generally, we will use these for the inputs to our model. By using placeholders, we'll be able to change the input and train on different values each time. When we “run” a session, we need to assign values to each of these nodes.

We don’t know the values that will go into a placeholder, but we still assign the type of data at construction. We can also assign a size if we like. So here’s a quick snippet showing how we initialize placeholders. Then we can assign different values with each run of the application. Even though our placeholder tensors don’t have values, we can still add them just as we could with constant tensors.

``````node1 = tf.placeholder(tf.float32)
node2 = tf.placeholder(tf.float32)
adderNode = tf.add(node1, node2)

sess = tf.Session()
result1 = sess.run(adderNode, {node1: 3, node2: 4.5 })
result2 = sess.run(adderNode, {node1: 2.7, node2: 8.9 })
print(result1)
print(result2)

"""
Output:
7.5
11.6
"""``````

The last type of tensor we’ll use are variables. These are the values that will constitute our “model”. Our goal is to find values for these parameters that will make our model fit the data well. We’ll supply a data type, as always. In this situation, we’ll also provide an initial constant value. Normally, we’d want to use a random distribution of some kind. The tensor won’t actually take on its value until we run a global variable initializer function. We’ll have to create this initializer and then have our session object run it before we get going.

``````w = tf.Variable(3, dtype=tf.float32)
b = tf.Variable(1, dtype=tf.float32)

sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)``````

Now let’s use our variables to create a “model” of sorts. For this article we'll make a simple linear model. Let’s create additional nodes for our input tensor and the model itself. We’ll let `w` be the weights, and `b` be the “bias”. This means we’ll construct our final value by `w*x + b`, where `x` is the input.

``````w = tf.Variable(3, dtype=tf.float32)
b = tf.Variable(1, dtype=tf.float32)
x = tf.placeholder(dtype=tf.float32)
linear_model = w * x + b``````

Now, we want to know how good our model is. So let’s compare it to `y`, an input of our expected values. We’ll take the difference, square it, and then use the `reduce_sum` library function to get our “loss”. The loss measures the difference between what we want our model to represent and what it actually represents.

``````w = tf.Variable(3, dtype=tf.float32)
b = tf.Variable(1, dtype=tf.float32)
x = tf.placeholder(dtype=tf.float32)
linear_model = w * x + b
y = tf.placeholder(dtype=tf.float32)
squared_deltas = tf.square(linear_model - y)
loss = tf.reduce_sum(squared_deltas)``````

Each line here is a different tensor, or a new node in our graph. We’ll finish up our model by using the built in `GradientDescentOptimizer` with a learning rate of 0.01. We’ll set our training step as attempting to minimize the loss function.

``````optimizer = tf.train.GradientDescentOptimizer(0.01)
train = optimizer.minimize(loss)``````

Now we’ll run the session, initialize the variables, and run our training step 1000 times. We’ll pass a series of inputs with their expected outputs. Let's try to learn the line `y = 5x - 1`. Our expected output `y` values will assume this.

``````sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)
for i in range(1000):
    sess.run(train, {x: [1, 2, 3, 4], y: [4, 9, 14, 19]})

print(sess.run([w, b]))``````

At the end we print the weights and bias, and we see our results!

``[array([ 4.99999475], dtype=float32), array([-0.99998516], dtype=float32)]``

So we can see that our learned values are very close to the correct values of 5 and -1!

## Representing Tensors in Haskell

So now at long last, I’m going to get into some of the details of how we apply these tensor concepts in Haskell. As with strings and numbers, we can’t get away with a single “Tensor” type in Haskell, since the concept covers some very different things. For a deeper look at the tensor types we’re dealing with, check out our in depth guide.
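To give a taste of this before we dive in, here's a hedged sketch showing two of the tensor types we'll encounter below (simplified to `Float`; the real library functions are polymorphic in the value type):

``````{-# LANGUAGE OverloadedLists #-}

import TensorFlow.Core (Build, Tensor, Value)
import TensorFlow.Ops (constant, placeholder)

-- A constant is a pure graph node we can build anywhere...
constantNode :: Tensor Build Float
constantNode = constant [1] [3 :: Float]

-- ...while a placeholder can only be created inside the Build monad
placeholderNode :: Build (Tensor Value Float)
placeholderNode = placeholder [1]``````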

In the meantime, let’s go through some simple code snippets replicating our efforts in Python. Here’s how we make a few constants and add them together. Do note the “overloaded lists” extension. It allows us to represent different types with the same syntax as lists. We use this with both `Shape` items and `Vectors`:

``````{-# LANGUAGE OverloadedLists #-}

import Data.Vector (Vector)
import TensorFlow.Ops (constant, add)
import TensorFlow.Session (runSession, run)

runSimple :: IO (Vector Float)
runSimple = runSession $ do
  let node1 = constant [1] [3 :: Float]
  let node2 = constant [1] [4 :: Float]
  let adderNode = node1 `add` node2
  run adderNode

main :: IO ()
main = do
  result <- runSimple
  print result

{-
Output:
[7.0]
-}``````

We use the `constant` function, which takes a `Shape` and then the value we want. We’ll create our addition node and then `run` it to get the output, which is a vector with a single float. We wrap everything in the `runSession` function. This encapsulates the initialization and running actions we saw in Python.

Now suppose we want placeholders. This is a little more complicated in Haskell. We’ll be using two placeholders, as we did in Python. We’ll initialize them with the `placeholder` function and a shape. We’ll take arguments to our function for the input values. To actually pass the parameters to fill in the placeholders, we have to use what we call a “feed”.

We know that our `adderNode` depends on two values. So we’ll write our run-step as a function that takes in two “feed” values, one for each placeholder. Then we’ll assign those feeds to the proper nodes using the `feed` function. We’ll put these in a list, and pass that list as an argument to `runWithFeeds`. Then, we wrap up by calling our run-step on our input data. We’ll have to `encode` the raw vectors as tensors though.

``````import TensorFlow.Core (Tensor, Value, feed, encodeTensorData)
import TensorFlow.Ops (constant, add, placeholder)
import TensorFlow.Session (runSession, run, runWithFeeds)

import Data.Vector (Vector)

runPlaceholder :: Vector Float -> Vector Float -> IO (Vector Float)
runPlaceholder input1 input2 = runSession $ do
  (node1 :: Tensor Value Float) <- placeholder [1]
  (node2 :: Tensor Value Float) <- placeholder [1]
  let adderNode = node1 `add` node2
  let runStep = \node1Feed node2Feed -> runWithFeeds
        [ feed node1 node1Feed
        , feed node2 node2Feed
        ]
        adderNode
  runStep (encodeTensorData [1] input1) (encodeTensorData [1] input2)

main :: IO ()
main = do
  result1 <- runPlaceholder [3.0] [4.5]
  result2 <- runPlaceholder [2.7] [8.9]
  print result1
  print result2

{-
Output:
[7.5]
[11.599999] -- Yay rounding issues!
-}``````

Now we’ll wrap up by going through the simple linear model scenario we already saw in Python. Once again, we’ll take two vectors as our inputs. These will be the values we try to match. Next, we’ll use the `initializedVariable` function to get our variables. We don’t need to call a global variable initializer, though this does affect the state of the session. Notice that we pull each variable out of the monad context, rather than binding it with `let`. (We did the same for placeholders.)

``````import TensorFlow.Core (Tensor, Value, feed, encodeTensorData, Scalar(..))
import TensorFlow.Ops (constant, add, placeholder, sub, reduceSum, mul)
import TensorFlow.GenOps.Core (square)
import TensorFlow.Variable (readValue, initializedVariable, Variable)
import TensorFlow.Session (runSession, run, runWithFeeds)
import TensorFlow.Minimize (gradientDescent, minimizeWith)

import qualified Data.Vector as Vector
import Data.Vector (Vector)

runVariable :: Vector Float -> Vector Float -> IO (Float, Float)
runVariable xInput yInput = runSession $ do
  let xSize = fromIntegral $ Vector.length xInput
  let ySize = fromIntegral $ Vector.length yInput
  (w :: Variable Float) <- initializedVariable 3
  (b :: Variable Float) <- initializedVariable 1
  …``````

Next, we’ll make our placeholders and linear model, and calculate our loss function much as we did before. Then we’ll use the same feed trick to plug in our placeholder values.

``````runVariable :: Vector Float -> Vector Float -> IO (Float, Float)
...
  (x :: Tensor Value Float) <- placeholder [xSize]
  let linear_model = ((readValue w) `mul` x) `add` (readValue b)
  (y :: Tensor Value Float) <- placeholder [ySize]
  let square_deltas = square (linear_model `sub` y)
  let loss = reduceSum square_deltas
  trainStep <- minimizeWith (gradientDescent 0.01) loss [w,b]
  let trainWithFeeds = \xF yF -> runWithFeeds
        [ feed x xF
        , feed y yF
        ]
        trainStep
  …``````

Finally, we’ll run our training step 1000 times on our input data. Then we’ll run our model one more time to pull out the values of our weights and bias. Then we’re done!

``````runVariable :: Vector Float -> Vector Float -> IO (Float, Float)
...
  replicateM_ 1000
    (trainWithFeeds (encodeTensorData [xSize] xInput) (encodeTensorData [ySize] yInput))
  (Scalar w_learned, Scalar b_learned) <- run (readValue w, readValue b)
  return (w_learned, b_learned)

main :: IO ()
main = do
  results <- runVariable [1.0, 2.0, 3.0, 4.0] [4.0, 9.0, 14.0, 19.0]
  print results

{-
Output:
(4.9999948,-0.99998516)
-}``````

## Conclusion

Hopefully this article gave you a taste of some of the possibilities of Tensor Flow in Haskell. We saw a quick introduction to the fundamentals of Tensor Flow. We saw three different kinds of tensors. We then saw code examples both in Python and in Haskell. Finally, we went over a very quick example of a simple linear model and saw how we could learn values to fit that model. Next week, we’ll do a more complicated learning problem. We’ll use the classic “Iris” flower data set and train a classifier using a full neural network.

If you want more details, you should check out our FREE Haskell Tensor Flow Guide. It will walk you through using the Tensor Flow library as a dependency and getting a basic model running!

Perhaps you’re completely new to Haskell but intrigued by the possibilities of using it for machine learning or anything else. You should download our Getting Started Checklist! It has some great resources on installing Haskell and learning the core concepts.

# The Future is Functional: Haskell and the AI-Native World

As regular readers of this blog know, I love talking about the future of Haskell as a language. I’m interested in ways we can shape the future of programming in a way that will help Haskell grow. I've mentioned network effects as a major hindrance a couple different times. Companies are reluctant to try Haskell since there aren't that many Haskell developers. As a result, fewer other developers will have the opportunity to get paid to learn Haskell. And the cycle continues.

Many perfectly smart people also have a bias against using Haskell in production code for a business. This stems from the idea that Haskell is an academic language. They see it as unsuited to “Real World” problems. The best rebuttal to this point is to show the many uses of Haskell in creating systems that people use every day. Now, I can sit here and point to the ease of creating web servers in Haskell. I could also point to the excellent mechanisms for designing front-end UIs. But there’s still one vital area in the future of programming that I have yet to address.

This is, of course, the world of AI and machine learning. AI is slowly (or not so slowly) becoming a primary concern for pretty much any software-based business. The last 5-10 years have seen the rise of “cloud native” architectures and systems. But we will soon be living in an age when all major software systems will use AI and machine learning at their core. In short, we are about to enter the AI Native Future, as my company’s founder put it.

This will be the first in a series of articles where I explore the uses of Haskell in writing AI applications. In the coming weeks I’ll be focusing on using the Tensor Flow bindings for Haskell. Tensor Flow allows programmers to build simple but powerful applications. There are many tutorials in Python, but the Haskell library is still in early stages. So I'll go through the core concepts of this library and show their usage in Haskell.

But for me it’s not enough to show that we can use Haskell for AI applications. That’s hardly going to move the needle or change the status quo. My ultimate goal is to prove that it’s the best tool for these kinds of systems. But first, let’s get an idea of where AI is being used, and why it’s so important.

## AI Will be Everywhere...

...and this doesn’t seem to be particularly controversial. Year-on-year there are more and more discoveries in AI research. We are now able to solve problems that were scarcely thinkable a few years ago. Advancements in NLP systems like IBM Watson have made it so that chatbots are popping up all over the place. Tensor Flow has put advanced deep learning techniques at the fingertips of every programmer. Systems are getting faster and faster.

On top of that, the implications for the general public are becoming more well known. Self-driving cars are roaming the streets in several major American cities. The idea of autonomous Amazon drones making deliveries seems a near certainty. Millions of people entrust part of their home systems to joint IoT/AI devices like Nest. AI is truly going to be everywhere soon.

## AI Needs to be Safe

But the ubiquity of AI presents a large number of concerns as well. Software engineers are going to have a lot more responsibility. Our code will have to make difficult and potentially life-altering decisions. For instance, the design of self-driving cars carries many philosophical implications.

The plot thickens even more when we mix AI with the Internet of Things, another exploding market. In the last year, an attack brought down large parts of the internet using a bot-net of IoT devices. In the world of IoT, security still does not have paramount importance. But soon, more and more people will have cameras, audio recording devices, fire alarms and security systems hooked up to the internet. When this happens, their safety and privacy will depend on IoT security.

The need for safety and security suggests we may need to re-think some software paradigms. "Move fast and break things" is the prevailing mindset in many quarters. But this idea doesn't look so good if "breaking things" means someone's house burns down.

## Pure, Functional Programming is the Path to Safety

So how does this relate to Haskell? Well, let’s consider the tradeoffs we face when we choose a language to develop in. Haskell, with its strong type system and compile-time guarantees, is more reliable. We can catch a lot more errors at compile time compared to languages like Javascript or Python. In those languages, non-syntactic issues tend to pop up only at runtime. Programmers must lean even more heavily on testing systems to catch possible errors. But testing is difficult, and there’s still plenty of disagreement about the best methodologies.

The flip side of this is that it’s somewhat easier to write code in a language like Javascript. It's easier to cut corners in the type system and have more “dynamic” objects. So while we all want “reliable” software, we’re often willing to compromise to get code off the ground faster. This is the epitome of the "Move fast and break things" mindset.

However, the explosion in the safety concerns of our software has elevated the stakes. If someone’s web browser crashes from Javascript, it's no big deal. The user will reload the page and hopefully not trigger that condition again. If your app stops responding, your user might get frustrated and you’ll lose a customer. But when programming starts penetrating other markets, any error could be catastrophic. If a self driving car encounters a sudden bug and the system crashes, many people could die. So it is our ethical responsibility to figure out ways to make it impossible for our software to encounter these kinds of errors.

Haskell is in many respects a very safe language. This is why it’s trusted by large financial institutions, large data science firms, and even by a company working in autonomous flight control. When your code cannot have arbitrary side effects, it is far easier to prevent it from crashing. It is also easier to secure a system (like an IoT device) when you can prevent leaks from arbitrary effects. Often these techniques are present in Haskell but not other languages.

The field of dependent types is yet another area where we’ll be able to add more security to our programming. They'll enable even more compile-time guarantees of behavior. This can add a great deal of safety when used well. Haskell doesn’t have full support for dependent types yet, but it is in the works. In the meantime there are languages like Idris with first class support.

Of course, when it comes to AI and deep learning, getting these guarantees will be difficult. It's one thing to build a type that ensures you have a vector of a particular length. It's quite another to build a type ensuring that when your car sees a dozen people standing ahead of it, it must brake. But these are the sorts of challenges programmers will need to face in the AI Native Future. And if we want to ensure Haskell’s place in that future, we’ll have to show these results are possible.

## Conclusion

It's obvious that AI and machine learning are the big fields of software engineering. They’ll continue to dominate the field for a long time. They can have an incredible impact on our lives. But by allowing this impact, we’re putting more of our safety in the hands of software engineers. This has major implications for how we develop software.

We often have to make tradeoffs between ease of development and reliability. A language like Haskell can offer us a lot of compile time guarantees about the behavior of our program. These guarantees are absent from many other languages. However, achieving these guarantees can introduce more pain into the development process.

But soon our code will be controlling things like self-driving cars, delivery drones, and home security devices. So we have an ethical responsibility to do everything in our power to make our code as reliable as possible. For this reason, Haskell is in a prime position when it comes to the AI Native Future. To take advantage of this, it will require a lot of work. Haskell programmers will have to develop language tools like dependent types to make Haskell even more reliable. We'll also have to contribute to libraries that will make it easy to write machine learning applications in Haskell.

With this in mind, the next few articles on this blog are all going to focus on using Haskell for machine learning. We’ll be starting by going through the basics of the Tensor Flow bindings for Haskell. You can get a sneak peek at some of that content by downloading our Haskell Tensor Flow tutorial!

If you’ve never programmed in Haskell before, you should try it out! We have two great resources for getting started. First, there’s the Getting Started Checklist. It will first walk you through downloading the language. Then it will point you in the directions of some other beginner materials. Second, there’s our Stack mini-course. This will walk you through the Stack tool, which makes it very easy to build projects, organize code, and get dependencies with Haskell.