Q-Learning Primer


This week, we're going to take the machine learning process in a different direction than I expected. In the last couple weeks, we've built a simple evaluation function for our world state. We could learn this function using an approach called Temporal Difference learning. We might come back to this approach at some point. But for now, we're actually going to try something a little different.

Instead, we're going to focus on a technique called Q-Learning. Instead of an evaluation function for the world, we're going to learn makePlayerMove. We'll keep most of the same function structure. We're still going to take our world and turn it into a feature space that we can represent as a numeric vector. But instead of producing a single output, we'll give a score for every move from that position. This week, we'll take the basic steps to ready our game for this approach.

As always, check out the Github repository repository for this project. This week's code is on the q-learning branch!

Next week, we'll finally get into some Tensor Flow code. Make sure you're ready for it by reading up on our Tensor Flow Guide!

Vectorizing Inputs

To learn a function, we need to be able to represent both the inputs and the outputs of our system as numeric vectors. We've already done most of the work here. Let's recall our evaluateWorld function. We'll keep the same feature values. But now we'll wrap them, instead of applying scores immediately:

data WorldFeatures = WorldFeatures
  { onActiveEnemy :: Int
  , shortestPathLength :: Int
  , manhattanDistance :: Int
  , enemiesOnPath :: Int
  , nearestEnemyDistance :: Int
  , numNearbyEnemies :: Int
  , stunAvailable :: Int
  , drillsRemaining :: Int

produceWorldFeatures :: World -> WorldFeatures
produceWorldFeatures w = WorldFeatures
  (if onActiveEnemy then 1 else 0)
  (if stunAvailable then 1 else 0)
  (fromIntegral drillsRemaining)
    -- Calculated as before
    onActiveEnemy = ...
    enemiesOnPath = ...
    shortestPathLength = ...
    nearestEnemyDistance = ...
    manhattanDistance = ...
    stunAvailable = ...
    numNearbyEnemies = ...
    drillsRemaining = ...

Now, in our ML code, we'll want to convert this into a vector. Using a vector will enable use to encode this information as a tensor.

vectorizeWorld :: World -> Vector Float
vectorizeWorld w = fromList (fromIntegral <$>
  [ wfOnActiveEnemy features
  , wfShortestPathLength features
  , wfManhattanDistance features
  , wfEnemiesOnPath features
  , wfNearestEnemyDistance features
  , wfNumNearbyEnemies features
  , wfStunAvailable features
  , wfDrillsRemaining features
    features = produceWorldFeatures w

Vectorizing Outputs

Now we have the inputs to our tensor system. We'll ultimately get a vector of outputs as a result. We want this vector to provide a score for every move. We have 10 potential moves in general. There are five "movement" directions, moving up, right, down, left, and standing still. Then for each direction, we can either use our stun or not. We'll use a drill when the movement direction sends us against a wall. Certain moves won't be available in certain situations. But our size should account for them all.

Our function will often propose invalid moves. For example, it might suggest using the stun while on cooldown, or drilling when we don't have one. In these cases, our game logic should dictate that our player doesn't move. Hopefully this trains our network to make correct moves. If we wanted to, we could even apply a slight negative reward for these.

What we need is the ability to convert a vector of outputs into a move. Once we fix the vector size, this is not difficult. As a slight hack, this function will always give the same direction for moving and drilling. We'll let the game logic determine if the drill needs to apply.

moveFromOutput :: Vector Int -> PlayerMove
moveFromOutput vals = PlayerMove moveDirection useStun moveDirection
    bestMoveIndex = maxIndex vals
    moveDirection = case bestMoveIndex `mod` 5 of
      0 -> DirectionUp
      1 -> DirectionRight
      2 -> DirectionDown
      3 -> DirectionLeft
      4 -> DirectionNone
    useStun = bestMoveIndex > 4

Discrete Updates

Now that we can get numeric vectors for everything, we need to be able to step through the world, one player move at a time. We currently have a generic update function. Depending on the time, it might step the player forward, or it might not. We want to change this so there are two steps. First we receive a player's move, and then we step the world forward until it is time for the next player move:

stepWorld :: PlayerMove -> World -> World

This isn't too difficult; it just requires a little shuffling of our existing code. First, we'll add a new applyPlayerMove' function. This will take the existing applyPlayerMove and add a little bit of validation to it:

applyPlayerMove' :: PlayerMove -> World -> World
applyPlayerMove' move w = if isValidMove
  then worldAfterMove
  else w
    player = worldPlayer w
    currentLoc = playerLocation player

    worldAfterDrill = modifyWorldForPlayerDrill w
      (drillDirection move)

    worldAfterStun = if activateStun move
      then modifyWorldForStun worldAfterDrill
      else worldAfterDrill

    newLocation = nextLocationForMove
      (worldBoundaries worldAfterDrill Array.! currentLoc) 
      (playerMoveDirection move)

    isValidStunUse = if activateStun move
      then playerCurrentStunDelay player == 0
      else True
    isValidMovement = playerMoveDirection move == DirectionNone ||
      newLocation /= currentLoc
    isValidMove = isValidStunUse && isValidMovement

    worldAfterMove =
      modifyWorldForPlayerMove worldAfterStun newLocation

Now we'll add an updateEnvironment function. This will perform all the work of our updateFunc except for moving the player.

updateEnvironment :: World -> World
updateEnvironment w
  | playerLocation player == endLocation w =
    w { worldResult = GameWon }
  | playerLocation player `elem` activeEnemyLocations =
    w { worldResult = GameLost }
  | otherwise =
    updateWorldForEnemyTicks .
    updateWorldForPlayerTick .
    updateWorldForEnemyMoves .
    clearStunCells .
    incrementWorldTime $ w
    player = worldPlayer w
    activeEnemyLocations = enemyLocation <$>
      filter (\e -> enemyCurrentStunTimer e == 0) (worldEnemies w)

Now we combine these. First we'll make the player's move. Then we'll update the environment once for each tick of the player's "lag" time.

stepWorld :: PlayerMove -> World -> World
stepWorld move w = execStateM (sequence updateActions) worldAfterMove
    worldAfterMove = applyPlayerMove' move w
    updateActions = replicate
      ( fromIntegral .
        lagTime .
        playerGameParameters .
        worldParameters $ w)
      (modify updateEnvironment)

And these are all the modifications we'll need to get going!

Q Learning Teaser

Now we can start thinking about the actual machine learning process. We'll get into a lot more detail next week. But for now, let's think about a particular training iteration. We'll want to use our existing network to step forward into the game. This will produce a certain "reward", and leave the game state in a new position. Then we'll get more values for our next moves out of that position. We'll use the updated move scores and the reward to learn better values for our function weights.

Of course, the immediate "reward" values for most moves will be 0. The only moves that will carry a reward will be those where we either win the game or lose the game. So it could take a while for our program to learn good behaviors. It will take time for the "end" behaviors of the game to affect normal moves. For this reason, we'll start our training on much smaller mazes than the primary game. This should help speed up the training process.


Next week, we'll take our general framework for Q-Learning and apply it within Tensor Flow. We'll get the basics of Q-Learning down with a couple different types of models. For a wider perspective on Haskell and AI problems, make sure to check out our Haskell AI Series!