Looking Ahead with More Steps!
In last week's article, we set ourselves up to make our agent use temporal difference learning. But TD is actually a whole family of learning methods, and it intertwines with other concepts in the broader category of reinforcement learning algorithms.
In this article, we'll consider a couple of possible TD approaches. We'll also examine a bit of the theory surrounding other reinforcement learning topics.
For a more high-level overview of using Haskell and AI, take a look at our Haskell AI Series! That series will also help you grasp some of the basics of TensorFlow.
One-Step Temporal Difference
Temporal difference learning has one general principle. The evaluation of the current game position should be similar to the evaluation of positions in the future. So at any given step we have our "current" evaluation. Then we have a "target" evaluation based on the future. We want to train our network so that the current board gets evaluated more like the target value.
We can see this in the way we defined our model. The tdTrainStep takes two different values: the target evaluation and the current evaluation.
data TDModel = TDModel
  { ...
  , tdTrainStep :: TensorData Float -> TensorData Float -> Session ()
  }
And in fact, doing this calculation isn't so different from what we've done before. We'll take the difference between these evaluations, square it, and use reduceSum. This gives our loss function. Then we'll have TensorFlow minimize the loss function.
createTDModel :: Session TDModel
createTDModel = do
  ...
  -- Train Model
  targetEval <- placeholder (Shape [1])
  currentEval <- placeholder (Shape [1])
  let diff = targetEval `sub` currentEval
  let loss = reduceSum (diff `mul` diff)
  trainer <- minimizeWith
    adam loss [hiddenWeights, hiddenBias, outputWeights, outputBias]
  let trainStep = \targetEvalFeed currentEvalFeed ->
        runWithFeeds
          [feed targetEval targetEvalFeed, feed currentEval currentEvalFeed]
          trainer
  return $ TDModel
    { ...
    , tdTrainStep = trainStep
    }
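To see how this training step fits together, here's a rough usage sketch. The trainOnce helper is hypothetical (it isn't part of our actual code), but it shows the shape of a single call: we encode both evaluations as one-element tensors and feed them to tdTrainStep.
trainOnce :: TDModel -> Float -> Float -> Session ()
trainOnce model targetScore currentScore = do
  -- Encode each scalar evaluation as a one-element tensor
  let targetData = encodeTensorData
        (Shape [1]) (Data.Vector.singleton targetScore)
  let currentData = encodeTensorData
        (Shape [1]) (Data.Vector.singleton currentScore)
  -- Run one optimization step on the squared difference
  (tdTrainStep model) targetData currentData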
Let's now recall how we got our target value last week. We looked at all our possible moves and used them to advance the world one step. We then took the best outcome of those, and that was our target value. Because we're advancing one step into the future, we call this "one-step" TD learning.
Adding More Steps
But there's no reason we can't look further into the future! We can consider what the game will look like in two moves, not just one! To do this, let's generalize our function for stepping forward. It will be stateful over the same parameters as our main iteration function, but we'll call it in a way that doesn't affect our main state.
We'll make one change to our approach from last time. If a resulting game is over, we'll immediately assign the "correct" evaluation value, rather than applying it later as our old approach did. Our new function will return the score from advancing the game, the game result, and the World at this step.
advanceWorldAndGetScore :: Float -> TDModel
  -> StateT (World, StdGen) Session (Float, GameResult, World)
advanceWorldAndGetScore randomChance model = do
  (currentWorld, gen) <- get
  let allMoves = possibleMoves currentWorld
  let newWorlds = fst <$> map (flip stepWorld currentWorld) allMoves
  allScoresAndResults <- Data.Vector.fromList <$>
    (forM newWorlds $ \w -> case worldResult w of
      GameLost -> return (0.0, GameLost)
      GameWon -> return (1.0, GameWon)
      GameInProgress -> do
        let worldData = encodeTensorData
              (Shape [1, inputDimen]) (vectorizeWorld8 w)
        scoreVector <- lift $ (tdEvaluateWorldStep model) worldData
        return (Data.Vector.head scoreVector, GameInProgress))
  let (chosenIndex, newGen) = bestIndexOrRandom allScoresAndResults gen
  put (newWorlds !! chosenIndex, newGen)
  let (finalScore, finalResult) = allScoresAndResults ! chosenIndex
  return (finalScore, finalResult, newWorlds !! chosenIndex)
  where
    -- Same as before, except with resultOrdering
    bestIndexOrRandom :: Vector (Float, GameResult) -> StdGen
      -> (Int, StdGen)
    ...
    -- First order by result (Win > InProgress > Loss), then score
    resultOrdering :: (Float, GameResult) -> (Float, GameResult)
      -> Ordering
    ...
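The helper bodies are elided above, but based on the comment, here's one plausible sketch of resultOrdering. This is an illustration of the described ordering (wins first, then games in progress, then losses, with the score breaking ties), not necessarily the exact original implementation:
-- Illustration only: rank wins above in-progress games above losses,
-- then fall back to comparing the evaluation scores.
resultOrdering :: (Float, GameResult) -> (Float, GameResult) -> Ordering
resultOrdering (score1, result1) (score2, result2) =
  compare (resultRank result1, score1) (resultRank result2, score2)
  where
    resultRank :: GameResult -> Int
    resultRank GameLost = 0
    resultRank GameInProgress = 1
    resultRank GameWon = 2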
Now we'll call this from our primary iteration function. It seems a little strange. We unwrap the World from our state only to re-wrap it in another state call. But it will make more sense in a second!
runWorldIteration :: Float -> TDModel
  -> StateT (World, StdGen) Session Bool
runWorldIteration randomChance model = do
  (currentWorld, gen) <- get
  ((chosenNextScore, finalResult, nextWorld), (_, newGen)) <-
    lift $ runStateT
      (advanceWorldAndGetScore randomChance model)
      (currentWorld, gen)
So at the moment, our code is still doing one-step temporal difference. But here's the key: we can now sequence our state action to look further into the future, which gives us several scores to combine into our target. Here's what it looks like for us to look two moves ahead and take the average of all the scores we get:
runWorldIteration :: Float -> TDModel
  -> StateT (World, StdGen) Session Bool
runWorldIteration randomChance model = do
  (currentWorld, gen) <- get
  let numSteps = 2
  let repeatedUpdates = sequence $ replicate numSteps
        (advanceWorldAndGetScore randomChance model)
  (allWorldResults, (_, newGen)) <- lift $
    runStateT repeatedUpdates (currentWorld, gen)
  let allScores = map (\(s, _, _) -> s) allWorldResults
  let averageScore = sum allScores / fromIntegral (length allScores)
  let nextScoreData = encodeTensorData
        (Shape [1]) (Data.Vector.singleton averageScore)
  ...
When it comes to continuing the function though, we only consider the first world and result:
runWorldIteration :: Float -> TDModel
  -> StateT (World, StdGen) Session Bool
runWorldIteration randomChance model = do
  ...
  let (_, result1, nextWorld1) = Prelude.head allWorldResults
  put (nextWorld1, newGen)
  case result1 of
    GameLost -> return False
    GameWon -> return True
    GameInProgress -> runWorldIteration randomChance model
We could take more steps if we wanted! We could also change how we get our target score. We could give more weight to near-future scores, or we could give more weight to scores in the far future. These are all just parameters we can tune now. We can now refer to our temporal difference algorithm as "n-step", rather than "one-step".
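For instance, here's a small sketch of what a decay-weighted score could look like in place of the plain average. The weightedScore helper and its decay parameter are made up for illustration:
-- Illustration only: weight each step's score by a decay factor so that
-- near-future scores count more. With decay = 1.0 this reduces to the
-- plain average we used above; a decay above 1.0 would favor the far future.
weightedScore :: Float -> [Float] -> Float
weightedScore decay scores = sum (zipWith (*) weights scores) / sum weights
  where
    weights = take (length scores) (iterate (* decay) 1.0)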
Monte Carlo vs. Dynamic Programming
With different parameters, our TD approach can resemble other common learning approaches. Dynamic programming is an approach where we adjust our weights after each move in the game. We expect the value of a particular state to be similar to the values of near-future states. We use the term "bootstrapping" for "online" learning approaches like this, and TD learning also applies bootstrapping.
However, dynamic programming requires that we have a strong model of the world. That is, we would have to know the probability of getting into certain states from our current state. This lets us predict the future more accurately. We could apply this approach to our maze game on a small enough grid. But the model size would increase exponentially with the grid size and the number of enemies! So our approach doesn't actually do this. We can advance the world with a particular move, but we don't have a comprehensive model of how the world works.
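To make the distinction concrete, a full model would need something like the hypothetical type below (using Move for whatever type possibleMoves returns). Our game only lets us step forward with a single move; it doesn't enumerate all the possible outcomes and their probabilities.
-- Hypothetical type, for illustration only: a complete model would map a
-- state and a move to every possible next state with its probability.
-- Our maze game gives us nothing like this.
type WorldModel = World -> Move -> [(World, Float)]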
In this regard, TD learning is more like Monte Carlo learning. This algorithm is "model free". But it is not an online algorithm! We must play through an entire episode of the game before we can update the weights. We could take our "n-step" approach above and play it out over the course of the entire game. If we then gave full weight to the final evaluation, our model would behave like Monte Carlo!
In general, the more steps we add to our TD approach, the more it approximates Monte Carlo learning. The fewer steps we have, the more it looks like dynamic programming.
TD Lambda
TD Gammon, the algorithm we mentioned last time, uses a variation of TD learning called "TD Lambda". It involves looking both forward and backward in time. The key observation is that the best solutions often lie between the extremes of one-step TD and Monte Carlo.
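As a rough sketch of the forward-view idea (not TD Gammon's actual code), the lambda parameter blends the different n-step targets with geometrically decreasing weights. Setting lambda to 0 leaves only the one-step target, while setting it to 1 leaves only the final, full-episode target, as in Monte Carlo learning:
-- Illustration only: blend the 1-step through n-step targets, giving the
-- final (end-of-episode) target whatever weight remains.
lambdaReturn :: Float -> [Float] -> Float -> Float
lambdaReturn lambda nStepTargets finalTarget =
  (1 - lambda) * sum (zipWith (*) weights nStepTargets)
    + lambda ^ length nStepTargets * finalTarget
  where
    weights = iterate (* lambda) 1.0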
Academic literature can help give a more complete picture of machine learning. One great text is Reinforcement Learning, by Sutton and Barto. It's one of the authoritative texts on the topics we've discussed in this article!
What's Next
This concludes our exploration of AI within the context of our Maze Game. We'll come back to AI and Machine Learning again soon. Next week, we'll start tackling a new subject in the realm of functional programming, something we've never looked at before on this blog! Stay tuned!