Looking Ahead with More Steps!
In last week's article, we set ourselves up to make our agent use temporal difference learning. But TD is actually a whole family of learning methods, and it intertwines with other concepts in the broader category of reinforcement learning algorithms.
In this article, we'll consider a couple of possible TD approaches. We'll also examine a bit of the theory surrounding other reinforcement learning topics.
For a more high-level overview of using Haskell and AI, take a look at our Haskell AI Series! That series will also help you grasp some of the basics of TensorFlow.
One-Step Temporal Difference
Temporal difference learning has one general principle. The evaluation of the current game position should be similar to the evaluation of positions in the future. So at any given step we have our "current" evaluation. Then we have a "target" evaluation based on the future. We want to train our network so that the current board gets evaluated more like the target value.
We can see this in the way we defined our model. The tdTrainStep takes two different values: the target evaluation and the current evaluation.
data TDModel = TDModel
  { ...
  , tdTrainStep :: TensorData Float -> TensorData Float -> Session ()
  }
And in fact, doing this calculation isn't so different from what we've done before. We'll take the difference between these evaluations, square it, and use reduceSum. This gives our loss function. Then we'll have TensorFlow minimize the loss function.
createTDModel :: Session TDModel
createTDModel = do
  ...
  -- Train Model
  targetEval <- placeholder (Shape [1])
  currentEval <- placeholder (Shape [1])
  let diff = targetEval `sub` currentEval
  let loss = reduceSum (diff `mul` diff)
  trainer <- minimizeWith
    adam loss [hiddenWeights, hiddenBias, outputWeights, outputBias]
  let trainStep = \targetEvalFeed currentEvalFeed ->
        runWithFeeds
          [feed targetEval targetEvalFeed, feed currentEval currentEvalFeed]
          trainer
  return $ TDModel
    { ...
    , tdTrainStep = trainStep
    }
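To see how this training step fits together, here's a rough usage sketch. The trainOnce helper is hypothetical (it isn't part of our actual code), but it shows the shape of a single call: we encode both evaluations as one-element tensors and feed them to tdTrainStep.
trainOnce :: TDModel -> Float -> Float -> Session ()
trainOnce model targetScore currentScore = do
  -- Encode each scalar evaluation as a one-element tensor
  let targetData = encodeTensorData
        (Shape [1]) (Data.Vector.singleton targetScore)
  let currentData = encodeTensorData
        (Shape [1]) (Data.Vector.singleton currentScore)
  -- Run one optimization step on the squared difference
  (tdTrainStep model) targetData currentData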
Let's now recall how we got our target value last week. We looked at all our possible moves and used them to advance the world one step. We then took the best outcome of those, and that was our target value. Because we're advancing one step into the future, we call this "one-step" TD learning.
Adding More Steps
But there's no reason we can't look further into the future! We can consider what the game will look like in two moves, not just one! To do this, let's generalize our function for stepping forward. It will be stateful over the same parameters as our main iteration function, but we'll call it in a way that doesn't affect our main state.
We'll make one change to our approach from last time. If a resulting game is over, we'll immediately assign the "correct" evaluation value, rather than applying it later as our old approach did. Our new function will return the score from advancing the game, the game result, and the World at this step.
advanceWorldAndGetScore :: Float -> TDModel
  -> StateT (World, StdGen) Session (Float, GameResult, World)
advanceWorldAndGetScore randomChance model = do
  (currentWorld, gen) <- get
  let allMoves = possibleMoves currentWorld
  let newWorlds = fst <$> map (flip stepWorld currentWorld) allMoves
  allScoresAndResults <- Data.Vector.fromList <$>
    (forM newWorlds $ \w -> case worldResult w of
      GameLost -> return (0.0, GameLost)
      GameWon -> return (1.0, GameWon)
      GameInProgress -> do
        let worldData = encodeTensorData
              (Shape [1, inputDimen]) (vectorizeWorld8 w)
        scoreVector <- lift $ (tdEvaluateWorldStep model) worldData
        return (Data.Vector.head scoreVector, GameInProgress))
  let (chosenIndex, newGen) = bestIndexOrRandom allScoresAndResults gen
  put (newWorlds !! chosenIndex, newGen)
  let (finalScore, finalResult) = allScoresAndResults ! chosenIndex
  return (finalScore, finalResult, newWorlds !! chosenIndex)
  where
    -- Same as before, except with resultOrdering
    bestIndexOrRandom :: Vector (Float, GameResult) -> StdGen
      -> (Int, StdGen)
    ...
    -- First order by result (Win > InProgress > Loss), then score
    resultOrdering :: (Float, GameResult) -> (Float, GameResult)
      -> Ordering
    ...
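The helper bodies are elided above, but based on the comment, here's one plausible sketch of resultOrdering. This is an illustration of the described ordering (wins first, then games in progress, then losses, with the score breaking ties), not necessarily the exact original implementation:
-- Illustration only: rank wins above in-progress games above losses,
-- then fall back to comparing the evaluation scores.
resultOrdering :: (Float, GameResult) -> (Float, GameResult) -> Ordering
resultOrdering (score1, result1) (score2, result2) =
  compare (resultRank result1, score1) (resultRank result2, score2)
  where
    resultRank :: GameResult -> Int
    resultRank GameLost = 0
    resultRank GameInProgress = 1
    resultRank GameWon = 2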
Now we'll call this from our primary iteration function. It seems a little strange. We unwrap the World from our state only to re-wrap it in another state call. But it will make more sense in a second!
runWorldIteration :: Float -> TDModel
  -> StateT (World, StdGen) Session Bool
runWorldIteration randomChance model = do
  (currentWorld, gen) <- get
  ((chosenNextScore, finalResult, nextWorld), (_, newGen)) <-
    lift $ runStateT
      (advanceWorldAndGetScore randomChance model)
      (currentWorld, gen)
So at the moment, our code is still doing one-step temporal difference. But here's the key: we can now sequence our state action to look further into the future, which gives us several scores to combine into our target. Here's what it looks like for us to look two moves ahead and take the average of all the scores we get:
runWorldIteration :: Float -> TDModel
  -> StateT (World, StdGen) Session Bool
runWorldIteration randomChance model = do
  (currentWorld, gen) <- get
  let numSteps = 2
  let repeatedUpdates = sequence $ replicate numSteps
        (advanceWorldAndGetScore randomChance model)
  (allWorldResults, (_, newGen)) <- lift $
    runStateT repeatedUpdates (currentWorld, gen)
  let allScores = map (\(s, _, _) -> s) allWorldResults
  let averageScore = sum allScores / fromIntegral (length allScores)
  let nextScoreData = encodeTensorData
        (Shape [1]) (Data.Vector.singleton averageScore)
  ...
When it comes to continuing the function though, we only consider the first world and result:
runWorldIteration :: Float -> TDModel
  -> StateT (World, StdGen) Session Bool
runWorldIteration randomChance model = do
  ...
  let (_, result1, nextWorld1) = Prelude.head allWorldResults
  put (nextWorld1, newGen)
  case result1 of
    GameLost -> return False
    GameWon -> return True
    GameInProgress -> runWorldIteration randomChance model
We could take more steps if we wanted! We could also change how we get our target score. We could give more weight to near-future scores, or we could give more weight to scores in the far future. These are all just parameters we can tune now. We can now refer to our temporal difference algorithm as "n-step", rather than "one-step".
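For instance, here's a small sketch of what a decay-weighted score could look like in place of the plain average. The weightedScore helper and its decay parameter are made up for illustration:
-- Illustration only: weight each step's score by a decay factor so that
-- near-future scores count more. With decay = 1.0 this reduces to the
-- plain average we used above; a decay above 1.0 would favor the far future.
weightedScore :: Float -> [Float] -> Float
weightedScore decay scores = sum (zipWith (*) weights scores) / sum weights
  where
    weights = take (length scores) (iterate (* decay) 1.0)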
Monte Carlo vs. Dynamic Programming
With different parameters, our TD approach can resemble other common learning approaches. Dynamic programming is an approach where we adjust our weights after each move in the game. We expect the value of a particular state to be similar to the values of near-future states. We use the term "bootstrapping" for "online" learning approaches like this, and TD learning also applies bootstrapping.
However, dynamic programming requires that we have a strong model of the world. That is, we would have to know the probability of getting into certain states from our current state. This lets us predict the future more accurately. We could apply this approach to our maze game on a small enough grid. But the model size would increase exponentially with the grid size and the number of enemies! So our approach doesn't actually do this. We can advance the world with a particular move, but we don't have a comprehensive model of how the world works.
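To make the distinction concrete, a full model would need something like the hypothetical type below (using Move for whatever type possibleMoves returns). Our game only lets us step forward with a single move; it doesn't enumerate all the possible outcomes and their probabilities.
-- Hypothetical type, for illustration only: a complete model would map a
-- state and a move to every possible next state with its probability.
-- Our maze game gives us nothing like this.
type WorldModel = World -> Move -> [(World, Float)]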
In this regard, TD learning is more like Monte Carlo learning. This algorithm is "model free". But it is not an online algorithm! We must play through an entire episode of the game before we can update the weights. We could take our "n-step" approach above and play it out over the course of the entire game. If we then gave full weight to the final evaluation, our model would behave like Monte Carlo!
In general, the more steps we add to our TD approach, the more it approximates Monte Carlo learning. The fewer steps we have, the more it looks like dynamic programming.
TD Lambda
TD Gammon, the algorithm we mentioned last time, uses a variation of TD learning called "TD Lambda". It involves looking both forward and backward in time. The key observation is that the best solutions often lie between the extremes of one-step TD and Monte Carlo.
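As a rough sketch of the forward-view idea (not TD Gammon's actual code), the lambda parameter blends the different n-step targets with geometrically decreasing weights. Setting lambda to 0 leaves only the one-step target, while setting it to 1 leaves only the final, full-episode target, as in Monte Carlo learning:
-- Illustration only: blend the 1-step through n-step targets, giving the
-- final (end-of-episode) target whatever weight remains.
lambdaReturn :: Float -> [Float] -> Float -> Float
lambdaReturn lambda nStepTargets finalTarget =
  (1 - lambda) * sum (zipWith (*) weights nStepTargets)
    + lambda ^ length nStepTargets * finalTarget
  where
    weights = iterate (* lambda) 1.0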
Academic literature can help give a more complete picture of machine learning. One great text is Reinforcement Learning, by Sutton and Barto. It's one of the authoritative texts on the topics we've discussed in this article!
What's Next
This concludes our exploration of AI within the context of our Maze Game. We'll come back to AI and Machine Learning again soon. Next week, we'll start tackling a new subject in the realm of functional programming, something we've never looked at before on this blog! Stay tuned!