Q-Learning Primer
This week, we're going to take the machine learning process in a different direction than I expected. In the last couple weeks, we've built a simple evaluation function for our world state. We could learn this function using an approach called Temporal Difference learning. We might come back to this approach at some point. But for now, we're actually going to try something a little different.
Instead, we're going to focus on a technique called Q-Learning. Rather than learning an evaluation function for the world, we're going to learn makePlayerMove. We'll keep most of the same function structure. We're still going to take our world and turn it into a feature space that we can represent as a numeric vector. But instead of producing a single output, we'll give a score for every move from that position. This week, we'll take the basic steps to get our game ready for this approach.
As always, check out the GitHub repository for this project. This week's code is on the q-learning branch!
Next week, we'll finally get into some Tensor Flow code. Make sure you're ready for it by reading up on our Tensor Flow Guide!
Vectorizing Inputs
To learn a function, we need to be able to represent both the inputs and the outputs of our system as numeric vectors. We've already done most of the work here. Let's recall our evaluateWorld function. We'll keep the same feature values. But now we'll wrap them, instead of applying scores immediately:
data WorldFeatures = WorldFeatures
  { wfOnActiveEnemy :: Int
  , wfShortestPathLength :: Int
  , wfManhattanDistance :: Int
  , wfEnemiesOnPath :: Int
  , wfNearestEnemyDistance :: Int
  , wfNumNearbyEnemies :: Int
  , wfStunAvailable :: Int
  , wfDrillsRemaining :: Int
  }
produceWorldFeatures :: World -> WorldFeatures
produceWorldFeatures w = WorldFeatures
(if onActiveEnemy then 1 else 0)
shortestPathLength
manhattanDistance
enemiesOnPath
nearestEnemyDistance
numNearbyEnemies
(if stunAvailable then 1 else 0)
(fromIntegral drillsRemaining)
where
-- Calculated as before
onActiveEnemy = ...
enemiesOnPath = ...
shortestPathLength = ...
nearestEnemyDistance = ...
manhattanDistance = ...
stunAvailable = ...
numNearbyEnemies = ...
drillsRemaining = ...
Now, in our ML code, we'll want to convert this structure into a vector. Using a vector will enable us to encode this information as a tensor.
vectorizeWorld :: World -> Vector Float
vectorizeWorld w = fromList (fromIntegral <$>
[ wfOnActiveEnemy features
, wfShortestPathLength features
, wfManhattanDistance features
, wfEnemiesOnPath features
, wfNearestEnemyDistance features
, wfNumNearbyEnemies features
, wfStunAvailable features
, wfDrillsRemaining features
])
where
features = produceWorldFeatures w
Vectorizing Outputs
Now we have the inputs to our tensor system. We'll ultimately get a vector of outputs as a result. We want this vector to provide a score for every move. In general, we have 10 potential moves. There are five "movement" options: moving up, right, down, left, or standing still. Then for each of these, we can either use our stun or not. We'll use a drill when the movement direction sends us against a wall. Certain moves won't be available in certain situations, but our vector size should account for them all.
Our function will often propose invalid moves. For example, it might suggest using the stun while on cooldown, or drilling when we don't have one. In these cases, our game logic should dictate that our player doesn't move. Hopefully this trains our network to make correct moves. If we wanted to, we could even apply a slight negative reward for these.
What we need is the ability to convert a vector of outputs into a move. Once we fix the vector size, this is not difficult. As a slight hack, this function will always give the same direction for moving and drilling. We'll let the game logic determine if the drill needs to apply.
moveFromOutput :: Vector Int -> PlayerMove
moveFromOutput vals = PlayerMove moveDirection useStun moveDirection
where
bestMoveIndex = maxIndex vals
moveDirection = case bestMoveIndex `mod` 5 of
0 -> DirectionUp
1 -> DirectionRight
2 -> DirectionDown
3 -> DirectionLeft
4 -> DirectionNone
useStun = bestMoveIndex > 4
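As a quick sanity check, consider an output vector whose largest value sits at index 6. Since 6 mod 5 is 1 and 6 is greater than 4, that index decodes to "move right and use the stun". Here's a hypothetical example (not repository code), assuming the constructors shown above:

-- Index 6 holds the highest score, so we move right with the stun active.
moveFromOutput (fromList [0, 1, 0, 0, 0, 0, 9, 0, 0, 0])
-- => PlayerMove DirectionRight True DirectionRight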
Discrete Updates
Now that we can get numeric vectors for everything, we need to be able to step through the world, one player move at a time. We currently have a generic update function. Depending on the time, it might step the player forward, or it might not. We want to change this so there are two steps. First we receive a player's move, and then we step the world forward until it is time for the next player move:
stepWorld :: PlayerMove -> World -> World
This isn't too difficult; it just requires a little shuffling of our existing code. First, we'll add a new applyPlayerMove' function. This will take the existing applyPlayerMove and add a little bit of validation to it:
applyPlayerMove' :: PlayerMove -> World -> World
applyPlayerMove' move w = if isValidMove
then worldAfterMove
else w
where
player = worldPlayer w
currentLoc = playerLocation player
worldAfterDrill = modifyWorldForPlayerDrill w
(drillDirection move)
worldAfterStun = if activateStun move
then modifyWorldForStun worldAfterDrill
else worldAfterDrill
newLocation = nextLocationForMove
(worldBoundaries worldAfterDrill Array.! currentLoc)
currentLoc
(playerMoveDirection move)
isValidStunUse = if activateStun move
then playerCurrentStunDelay player == 0
else True
isValidMovement = playerMoveDirection move == DirectionNone ||
newLocation /= currentLoc
isValidMove = isValidStunUse && isValidMovement
worldAfterMove =
modifyWorldForPlayerMove worldAfterStun newLocation
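For instance, suppose our network tells us to activate the stun while the stun delay is still counting down. The move fails validation, and we simply get the original world back. A hypothetical illustration (not repository code):

-- If playerCurrentStunDelay (worldPlayer w) > 0, the stun use is invalid,
-- so the world comes back unchanged:
applyPlayerMove' (PlayerMove DirectionNone True DirectionNone) w
-- => w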
Now we'll add an updateEnvironment function. This will perform all the work of our updateFunc except for moving the player.
updateEnvironment :: World -> World
updateEnvironment w
| playerLocation player == endLocation w =
w { worldResult = GameWon }
| playerLocation player `elem` activeEnemyLocations =
w { worldResult = GameLost }
| otherwise =
updateWorldForEnemyTicks .
updateWorldForPlayerTick .
updateWorldForEnemyMoves .
clearStunCells .
incrementWorldTime $ w
where
player = worldPlayer w
activeEnemyLocations = enemyLocation <$>
filter (\e -> enemyCurrentStunTimer e == 0) (worldEnemies w)
Now we combine these. First we'll make the player's move. Then we'll update the environment once for each tick of the player's "lag" time.
stepWorld :: PlayerMove -> World -> World
stepWorld move w = execState (sequence_ updateActions) worldAfterMove
where
worldAfterMove = applyPlayerMove' move w
updateActions = replicate
( fromIntegral .
lagTime .
playerGameParameters .
worldParameters $ w)
(modify updateEnvironment)
And these are all the modifications we'll need to get going!
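To see how these pieces will fit together, here's a rough sketch of a single play step. The scoreMoves argument stands in for the Tensor Flow model we'll build next week; it's a placeholder for illustration, not part of this week's code.

-- Hypothetical glue code: score the current position, decode the best
-- move, and step the world forward to the next decision point.
playStep :: (Vector Float -> Vector Int) -> World -> World
playStep scoreMoves w = stepWorld chosenMove w
  where
    chosenMove = moveFromOutput (scoreMoves (vectorizeWorld w))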
Q-Learning Teaser
Now we can start thinking about the actual machine learning process. We'll get into a lot more detail next week. But for now, let's think about a particular training iteration. We'll want to use our existing network to step forward into the game. This will produce a certain "reward", and leave the game state in a new position. Then we'll get more values for our next moves out of that position. We'll use the updated move scores and the reward to learn better values for our function weights.
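Concretely, the heart of the update is the Q-Learning "target" value: the reward we just observed, plus a discounted estimate of the best score available from the new position. Here's a minimal sketch of that calculation, assuming a discount factor gamma; the names here are placeholders rather than repository code.

import qualified Data.Vector as V

-- The target value for the move we just made: the immediate reward plus
-- the discounted best score our network gives the resulting position.
learningTarget :: Float -> Float -> V.Vector Float -> Float
learningTarget gamma reward nextScores =
  reward + gamma * V.maximum nextScores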
Of course, the immediate "reward" values for most moves will be 0. The only moves that will carry a reward will be those where we either win the game or lose the game. So it could take a while for our program to learn good behaviors. It will take time for the "end" behaviors of the game to affect normal moves. For this reason, we'll start our training on much smaller mazes than the primary game. This should help speed up the training process.
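As a first pass, the reward function could look something like this. The exact values here are just placeholders, and we could fold in a small penalty for the invalid moves mentioned earlier:

-- Hypothetical reward assignment: only terminal states carry a reward.
moveReward :: World -> Float
moveReward w = case worldResult w of
  GameWon  -> 1.0
  GameLost -> (-1.0)
  _        -> 0.0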
Conclusion
Next week, we'll take our general framework for Q-Learning and apply it within Tensor Flow. We'll get the basics of Q-Learning down with a couple different types of models. For a wider perspective on Haskell and AI problems, make sure to check out our Haskell AI Series!