Tweaks, Fixes, and Some Results

tweaks.jpg

In last week's episode of this AI series, we added random exploration to our algorithm. This helped us escape certain "traps" and local minimums in the model that could keep us rooted in bad spots. But it still didn't improve results too much.

This week we'll explore a couple more ways we can fix and improve our algorithm. For the first time, see some positive outcomes. Still, we'll find our approach still isn't great.

To get started with Tensor Flow and Haskell, download our guide! It's a complex process so you'll want some help! You should also check out our Haskell AI Series to learn more about why Haskell is a good choice as an AI language!

Improvements

To start out, there are a few improvements we can make to how we do q-learning. Let's recall the basic outline of running a world iteration. There are three steps. We get our "new" move from the "input" world. Then we apply that move, and get our "next" move against the "next" world. Then we use the possible reward to create our target actions, and use that to train our model.

runWorldIteration model = do
  (prevWorld, _, _) <- get

  -- Get the next move on the current world (with random chance)
  let inputWorldVector = … -- vectorize prevWorld
  currentMoveWeights <- lift $ lift $
    (iterateWorldStep model) inputWorldVector
  let bestMove = moveFromOutput currentMoveWeights
  let newMove = chooseRandomMoveWithChance …

  -- Get the next world using this move, and produce our next move
  let nextWorld = stepWorld newMove prevWorld
  let nextWorldVector = vectorizeWorld nextWorld
  nextMoveVector <- lift $ lift $
    (iterateWorldStep model) nextWorldVector

  -- Use these to get "target action values" and use them to train!
  let (bestNextMoveIndex, maxScore) =
          (V.maxIndex nextMoveVector, V.maximum nextMoveVector)
  let targetActionData = encodeTensorData (Shape [10, 1]) $
          nextMoveVector V.//
            [(bestNextMoveIndex, newReward + maxScore)]
  lift $ lift $ (trainStep model) nextWorldVector targetActionData

There are a couple issues here. First, we want to substitute based on the first new move, not the later move. We want to learn from the move we are taking now, since we assess its result now. Thus we want to substitute for that index. We'll re-write our randomizer to account for this and return the index it chooses.

Next, when training our model, we should the original world, instead of the next world. That is, we want inputWorldVector instead of nextWorldVector. Our logic is this. We get our "future" action, which accounts for the game's reward. We want our current action on this world should be more like the future action. Here's what the changes look like:

runWorldIteration model = do
  (prevWorld, _, _) <- get

  -- Get the next move on the current world (with random chance)
  let inputWorldVector = … -- vectorize prevWorld
  currentMoveWeights <- lift $ lift $
    (iterateWorldStep model) inputWorldVector
  let bestMove = moveFromOutput currentMoveWeights
  let (newMove, newMoveIndex) = chooseRandomMoveWithChance …

  -- Get the next world using this move, and produce our next move
  let nextWorld = stepWorld newMove prevWorld
  let nextWorldVector = vectorizeWorld nextWorld
  nextMoveVector <- lift $ lift $
    (iterateWorldStep model) nextWorldVector

  -- Use these to get "target action values" and use them to train!
  let maxScore = V.maximum nextMoveVector
  let targetActionData = encodeTensorData (Shape [10, 1]) $
          nextMoveVector V.//
            [(newMoveIndex, newReward + maxScore)]
  lift $ lift $ (trainStep model) inputWorldVector targetActionData

Another change we can make is to provide some rewards based on whether the selected move was legal or not. To do this, we'll need to update the stepWorld game API to return this boolean value:

stepWorld :: PlayerMove -> World -> (World, Bool)

Then we can add a small amount (0.01) to our reward value if we get a legal move, and subtract this otherwise.

As a last flourish, we should also add a timeout condition. Our next step will be to test on simple mazes that have no enemies. This means we'll never get eaten, so we need some loss condition if we get stuck. This timeout condition should have the same negative reward as losing.

Results

Now that we've made some improvements, we'll train on a very basic maze that's only 5x5 and has no walls and no enemies. Whereas we used to struggle to even finish this maze, we now achieve the goal a fair amount of the time. One of our training iterations achieved the goal around 2/3 of the time.

However, our bot is still useless against enemies! It loses every time if we try to train from scratch on a map with a single enemy. One attempt to circumvent this is to first train our weights to solve the empty maze. Then we can start with these weights as we attempt to avoid the enemy. That way, we have some pre-existing knowledge, and we don't have to learn everything at once. Still though, it doesn't result in much improvement. Typical runs only succeeded 40-50 times out of 2000 iterations.

Limiting Features

One conclusion we can draw is that we actually have too many features! Our intuition is that a larger feature set would take more iterations to learn. If the features aren't chosen carefully, they'll introduce noise.

So instead of tracking 8 features for each possible direction of movement, let's stick with 3. We'll see if the enemy is on the location, check the distance to the end, and count the number of nearby enemies. When we do this, we get comparable results on the empty maze. But when it comes to avoiding enemies, we do a little better, surviving 150-250 iterations out of 2000. These statistics are all very rough, of course. If we wanted a more thorough analysis, we'd use multiple maze configurations and a lot more runs using the finalized weights.

Conclusions

We can't draw too many conclusions from this yet. Our model is still failing to solve simple versions of our problem. It's quite possible that our model is too simplistic. After all, all we're doing is a simple matrix multiplication on our features. In theory, this should be able to solve the problem, but it may take a lot more iterations. The results stream we see also suggests local minimums are a big problem. Logging information reveals that we often die in the same spot in the maze many times in a row. The negative rewards aren't enough to draw us out, and we are often relying on random moves to find better outcomes.

So next week we're going to start changing our approach. We'll explore a way to introduce supervised learning into our process. This depends on "correct" data. We'll try a couple different ways to get that data. We'll use our own "human" input, as well as the good AI we've written in the past to solve this problem. All we need is a way to record the moves we make! So stay tuned!