Using Our Data with Supervised Learning

Our aim these last couple weeks has been to try a supervised learning approach to our game. In last week's article we gathered training data for playing the game. We had two different sources. First, we played the game ourselves and recorded our moves. Second, we let our AI play the game and recorded it. This gave us a few CSV files. Each line in these is a record containing the 40 "features" of the game board at the time and the move we chose.

This week, we're going to explore how to build a machine-learning agent based on this data, using supervised learning techniques. Look at the supervised-learning branch on our GitHub repository for more details.

To get started with Haskell and TensorFlow, download our Haskell Tensor Flow Guide. This library is a little tricky to work with, so you want to make sure you know what you're doing!

Defining Our Model

For our supervised model, we're going to use a fully connected neural network with a single hidden layer. We'll have 40 input features, 100 hidden units, and then our 10 output values for the different move scores. We'll be following a very similar pattern to one we explored in this older article, using the basic Iris data set. We'll copy a lot of code from that article. We won't go over a lot of the helper code in this article, so feel free to check out that one for some help with that!

We define each layer with a "weights" matrix and a "bias" vector. We multiply the input by the weights and then add the bias vector. Let's explore how we can build a single layer of the network. This will take the input and output size, as well as the input tensor. It will have three results: one variable for the weights, one for the biases, and then a final "output" tensor:

buildNNLayer :: Int64 -> Int64 -> Tensor v Float
  -> Build (Variable Float, Variable Float, Tensor Build Float)

The definition is pretty simple. We'll initialize random variables for the weights and bias. We'll produce the result tensor by multiplying by the weights and adding the bias.

buildNNLayer :: Int64 -> Int64 -> Tensor v Float
  -> Build (Variable Float, Variable Float, Tensor Build Float)
buildNNLayer inputSize outputSize input = do
  weights <- truncatedNormal (vector [inputSize, outputSize])
    >>= initializedVariable
  bias <- truncatedNormal (vector [outputSize])
    >>= initializedVariable
  let results = (input `matMul` readValue weights)
        `add` readValue bias
  return (weights, bias, results)
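
To make the arithmetic of a layer concrete, here is a minimal pure-Haskell sketch of the same "multiply by the weights, add the bias" computation. It uses plain lists instead of tensors and handles one input row rather than a batch. The names Vector, Matrix and denseLayer are hypothetical and are not part of the TensorFlow library or our repository:

```haskell
-- A pure-Haskell sketch of what one fully connected layer computes.
-- Plain lists stand in for tensors; these names are illustrative only.
import Data.List (transpose)

type Vector = [Float]
type Matrix = [[Float]]  -- one inner list (row) per input feature

-- Multiply a single input row by the weights, then add the bias.
-- The weights have inputSize rows and outputSize columns.
denseLayer :: Matrix -> Vector -> Vector -> Vector
denseLayer weights bias input =
  zipWith (+) bias
    [ sum (zipWith (*) input column) | column <- transpose weights ]

main :: IO ()
main = do
  let weights = [[1, 2], [3, 4], [5, 6]]  -- 3 inputs, 2 outputs
      bias    = [0.5, -0.5]
  print (denseLayer weights bias [1, 0, 1])  -- prints [6.5,7.5]
```

In the real model the weights start out random and the input is a whole batch of rows, but each output row is computed in this same way.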

Now that we understand the layers a little better, it's easier to define our model. First, we'll want to include both sets of weights and biases in the model, so we can output them later:

data Model = Model
  { w1 :: Variable Float
  , b1 :: Variable Float
  , w2 :: Variable Float
  , b2 :: Variable Float
  ...
  }

Now we want two different "steps" we can run. The training step will take a batch of data and determine what our network produces for the inputs. It will compare our network's output with the expected output. Then it will train to minimize the loss function. The error rate step will simply produce the error rate on the given data. That is, it will tell us what percentage of the moves we are getting wrong. Both of these will be Session actions that take two inputs: first the TensorData for the features, and then the TensorData for the correct moves:

data Model = Model
  { 
  ... -- Weights and biases
  , train :: TensorData Float
          -> TensorData Int64
          -> Session ()
  , errorRate :: TensorData Float
              -> TensorData Int64
              -> Session (V.Vector Float) -- Produces a single number
  }

Let's see how we put this all together.

Building Our Model

To start, let's make placeholders for our input features and expected output results. A dimension of -1 means we can provide any size we like:

createModel :: Build Model
createModel = do
  let batchSize = -1
  (inputs :: Tensor Value Float) <-
    placeholder [batchSize, moveFeatures]
  (outputs :: Tensor Value Int64) <-
    placeholder [batchSize]
  ...

Now we build the layers of our neural network using our helper. We'll apply relu, an activation function, on the results of our hidden layer. This helps our model deal with interaction effects and non-linearities:

createModel :: Build Model
createModel = do
  ...
  (hiddenWeights, hiddenBiases, hiddenResults) <-
    buildNNLayer moveFeatures hiddenUnits inputs
  let rectifiedHiddenResults = relu hiddenResults
  (finalWeights, finalBiases, finalResults) <-
    buildNNLayer hiddenUnits moveLabels rectifiedHiddenResults
  ...
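
As a small illustration of the activation step, relu replaces every negative value with zero and leaves positive values untouched. The sketch below applies it to a plain list standing in for the hidden layer's results. The function here is a hypothetical list-based stand-in, not the TensorFlow operation of the same name:

```haskell
-- relu clamps negative activations to zero and keeps positive ones.
-- This is a list-based stand-in for the tensor operation.
relu :: Float -> Float
relu = max 0

main :: IO ()
main = print (map relu [-2.0, -0.5, 0.0, 1.5])  -- prints [0.0,0.0,0.0,1.5]
```

Without a step like this between them, two stacked layers would still only compute a linear function of the inputs, since a linear map of a linear map is itself linear.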

Now to get our error rate, we need a couple of steps. We'll get the best move from each predicted result using argMax. We can then compare these to the expected moves using equal. By using reduceMean we'll get the fraction of our moves that match. Subtracting this from 1 gives our error rate:

createModel :: Build Model
createModel = do
  ...
  (actualOutput :: Tensor Value Int64) <- render $
    argMax finalResults (scalar (1 :: Int64))
  let (correctPredictions :: Tensor Build Float) = cast $
        equal actualOutput outputs
  (errorRate_ :: Tensor Value Float) <- render $
    1 - (reduceMean correctPredictions)
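
The same three steps (take the best-scoring move, compare it with the expected move, average the matches) can be sketched on plain lists. This is only an illustration of the arithmetic. The names argMaxIndex and errorRate are hypothetical, the batch is assumed to be non-empty, and ties between scores are not handled in any particular way:

```haskell
import Data.List (maximumBy)
import Data.Ord (comparing)

-- Index of the largest score in one row of predictions.
-- Assumes the row is non-empty.
argMaxIndex :: [Float] -> Int
argMaxIndex scores = fst (maximumBy (comparing snd) (zip [0 ..] scores))

-- Fraction of rows whose best-scoring index differs from the label.
-- Assumes a non-empty batch.
errorRate :: [[Float]] -> [Int] -> Float
errorRate predictions expected =
  let matches = zipWith
        (\scores label -> if argMaxIndex scores == label then 1 else 0)
        predictions expected
  in 1 - sum matches / fromIntegral (length matches)

main :: IO ()
main =
  -- Predicted moves are 1, 0, 1, 0; three of the four match the labels.
  print (errorRate [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
                   [1, 0, 0, 0])  -- prints 0.25
```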

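Now we need our training step. We'll convert the expected moves into one-hot vectors and compare them with our network's output. This involves the softmaxCrossEntropyWithLogits function, whose per-example losses we average to get the overall loss. We train our model by selecting our variables for training, and using minimizeWith. This will update the variables to reduce the value of the loss function: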

createModel :: Build Model
createModel = do
  ...
  let outputVectors = oneHot outputs (fromIntegral moveLabels) 1 0
  let loss = reduceMean $ fst $
        softmaxCrossEntropyWithLogits finalResults outputVectors
  let params =
        [hiddenWeights, hiddenBiases, finalWeights, finalBiases]
  train_ <- minimizeWith adam loss params
  ...
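
For intuition about what this loss measures, here is a rough pure-Haskell sketch of softmax cross entropy against a one-hot target, averaged over a batch. It is an illustration under our own assumptions, not the library's implementation. The names logSoftmax, oneHotVector, crossEntropy and meanLoss are hypothetical:

```haskell
-- Log of the softmax probabilities, computed in a numerically
-- stable way by shifting the scores so the largest one is zero.
logSoftmax :: [Float] -> [Float]
logSoftmax logits =
  let shifted   = map (subtract (maximum logits)) logits
      logSumExp = log (sum (map exp shifted))
  in map (subtract logSumExp) shifted

-- A vector of the given size with 1 at the label and 0 elsewhere.
oneHotVector :: Int -> Int -> [Float]
oneHotVector size label =
  [ if i == label then 1 else 0 | i <- [0 .. size - 1] ]

-- Cross entropy between the target and the softmax of the scores.
crossEntropy :: [Float] -> [Float] -> Float
crossEntropy logits target =
  negate (sum (zipWith (*) target (logSoftmax logits)))

-- Mean loss over a non-empty batch of score rows and labels.
meanLoss :: Int -> [[Float]] -> [Int] -> Float
meanLoss labelCount batchLogits labels =
  let losses = zipWith
        (\logits label ->
           crossEntropy logits (oneHotVector labelCount label))
        batchLogits labels
  in sum losses / fromIntegral (length losses)

main :: IO ()
main = print (meanLoss 3 [[2, 1, 0], [0, 0, 0]] [0, 2])
```

The loss is small when the network gives the correct move a much higher score than the others, and it grows as the correct move's score falls behind. When every move gets the same score, the loss for one example is the log of the number of moves.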

We conclude by creating our functions. These take the tensor data as parameters. Then they use runWithFeeds to put the data into our placeholders:

createModel :: Build Model
createModel = do
  ...
  return $ Model
    { train = \inputFeed outputFeed ->
        runWithFeeds
          [ feed inputs inputFeed
          , feed outputs outputFeed
          ]
          train_
    , errorRate = \inputFeed outputFeed ->
        runWithFeeds
          [ feed inputs inputFeed
          , feed outputs outputFeed
          ]
          errorRate_
    , w1 = hiddenWeights
    , b1 = hiddenBiases
    , w2 = finalWeights
    , b2 = finalBiases
    }

Running Our Tests

Now let's run our tests. We'll read the move records from the file, shuffle them, and set aside a portion as our test set. Then we'll build our model:

runTraining totalFile = runSession $ do
  initialRecords <- liftIO $ readRecordFromFile totalFile
  shuffledRecords <- liftIO $ shuffleM (V.toList initialRecords)
  let testRecords = V.fromList $ take 2000 shuffledRecords
  let trainingRecords = V.fromList $ drop 2000 shuffledRecords
  model <- build createModel
  ...
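
The code above holds back a fixed 2000 records. If you would rather hold back a proportion of the data, the split can be sketched as below. This assumes the list has already been shuffled, and holdOut is a hypothetical helper that does not appear in our repository:

```haskell
-- Split an already-shuffled list into (test, training) parts,
-- holding back the given fraction of the records for testing.
holdOut :: Double -> [a] -> ([a], [a])
holdOut testFraction records =
  let testCount = round (testFraction * fromIntegral (length records))
  in splitAt testCount records

main :: IO ()
main = do
  let (testSet, trainingSet) = holdOut 0.2 [1 .. 10 :: Int]
  print (length testSet, length trainingSet)  -- prints (2,8)
```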

Then we run our iterations (we'll do 50000, as an example). We select some random records (100 per batch), and then convert them to data. Then we run our train step. Finally, every 100 iterations, we'll get a gauge of the training error on the current batch. This involves the errorRate step. Note our error rate returns a vector with a single wrapped value. So we need to unwrap it with V.! at index 0.

runTraining totalFile = runSession $ do
  ...
  forM_ ([0..50000] :: [Int]) $ \i -> do
    trainingSample <- liftIO $ chooseRandomRecords trainingRecords
    let (trainingInputs, trainingOutputs) =
          convertRecordsToTensorData trainingSample
    (train model) trainingInputs trainingOutputs
    when (i `mod` 100 == 0) $ do
      err <- (errorRate model) trainingInputs trainingOutputs
      liftIO $ putStrLn $
        (show i) ++ " : current error " ++ show ((err V.! 0) * 100)

Now to run the final test, we use the errorRate step again, this time on our test data:

runTraining totalFile = runSession $ do
  ...

  -- Testing
  let (testingInputs, testingOutputs) =
        convertRecordsToTensorData testRecords
  testingError <- (errorRate model) testingInputs testingOutputs
  liftIO $ putStrLn $
    "test error " ++ show ((testingError V.! 0) * 100)

Results

When it comes to testing our system, we should use proper validation techniques. We want a model that will represent our training data well. But it should also generalize well to other reasonable examples. If our model represents the training data too well, we're in danger of "overfitting" our data. To check this, we'll hold back roughly 20% of the data. This will be our "test" data. We'll train our model on the other 80% of the data. Every 100 steps or so, we print out the training error on that batch of data. We hope this figure drops. But then at the very end, we'll run the model on the other 20% of the data, and we'll see what the error rate is. This will be the true test of our system.

A sign of overfitting is a training error that sits well below the testing error. When training on human moves for 50000 iterations, the training error drops to the high teens and low 20's. But the test error is often still close to 50%. This suggests we shouldn't be training quite as long.

The AI moves provide a little more consistency though. The training error seems to stabilize around the mid 20's and low 30's, and we end up with a test error of about 34%.

Conclusion

Our error rate isn't terrible. But it's not great either. And worse, testing shows it doesn't appear to capture the behaviors well enough to win the game. A case like this suggests our model isn't sophisticated enough to capture the problem. It could also suggest our data is too noisy, and the patterns we hoped to find aren't there. The feature set we have might not capture all the important information about the graph.

For our final look at this problem, we're going to try a new serialization technique. Instead of deriving our own features, we're going to serialize the entire game board! The "feature space" will be much, much larger now. It will include the structure of the graph and information about enemies and drills. This will call for a more sophisticated model. A pure fully connected network will take a long time to learn things like how walls allow moves or not. A big drawback of this technique is that it will not generalize to arbitrary mazes. It will only work for a certain size and number of enemies. But with enough training time we may find that interesting patterns emerge. So come back next week to see how this works!