From World to Environment: OpenAI Gym Primer
In last week's article, we briefly entered the world of Haskell's Gloss library and illustrated our search algorithm in action. An integral part of this process was creating and using a particular World type to store information about the state of our game and process updates.
This week we'll discuss the OpenAI Gym. This framework is widely used to help people learn about AI algorithms and how to train them using machine learning techniques. It has its own concept of an "environment" that is very similar to our World idea. It's worth comparing these concepts, and it's especially fun to consider how to re-create the "environment" in Haskell. This is a (somewhat) novel area where type families can help us out. So read on to learn how!
You can see all the code for this series on GitHub! For this article, you should look at the Environment module. This article is also available as a video on YouTube!
Review of World Type
Let's recall our World type from last time:
data World = World
  { playerLocation :: Location
  , startLocation :: Location
  , endLocation :: Location
  , worldGrid :: Grid
  }
This type stores all the information, both mutable and immutable, about our game. It tells us the grid that we are navigating, as well as the current location of the "player", which can change.
Our game, then, is largely defined by functions that operate on this world:
play :: Display -> Color -> Int
  -> world
  -> (world -> Picture)
  -> (Event -> world -> world)
  -> (Float -> world -> world)
  -> IO ()
drawingFunc :: World -> Picture
inputHandler :: Event -> World -> World
updateFunc :: Float -> World -> World
We require a function to draw our world, a function to change the world based on user actions, and a function to let the world evolve naturally. And, of course, we need to create our initial world in order to start the game off.
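Putting it all together, a minimal sketch of the wiring might look like this. The window settings and the stub functions here are placeholders, not the real versions from last week's article:

import Graphics.Gloss.Interface.Pure.Game

-- A minimal sketch of wiring these functions into Gloss's play.
-- The real World, drawing, input, and update functions come from
-- last week's article; trivial stand-ins are used here.
main :: IO ()
main = play
  (InWindow "Maze" (625, 625) (10, 10)) -- hypothetical window settings
  white                                 -- background color
  20                                    -- simulation steps per second
  initialWorld
  drawingFunc
  inputHandler
  updateFunc
  where
    initialWorld = ()      -- stand-in for the real World
    drawingFunc _ = Blank  -- the real version draws the grid
    inputHandler _ w = w   -- the real version moves the player
    updateFunc _ w = w     -- the real version evolves the world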
OpenAI Gym Environment
Now let's compare that to some code from the OpenAI Gym. Here's a Python snippet you can find on the OpenAI Gym website:
import gym
env = gym.make("CartPole-v1")
observation = env.reset()
for _ in range(1000):
    env.render()
    action = env.action_space.sample()  # Takes a random action
    observation, reward, done, info = env.step(action)
    if done:
        observation = env.reset()
env.close()
Let's note how this environment is used:
- We create it ("make") and can "reset" it. Resetting it produces an "observation".
- The environment has an "action space": a set of available actions.
- We can "step" the environment forward using one of these actions, producing a new observation, a reward, and a boolean indicating whether we're "done".
- We can "render" the environment.
Like our World, an environment clearly stores all the information about a game. But we also have a subset of this information that we refer to as the "observation". Generally speaking, this is the information a player of the game actually has access to, and it is the mutable part of the state.
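To put that in our maze game's terms: the World is the full environment state, and a hypothetical observation type might expose only the player's mutable location.

-- Hypothetical: carving an "observation" out of our World type.
-- Only the player's location changes and is visible to the player.
newtype MazeObservation = MazeObservation Location

observe :: World -> MazeObservation
observe = MazeObservation . playerLocation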
Next, we "step" the game forward using one of the possible actions at that point. This is a combination of the input handler and update function from our Gloss example. When we step forward, we usually impact the world with our action, but the world also goes through its own natural evolution. This produces a new observation on which we'll base our next action.
We also see a "reward" as a result of each action. This is something our Gloss World had no equivalent for. Rewards are very important in training any kind of algorithm using machine learning. They're how we tell our program what a "good" move is.
And of course, it's useful to be able to render or draw our environment, though this isn't strictly necessary for the game's logic.
Making a Haskell Environment
There's a distinct drawback to using Python, though: the types of several of our environment-related expressions above are unclear! What information, exactly, is stored in the environment? What does an "action" look like, or an "observation"? In very basic games, both the action and the observation can be simple integers, but it's tricky to make heads or tails of code like that.
So let's consider what an "Environment" with this kind of API would look like in Haskell. We're tempted, of course, to make this a specific type. But we don't have type-level inheritance in Haskell, and we'll want to create some kind of pattern that different games can inherit from. So it's actually better to make this a typeclass. And, since different games will need different side effects, we'll make it a monadic typeclass:
class (Monad m) => EnvironmentMonad m where
  ...
And this is where the fun begins! Each different game environment will have its own types associated with it, corresponding to the environment state, an action, and an observation. So we can use type family syntax to associate these types with the class variable m:
class (Monad m) => EnvironmentMonad m where
  type Observation m :: *
  type Action m :: *
  type EnvironmentState m :: *
  ...
We can use these types within our other class functions as well. For example, we should be able to produce the current observation. And given an observation, we should be able to describe the available actions. We should also be able to reset the environment.
class (Monad m) => EnvironmentMonad m where
  type Observation m :: *
  type Action m :: *
  type EnvironmentState m :: *
  currentObservation :: m (Observation m)
  possibleActions :: Observation m -> m [Action m]
  resetEnv :: m (Observation m)
  ...
Finally, we need two more items. First, our "step" function. This takes an action as input, and it produces a new observation, a reward, and a boolean indicating whether we are done. The last item is more Haskell-specific: a "run" function. It allows us to take an action in our monad, combined with the environment state, and turn it into a normal IO action we can run elsewhere.
newtype Reward = Reward Double

class (Monad m) => EnvironmentMonad m where
  type Observation m :: *
  type Action m :: *
  type EnvironmentState m :: *
  currentObservation :: m (Observation m)
  possibleActions :: Observation m -> m [Action m]
  resetEnv :: m (Observation m)
  stepEnv :: (Action m) -> m (Observation m, Reward, Bool)
  runEnv :: (EnvironmentState m) -> m a -> IO a
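To make this concrete, here's a hypothetical instance for a toy game that isn't part of our series: the player walks along a number line, starting at 0, and the game ends with a reward upon reaching position 5. All the names here (LineWalk and friends) are illustrative only.

{-# LANGUAGE TypeFamilies #-}
{-# LANGUAGE GeneralizedNewtypeDeriving #-}

import Control.Monad.IO.Class (MonadIO)
import Control.Monad.State (StateT, evalStateT, get, put)

-- A toy environment over a StateT holding just the player's position.
-- (We derive MonadIO now so we can render this environment later.)
newtype LineWalk a = LineWalk (StateT Int IO a)
  deriving (Functor, Applicative, Monad, MonadIO)

instance EnvironmentMonad LineWalk where
  type Observation LineWalk = Int      -- the player's position
  type Action LineWalk = Int           -- a step of -1 or +1
  type EnvironmentState LineWalk = Int -- the position is the whole state
  currentObservation = LineWalk get
  possibleActions _ = return [-1, 1]
  resetEnv = LineWalk (put 0 >> return 0)
  stepEnv a = LineWalk $ do
    newPos <- (+ a) <$> get
    put newPos
    -- Reward the player and end the game at position 5.
    return $ if newPos >= 5
      then (newPos, Reward 1.0, True)
      else (newPos, Reward 0.0, False)
  runEnv initialState (LineWalk action) = evalStateT action initialState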
If we are interested in rendering our environment, we can make a new typeclass that inherits from our base class. It should also require MonadIO, because any kind of rendering will involve IO.
class (MonadIO m, EnvironmentMonad m) => RenderableEnvironment m where
  renderEnv :: m ()
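Continuing the toy LineWalk example, "rendering" can just print the player's position (this is why we derived MonadIO above):

import Control.Monad.IO.Class (liftIO)

instance RenderableEnvironment LineWalk where
  renderEnv = do
    pos <- currentObservation
    liftIO $ putStrLn ("Player is at position " ++ show pos)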
Using this class, we can write some generic code that will work on any game! Here are a couple of loop functions. The first will work on any environment, though it requires us to supply our own function to choose an action. This is really the "brain" of the game, which we'll get into more next time!
gameLoop ::
  (EnvironmentMonad m) => m (Action m) -> m (Observation m, Reward)
gameLoop chooseAction = do
  newAction <- chooseAction
  (newObs, reward, done) <- stepEnv newAction
  if done
    then return (newObs, reward)
    else gameLoop chooseAction
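As a hypothetical usage, here's how we could run gameLoop against the toy LineWalk instance above, with the simplest possible "brain":

-- The simplest possible action-chooser: always move right.
alwaysRight :: LineWalk Int
alwaysRight = return 1

main :: IO ()
main = do
  (finalObs, Reward r) <- runEnv 0 (resetEnv >> gameLoop alwaysRight)
  putStrLn $ "Finished at " ++ show finalObs ++ " with reward " ++ show r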
And if we want to render our game each time, we can just use this separate constraint and add the extra render steps in between!
gameRenderLoop :: (RenderableEnvironment m) => m (Action m) -> m (Observation m, Reward)
gameRenderLoop chooseAction = do
  renderEnv
  newAction <- chooseAction
  (newObs, reward, done) <- stepEnv newAction
  if done
    then renderEnv >> return (newObs, reward)
    else gameRenderLoop chooseAction
Conclusion
So there are a lot of similarities between these two systems, though clearly the OpenAI Gym is a little more involved and detailed. But Haskell provides some interesting mechanisms for us to add more type-clarity around our environments.
Next week, we'll actually use this environment class and apply it to our simple Breadth-First-Search example. This will really get us started on the road to applying machine learning to this problem, so you won't want to miss out! Make sure to subscribe to Monday Morning Haskell so you can stay up to date with what's going on here!