Commit fab3f0de authored by Harry Pigot
# Reinforcement Learning with an Inverted Pendulum using OpenAI Gym
Here's a quick overview of the outcomes. Details and instructions are given in the Jupyter notebook.
Remaining tasks:
- save plots and best models
- rerun and save the discrete training results plots
- rename files and folders (update image save locations)
- GitLab repo
## Discrete Action Inverted Pendulum Environment
The `CartPole-v1` environment gives a reward of 1 for every step that the pendulum stays upright (within ±12 degrees) and the cart stays visible in the simulation (position within ±2.4). I took an "extended rewards" approach, adding penalties for the pendulum angle and cart position.
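The exact penalty terms aren't spelled out here; one plausible shaping, with hypothetical weights and the standard `CartPole-v1` termination thresholds, looks like this:

```python
def extended_reward(obs, base_reward=1.0, angle_weight=1.0, pos_weight=0.5):
    """Extend the base reward with angle and position penalties.

    obs = [cart position, cart velocity, pole angle (rad), pole angular velocity].
    The weights here are assumptions, not the values used in the notebook.
    """
    x, _, theta, _ = obs
    angle_penalty = angle_weight * abs(theta) / 0.2095  # 12 deg in radians
    pos_penalty = pos_weight * abs(x) / 2.4             # position threshold
    return base_reward - angle_penalty - pos_penalty
```

Normalizing each penalty by its termination threshold keeps both terms on a comparable scale, so a single weight per term is enough to tune the trade-off.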
Four models were compared:
- NN Basic: trained to replicate a basic policy (move in the direction the pendulum is tilting)
- PG 500: trained using policy gradient with a 500-step simulation limit
- PG 200 Extended: trained using policy gradient with extended rewards and a 200-step limit
- PG 500 Extended: trained using policy gradient with extended rewards and a 500-step limit

Training with extended rewards took fewer iterations and yielded a higher mean step count.
The best policy found during training was saved and evaluated for the extended rewards models, but not for "PG 500" so the comparison isn't entirely fair. The extended rewards policy gave the best performance.
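The basic policy that "NN Basic" imitates fits in a few lines (a sketch against the standard `CartPole-v1` observation layout):

```python
def basic_policy(obs):
    """Push the cart in the direction the pendulum is tilting.

    obs[2] is the pole angle in radians; action 0 pushes the cart
    left, action 1 pushes it right.
    """
    angle = obs[2]
    return 0 if angle < 0 else 1
```

This hand-written policy keeps the pendulum up only briefly, which is what makes it a useful baseline target for the imitation-trained network.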
## Continuous Action Inverted Pendulum Environment
Continuing with the extended rewards, I modified the Policy Gradient algorithm to train policies for continuous outputs.
During training, the model outputs what it believes is the `best_action`, then chooses between `best_action` and a random action, picking randomly with probability `epsilon`. As in the discrete case, the chosen action is treated as the target (correct) action and compared to the model's `best_action` output to calculate the loss:
```python
best_action = model(obs[np.newaxis])
if tf.random.uniform([1, 1]) < epsilon:  # take a random action
    action = tf.random.uniform([1, 1], minval=-1, maxval=1)
else:  # truncate best_action to the action space and use it
    action = tf.math.maximum(env.action_space.low, best_action)
    action = tf.math.minimum(env.action_space.high, action)
loss = tf.reduce_mean(loss_fn(action, best_action))  # action taken used as "target" model output
```
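That loss only becomes a policy gradient once each step's gradients are weighted by discounted, normalized episode returns. A minimal sketch of that reward processing, with an assumed discount factor (the notebook's value may differ):

```python
import numpy as np

def discount_rewards(rewards, gamma=0.95):
    """Propagate each reward backward through the episode with factor gamma."""
    discounted = np.array(rewards, dtype=np.float64)
    for step in range(len(rewards) - 2, -1, -1):
        discounted[step] += gamma * discounted[step + 1]
    return discounted

def discount_and_normalize(all_rewards, gamma=0.95):
    """Discount each episode, then normalize across all episodes in the batch."""
    all_discounted = [discount_rewards(r, gamma) for r in all_rewards]
    flat = np.concatenate(all_discounted)
    mean, std = flat.mean(), flat.std()
    return [(d - mean) / std for d in all_discounted]
```

Normalizing across the whole update batch means actions from better-than-average episodes get positive weights and the rest get negative ones, which is what pushes the policy toward the former.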
Training progress was noisy and the results were very sensitive to the random seed. Updating with a policy gradient step every 100 episodes and every 10 episodes both yielded good policies, but the 10-episode updates showed even noisier training progress. The best-performing model found was saved and evaluated.
Moving from 5 to 10 neurons improved performance, though results remained sensitive to the random seed.
The two best models reached the maximum of 500 steps for all 500 episodes. They showed comparable results when random actions were added as disturbances (`epsilon` = 0.3), with the 100 episodes/update model having fewer extreme outliers.
Watching the simulations shows that the 100 episodes/update model tends to use slightly less aggressive control actions.
Without noise in the environment, the models are able to get the pendulum into a perfect upright equilibrium. It would be interesting to look at their response to noise in the environment, delays, and step disturbances in pendulum angle.
# gym-cartpole
A continuous cart-pole environment based on OpenAI Gym's `CartPole-v1` and Ian Danforth's `ContinuousCartPoleEnv`.
To set up, activate the relevant virtual environment (e.g. `tf2`), navigate to the directory containing the `gym-cartpole` directory, and run `pip install -e gym-cartpole`.
Then, within your python project, create an instance of the environment with `gym.make('gym_cartpole:CartPoleCont-v0')`.