Reinforcement Learning

This is part 4 of a 9-part series on Machine Learning. Our goal is to provide you with a thorough understanding of Machine Learning, the different ways it can be applied to your business, and how to begin implementing Machine Learning within your organization with the assistance of Untitled. This series is by no means limited to those with a technical pedigree. The objective is to provide content that is informative and practical for a wide array of readers. We hope you enjoy it, and please do not hesitate to reach out with any questions. To start from part 1, please click here.

What Is Reinforcement Learning?

Machine Learning can be broken out into three distinct categories: supervised learning, unsupervised learning, and reinforcement learning.

Reinforcement learning is the process by which a computer agent learns to behave in an environment that rewards its actions with positive or negative results. By “computer agent” we mean a program that acts autonomously, either on its own or on behalf of a user. This is a very different type of Machine Learning than supervised learning and unsupervised learning. However, it will probably feel the most familiar, because it resembles how humans learn. Of the three categories of Machine Learning, reinforcement learning most closely mimics human cognition.

Human Cognition

So how do humans learn? Let’s take a simple situation most of us probably encountered during childhood. Before we learned anything about a stove, it was just another object in the kitchen. Kitchens are filled with items and instruments that initially have little to no meaning for us. As children, we give these items meaning through interaction. Before we know what an item is for, we gauge whether it is a threat by interacting with it, and we categorize the experience as positive or negative based on the outcome.

For myself, I was one of the kids who learned that a stove is hot through touch. By touching the stove, I received a negative outcome from the interaction. The item harmed me, so I learned not to touch it. However, I also watched my parents use the stove without being harmed, and I learned that a stove is a useful tool, but only when handled with caution. I learned that the stove was hot, and not to touch it, through experience. This scenario provides a foundation for how a reinforcement learning algorithm works.

Reinforcement Learning: A Different Type of ML

One of the primary differences between a reinforcement learning algorithm and supervised or unsupervised learning algorithms is that, to train a reinforcement learning algorithm, the data scientist simply needs to provide an environment and a reward system for the computer agent. No prerequisite “training data” is required per se (think back to the financial lending example provided in post 2, supervised learning). The training is the experimental, iterative process of running the simulation over and over again to optimize the algorithm towards a desired result.

Let’s use an example of the game of Tic-Tac-Toe. The reinforcement algorithm loop in general looks like this: A virtual environment is set up. Each unique frame of reference is referred to as a state. Prior to any engagement with the environment, the state would be S0.

In each state, the algorithm chooses an action from a finite set of prespecified operations. Based on the consequences of that action, the algorithm is rewarded positively or negatively, and it retains this information for its next interaction with the state and for the next episode in the environment. This loop can run indefinitely or a finite number of times, depending on the type of reinforcement learning task.

Environment ( State n > Action n > Reward n +/-  > Repeat )
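To make the loop above concrete, here is a minimal Python sketch. The `CoinGuessEnv` class, its `reset` and `step` methods, and the `run_episode` helper are all hypothetical names invented for this illustration; they are not part of any particular library. The toy environment simply rewards the agent +1 for guessing a coin flip correctly and -1 otherwise.

```python
import random


class CoinGuessEnv:
    """Hypothetical toy environment: guess a coin flip for a fixed number of steps."""

    def __init__(self, steps=5, seed=0):
        self.steps = steps
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        """Start a new episode and return the initial state (S0)."""
        self.t = 0
        return self.t

    def step(self, action):
        """Apply an action (0 or 1) and return (next_state, reward, done)."""
        coin = self.rng.choice([0, 1])
        reward = 1 if action == coin else -1  # positive or negative reward
        self.t += 1
        done = self.t >= self.steps
        return self.t, reward, done


def run_episode(env, policy):
    """State n > Action n > Reward n > Repeat, until the episode ends."""
    state = env.reset()
    total_reward = 0
    done = False
    while not done:
        action = policy(state)                  # choose an action in this state
        state, reward, done = env.step(action)  # environment responds
        total_reward += reward                  # accumulate the reward
    return total_reward


# Usage: an agent that always guesses 0.
total = run_episode(CoinGuessEnv(steps=5), lambda state: 0)
```

A real agent would also update its policy from the rewards it receives; this sketch only shows the shape of the interaction loop.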

In Tic-Tac-Toe the environment is the game board, a three-by-three grid of squares, with the goal of connecting three X’s (or O’s) vertically, horizontally or diagonally. The agent places one X during its turn and must contend with its opponent placing O’s. The environment also contains the fixed set of operations that can be performed, i.e. the rules of the game.

State: The state can be thought of as a single frame within the environment, or a fixed moment in “time.” In the example of Tic-Tac-Toe, the first state (S0) is the empty board. State 1 is the board after the first move, State 2 is the board after the second move, etc.

Reward: In the game of Tic-Tac-Toe, the reward comes from winning the match by having three X’s in a row, either horizontally, vertically or diagonally. In practice, we would give our computer agent a reward of +1 for every match it wins and -1 for every match it loses. The goal of our computer agent is to maximize its expected cumulative reward (i.e. to win as many matches as possible).

Initially, our agent will probably be dismal at playing Tic-Tac-Toe compared to a human. However, over many matches, it will become a tough program to beat (more on computers beating humans at games later in the post).
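As a rough sketch of how the environment, state and reward described above could be represented in Python: the board is a tuple of nine squares, and all function names here (`winner`, `legal_actions`, `place`, `reward`) are made up for illustration. The +1 and -1 rewards follow the text; treating a draw or an unfinished game as 0 is our own assumption.

```python
# All eight ways to get three in a row on a 3x3 board (squares indexed 0..8).
WIN_LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]

# S0: the empty board, before any engagement with the environment.
S0 = (" ",) * 9


def winner(board):
    """Return 'X' or 'O' if that player has three in a row, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None


def legal_actions(board):
    """The rules of the game: a piece may only go on an empty square."""
    return [i for i, cell in enumerate(board) if cell == " "]


def place(board, index, mark):
    """Return the next state after placing a mark; the old state is unchanged."""
    if board[index] != " ":
        raise ValueError("square already taken")
    return board[:index] + (mark,) + board[index + 1:]


def reward(board, agent="X"):
    """+1 if the agent has won, -1 if it has lost, 0 otherwise (assumption)."""
    w = winner(board)
    if w is None:
        return 0
    return 1 if w == agent else -1


# Usage: X takes the top row while O plays two squares in the middle row.
state = S0
for index, mark in [(0, "X"), (3, "O"), (1, "X"), (4, "O"), (2, "X")]:
    state = place(state, index, mark)
final_reward = reward(state)
```

Using an immutable tuple for each state makes it easy to store states as dictionary keys later, which is convenient when the agent starts remembering what it has learned.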

Two Types of Reinforcement Learning Tasks

There are two types of tasks that reinforcement learning algorithms solve: episodic and continuous. Episodic tasks can be thought of as a single scenario with a clear end, such as a match of Tic-Tac-Toe. The computer agent runs the scenario, takes its actions, is rewarded for the outcome and then stops. The information from that episode is captured, and we then run the simulation again, this time equipped with more information.

We repeat this for each episode the computer agent participates in. Each new episode draws on information from the prior ones. Thus, the agent can be expected to get better at the game over time as it continually optimizes towards an outcome that produces the greatest cumulative reward.
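Here is one way the episodic idea could look in code, as a hedged sketch rather than a definitive implementation. The agent plays X against an opponent that moves at random, remembers a value for every board it has moved into, and after each match nudges those values towards the final result. The random opponent, the learning rate `alpha`, the 0 reward for a draw and every function name below are assumptions made for this example.

```python
import random

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return 'X' or 'O' if that player has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None


def open_squares(board):
    return [i for i in range(9) if board[i] == " "]


def place(board, i, mark):
    """Boards are 9-character strings; return the board after one move."""
    return board[:i] + mark + board[i + 1:]


def game_over(board):
    return winner(board) is not None or not open_squares(board)


def run_episode(values, rng, alpha=0.1):
    """Play one match as X, then update the value table from the outcome."""
    board = " " * 9
    visited = []  # boards the agent moved into during this episode
    while True:
        # Agent's turn: pick the move leading to the highest known value,
        # breaking ties at random. Unseen boards default to 0.0.
        options = open_squares(board)
        scores = [values.get(place(board, i, "X"), 0.0) for i in options]
        best = max(scores)
        choice = rng.choice([i for i, s in zip(options, scores) if s == best])
        board = place(board, choice, "X")
        visited.append(board)
        if game_over(board):
            break
        # Opponent's turn: a random legal move (assumption).
        board = place(board, rng.choice(open_squares(board)), "O")
        if game_over(board):
            break
    w = winner(board)
    outcome = 1.0 if w == "X" else (-1.0 if w == "O" else 0.0)
    # Carry information forward: move each visited board's value
    # a small step towards the result of this episode.
    for state in visited:
        old = values.get(state, 0.0)
        values[state] = old + alpha * (outcome - old)
    return outcome


def train(episodes=2000, seed=0):
    """Run many episodes, sharing one value table across all of them."""
    rng = random.Random(seed)
    values = {}
    results = [run_episode(values, rng) for _ in range(episodes)]
    return values, results


values, results = train()
```

The `values` dictionary is the "information captured" from earlier episodes. Comparing the average of the first few hundred entries in `results` with the last few hundred is a simple way to check whether the agent is improving.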

Continuous reinforcement tasks (also called continuing tasks) can be thought of as tasks that have no natural end point and keep running until we tell the computer agent to stop. A few examples of continuous tasks would be a reinforcement learning algorithm taught to trade in the stock market, or one taught to bid in the real-time bidding ad-exchange environment.

These environments have a constant stream of states, and our reinforcement learning algorithm can continue to optimize itself towards trading or bidding patterns that produce the greatest cumulative reward (e.g. money made, placements won at the lowest marginal cost, etc.). Initially, the algorithm might perform poorly compared to an experienced day trader or systematic bidder. However, over time and with enough experimentation, it may come to outperform humans in the process.
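Because a continuous task never ends, "cumulative reward" needs a little care: a plain sum over an endless stream can grow without bound. A common convention in reinforcement learning, which this post does not otherwise cover, is to weight later rewards by a discount factor (called `gamma` below). The sketch shows that idea, plus a running average that can be updated one reward at a time. The function names and the sample numbers are assumptions for illustration only.

```python
import itertools


def discounted_return(rewards, gamma=0.9):
    """Sum of rewards where the reward k steps ahead is weighted by gamma**k."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def running_average(reward_stream, max_steps):
    """Average reward over a stream, updated incrementally at every step.

    max_steps plays the role of us telling the agent to stop.
    """
    average = 0.0
    for n, r in enumerate(itertools.islice(reward_stream, max_steps), start=1):
        average += (r - average) / n
    return average


# Usage with made-up per-step rewards (e.g. profit or loss on each trade).
profits = [1.0, 1.0, 1.0]
value_now = discounted_return(profits, gamma=0.5)       # 1 + 0.5 + 0.25
average = running_average(iter([1.0, 2.0, 3.0]), max_steps=3)
```

With `gamma` close to 1 the agent cares about the long run, while a small `gamma` makes it focus on immediate rewards.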

Exploration vs. Exploitation

Reinforcement learning algorithms can be taught to exhibit one or both of two experimental learning styles: exploration and exploitation. Exploration is the process of the algorithm pushing its learning boundaries and assuming more risk in order to optimize towards a long-run learning goal. In Tic-Tac-Toe, this could take the form of running simulations that purposefully place pieces unconventionally to learn the outcome of a given move. Armed with a wider repertoire of maneuvers, the algorithm becomes a much fiercer opponent.

Exploitation is the process of the algorithm leveraging information it already has to perform well in an environment, with short-term optimization goals in mind. Back to the Tic-Tac-Toe example, the algorithm would repeat the moves it already knows to produce a reasonable probability of winning.

However, if it relies on exploitation alone, the ML algorithm will not develop beyond elementary sophistication. It is usually a hybrid of exploration and exploitation that produces the best algorithm, in the same way that a human must branch out of comfort zones to increase their breadth of learning while also cultivating what they already know to increase their depth of learning.
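One simple and widely used way to blend the two styles is an "epsilon-greedy" rule: with a small probability `epsilon` the agent explores by picking a random action, and the rest of the time it exploits the action it currently believes is best. The sketch below assumes the agent keeps its estimates in a plain dictionary; the function name and the example values are ours, not something prescribed by this series.

```python
import random


def epsilon_greedy(action_values, epsilon, rng=random):
    """Pick an action from a dict mapping action -> estimated value."""
    if rng.random() < epsilon:
        # Explore: try any action, even one that currently looks bad.
        return rng.choice(list(action_values))
    # Exploit: take the action with the highest estimated value.
    return max(action_values, key=action_values.get)


# Usage: made-up estimates for three opening moves in Tic-Tac-Toe.
estimates = {"corner": 0.6, "center": 0.8, "edge": 0.2}
rng = random.Random(0)

always_exploit = epsilon_greedy(estimates, epsilon=0.0, rng=rng)
mostly_exploit = [epsilon_greedy(estimates, epsilon=0.1, rng=rng) for _ in range(1000)]
```

With `epsilon=0.0` the agent only exploits, and with `epsilon=1.0` it only explores. A common practice is to start with a larger `epsilon` and shrink it as the agent gains experience.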


In October 2015, for the first time ever, a computer program named AlphaGo beat a Go professional at the game. Go is considered to be one of the most complex board games ever invented. The range of possibilities for laying pieces on the board, and of potential strategies, far exceeds that of a game like Chess. The A.I. researchers who brought AlphaGo to life did not rely solely on a supervised ML algorithm fed with large amounts of training data (e.g. pre-defined moves, potential game scenarios, etc.). Instead, they made reinforcement learning a core part of the approach: after an initial phase of learning from human expert games, AlphaGo played against itself over and over again to refine its understanding of the game. Once it had performed enough episodes, it began to compete against top Go players from around the world.

Reaching 9 dan, the highest professional rank in Go, can take a human a lifetime, with many professionals never crossing this threshold. However, AlphaGo, upon beating Mr. Lee Sedol (considered one of the best Go players of the last decade) in March 2016, was awarded an honorary 9-dan rank. This feat was a huge leap forward for the field of Machine Learning, and had strong implications for the future of A.I.

In Summary

We hope you enjoyed this post and will continue on to part 5, deep learning and neural networks. In the next post, we’ll be tying all three categories of Machine Learning together into a new and exciting field of data analytics. Deep learning uses ML in a way that mimics the human brain and its neurons, allowing us to cross the threshold of advanced computation into the realm of true artificial intelligence. If you are interested in starting on a Machine Learning project today or would like to learn more about how Untitled can assist your company with data analytics strategies, please reach out to us through the contact form.

Check out part five of this series.

Author Aaron Peabody

More posts by Aaron Peabody