This is part 2 of a 9-part series on Machine Learning. Our goal is to provide you with a thorough understanding of Machine Learning, different ways it can be applied to your business, and how to begin implementing Machine Learning within your organization with the assistance of Untitled. This series is by no means limited to those with a technical pedigree. The objective is to provide a volume of content that will be informative and practical for a wide array of readers. We hope you enjoy it, and please do not hesitate to reach out with any questions. To start from part 1, please click here.
What Is Supervised Learning?
Machine Learning can be broken out into three distinct categories: supervised learning, unsupervised learning, and reinforcement learning.
Supervised learning can be thought of exactly the way that it sounds: a style of Machine Learning directed (supervised) by a human being. With supervised learning, the desired goal or target variable is already known, and our job is to train the Machine Learning model to predict that target variable with a high degree of confidence.
Before we go any further, there are quite a few terms in the prior paragraph that we should define. For the sake of easy reading, I have italicized the terms. First, what is a target variable? A target variable is the desired output or what we hope to be able to predict. In the case of a machine learning model built for the financial industry, an example target variable would be the probability of someone defaulting on a loan.
By taking historical data of individuals who have and who have not defaulted on a loan, we can uncover the weight of certain variables in the model that contribute to a default. Variables for this example would include items such as length of credit history, or the amount being borrowed.
Once we have figured out the weights of the given variables, or in general have discovered which variables matter for producing a likely default, we can begin to train our Machine Learning algorithm with more sample historical data to tune the model. Tune means to improve its prediction accuracy. A Machine Learning model is the configuration of the ML algorithm we are using to predict outputs.
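To make the idea of learned weights a little more concrete, here is a minimal sketch in Python. It assumes logistic regression as the algorithm (one common choice for this kind of problem, not necessarily the one you would use in production), and the historical rows, the feature choices, and every function name below are invented purely for illustration.

```python
import math

# Hypothetical historical data, invented for illustration.
# Features: (years of credit history, amount borrowed in $1,000s)
# Label: 1 = defaulted, 0 = repaid
history = [
    ((1.0, 20.0), 1),
    ((2.0, 15.0), 1),
    ((3.0, 18.0), 1),
    ((8.0, 5.0), 0),
    ((10.0, 8.0), 0),
    ((12.0, 4.0), 0),
]


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_default_probability(weights, bias, features):
    """Weighted sum of the variables, turned into a probability of default."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def train(rows, learning_rate=0.01, epochs=2000):
    """Tune the weights with plain stochastic gradient descent."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for features, label in rows:
            # How far off is the current prediction for this historical row?
            error = predict_default_probability(weights, bias, features) - label
            # Nudge each weight in the direction that shrinks the error.
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias


weights, bias = train(history)
print("learned weights:", weights, "bias:", bias)

# Two hypothetical future applicants.
short_history_large_loan = predict_default_probability(weights, bias, (2.0, 19.0))
long_history_small_loan = predict_default_probability(weights, bias, (11.0, 5.0))
print("short history, large loan:", short_history_large_loan)
print("long history, small loan:", long_history_small_loan)
```

The learned weights play the role described above: each one says how strongly its variable pushes the prediction toward or away from a default. A real project would use far more data and a tested library rather than a hand-written training loop.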
The goal of the model in this example is to produce predictions about future loan applicants. We tune Machine Learning models to increase our confidence in predicting good and bad lending opportunities. Confidence can very simply be thought of as how certain we are about a given prediction. In your high school stats class, you were probably introduced to the concept of a confidence interval, or having a certain degree/percentile of confidence. For example, we might say that we are certain of this prediction at a minimum 95% confidence level.
Conducting Supervised Learning
In order to conduct a supervised learning project, you’ll need well-structured, labeled data and a chosen target variable. The outcomes of these problems are controlled and known, and we are simply trying to get the algorithm to perform automated predictions with a high degree of certainty.
Unlike clustering algorithms used in unsupervised problems, where we might not know which target variable we want to predict (more on this in the next blog post), in supervised learning we have a fairly clear idea of what the data is going to tell us and how the inputs combine to produce outputs within the model. Typically, in supervised learning problems, correlations are already known and have been demonstrated through simpler classification or regression models. The Machine Learning model is used to automate the process and fit the curve more closely for better accuracy.
Two of the main types of supervised learning problems are classification problems and regression problems. A classification problem is figuring out what category a given output belongs to; in a binary classification problem, there are only two categories. Think of it like a true/false or yes/no solution where the probability is reduced to a binary result (a result that has two possible classifications). Going back to our lending example, instead of stopping at the probability that a default will occur (such as “this individual has a 73% chance of defaulting”), we could simplify the output to “should we lend to this person, all variables considered: yes/no.”
Given this example, we would use our classification algorithm to pick which category the output belongs to. In accordance with the financial organization’s risk profile, we could define the output so that anyone who has a 20% chance or lower of defaulting is marked “y” for lending.
Anyone who has above a 20% chance of defaulting is marked “n” for not lending. Regression solves the problem of figuring out the probability of someone defaulting, with all weighted variables considered. For example, length of credit history has a greater correlation with defaulting, or not defaulting, than, say, family size or whether the enquirer is male or female. Binary classification simplifies the output of the problem to two results: either yes or no, true or false, 1 or 0.
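The step from a probability to a yes/no answer can be sketched in a few lines of Python. The 20% cutoff comes from the lending example; the function and constant names are hypothetical, and a real organization would set the cutoff according to its own risk profile.

```python
# Cutoff from the lending example: lend at a 20% default risk or lower.
DEFAULT_RISK_CUTOFF = 0.20


def lending_decision(default_probability, cutoff=DEFAULT_RISK_CUTOFF):
    """Collapse a predicted default probability into a binary "y"/"n" answer."""
    if not 0.0 <= default_probability <= 1.0:
        raise ValueError("default_probability must be between 0 and 1")
    return "y" if default_probability <= cutoff else "n"


# Probabilities as a regression-style model might produce them.
for probability in (0.18, 0.20, 0.73):
    print(probability, "->", lending_decision(probability))
```

Note that every probability lands in exactly one of the two categories, so there is no gap between the "y" and "n" ranges.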
There are problems that could be a better fit for a multiclass classification solution. Multiclass classification assigns each instance to one of three or more categories. If we wanted to go beyond yes/no and settle questions such as the amount to lend, the interest to charge, and the length of the loan, a multiclass solution would help us solve this problem. A multiclass solution could also come after a binary solution is produced in a model.
This is a bit more of a complicated problem to solve, but for example, if we can determine a yes or no regarding extending a loan, then the multiclass algorithm can help us determine the remainder of what to do with the applicant. E.g. Jane Doe has an 18% chance or less of defaulting, extend lending “y”. Based upon length of credit history, FICO score, and debt to asset ratio, extend $5,000 with daily compounding interest of 2% with a 3 year payback period. This would all reside within a three-dimensional graph of “n” possible outcomes.
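Here is one rough way to picture the two stages in Python. Everything in it is an assumption made for illustration: the three loan packages (only the first mirrors the Jane Doe example), the FICO cutoffs, and the rule-based `choose_package` function, which simply stands in for a trained multiclass classifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoanPackage:
    amount: int            # dollars
    interest_rate: float   # 0.02 means 2%
    years: int             # payback period


# Hypothetical categories the multiclass step can choose between.
PACKAGES = {
    "small": LoanPackage(5_000, 0.02, 3),    # mirrors the Jane Doe example
    "medium": LoanPackage(15_000, 0.03, 5),  # invented
    "large": LoanPackage(30_000, 0.04, 7),   # invented
}


def choose_package(fico_score):
    """Stand-in for a trained multiclass classifier: returns one of three labels.

    The FICO cutoffs are invented; a real model would learn its boundaries
    from historical data and use more variables than this.
    """
    if fico_score >= 760:
        return "large"
    if fico_score >= 700:
        return "medium"
    return "small"


def decide(default_probability, fico_score, cutoff=0.20):
    """Stage 1: binary lend / do not lend. Stage 2: pick one of three packages."""
    if default_probability > cutoff:
        return None  # "n": do not lend, so there is nothing further to classify
    return PACKAGES[choose_package(fico_score)]


jane = decide(default_probability=0.18, fico_score=680)
print(jane)
```

Only applicants who pass the binary step reach the multiclass step, which is the ordering described above.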