This is part 3 of a nine-part series on Machine Learning. Our goal is to provide you with a thorough understanding of Machine Learning, the different ways it can be applied to your business, and how to begin implementing Machine Learning within your organization with the assistance of Untitled. This series is by no means limited to readers with a technical pedigree. The objective is to provide a volume of content that will be informative and practical for a wide array of readers. We hope you enjoy it, and please do not hesitate to reach out with any questions. To start from part 1, please click here.
What Is Unsupervised Learning?
Machine Learning can be broken out into three distinct categories: supervised learning, unsupervised learning, and reinforcement learning.
Unsupervised learning is a machine learning method that uses data that is unstructured, unlabeled, and unclassified to produce outputs and insights. If you read the previous post about supervised Machine Learning, you’ll recognize that this type of ML is the opposite approach: supervised learning starts from labeled examples, while unsupervised learning starts with none.
Unsupervised learning runs without a human telling the algorithm what the right answers are. Often, people use unsupervised learning for big data problems where they need a computer to help derive meaning from a substantial amount of unstructured data. Whereas supervised learning is a human teaching a computer program what the data means, unsupervised learning can be thought of, somewhat ironically, as the computer teaching us what the data says.
Unsupervised learning is becoming increasingly important due to the way information architecture functions today. ML experts and data scientists would love to live in a world where supervised learning problems were the status quo for tackling business problems (they are much easier to build, measure, and interpret, in our opinion), but that is not the way the world works.
If you have any experience in data science or data engineering, you’ll know what we’re talking about. Most companies’ information architecture is largely unstructured, poorly labeled if labeled at all, and has not been classified by any means. Hence the need for unsupervised learning algorithms that can parse through the data and help produce powerful insights from the vast amount of unstructured data in the world.
The best way to explain unsupervised learning is to use an example, similar to the financial lending anecdote used in the previous post. The classic unsupervised learning example is email spam detection. Additionally, in order to not overwhelm you with information, we’ll focus on one particular type of unsupervised Machine Learning algorithm.
Unsupervised Learning Algorithm
In a supervised learning problem, the emails will have already been classified as spam or not spam (a binary classification problem). There are supervised learning recipes for spam detection available. However, in the real world, that is not how spam works. No one sends a spam email and also tells the recipient that it is spam.
Additionally, the structure and style of spam emails change often, so we do not want to limit our ML model to whatever training data happens to be available; that would defeat the point. So, we must find a way to use the unstructured data of the emails themselves to mark each email as spam or not spam.
We can create a classification of spam or not spam by observing the commonalities that can be discovered within the text bodies of spam emails. Common variables analyzed in the text body of a spam email are spelling errors, obscure vernacular, grammar errors, and promotional phrasing.
Clustering algorithms have been the go-to Machine Learning models for email spam detection. Clustering can be thought of as a technique that groups data together based upon common features or properties. For example, if you were given a list of 100 wines, a clustering algorithm could be run that would read the label description for each wine. Some clusters that may be produced from a model like this would be a red or white classification and perhaps a spicy, sour, or sweet set of taste clusters. You could probably use your own intuition and eyes to cluster the bottles together based upon the commonalities of the wines that you discover through reading the labels. Humans are pattern recognition machines.
But what if you had to do it for 1,000,000 bottles of wine? You might be able to do it, but it would take a very long time because of how repetitive such a large task is. You are also limited in the categories you can cluster by, and as a human being you are prone to making mistakes. There may be cluster hierarchies that you don’t even think of. If you have ever run a Gaussian Mixture Model, you know what we’re talking about. Those algorithms are notorious for finding delightfully weird categories within data that you might not have thought of.
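To make the wine example concrete, here is a minimal sketch in Python. It is not a production clustering model: it uses a simple "leader" clustering approach with word-overlap (Jaccard) similarity, and the four label descriptions, the function names, and the 0.5 threshold are all assumptions we made up for illustration.

```python
import re

def tokens(text):
    """Return the set of lowercase words in a label description."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    """Share of words two token sets have in common (0.0 to 1.0)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def leader_cluster(descriptions, threshold=0.5):
    """Put each description in the first cluster whose leader is similar enough.

    Each cluster is a list of indices; the first index is the cluster's leader.
    The threshold is an illustrative assumption, not a tuned value.
    """
    clusters = []
    for i, text in enumerate(descriptions):
        words = tokens(text)
        for cluster in clusters:
            leader_words = tokens(descriptions[cluster[0]])
            if jaccard(words, leader_words) >= threshold:
                cluster.append(i)
                break
        else:
            # No existing cluster was similar enough, so start a new one.
            clusters.append([i])
    return clusters

# Hypothetical label descriptions, invented for this example.
wine_labels = [
    "dry red wine with spicy pepper notes",
    "sweet white wine with honey notes",
    "dry red wine with spicy clove notes",
    "sweet white wine with peach notes",
]

wine_clusters = leader_cluster(wine_labels)
print(wine_clusters)
```

Notice that we never told the program which wines were red or white. It grouped the labels purely by the words they share, and it is up to a human to look at each group and decide what it means.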
When To Use It
If it is an unstructured big data problem, clustering algorithms come to the rescue. Not only are they very good at redundant tasks, but they help find hidden patterns in extremely large sets of data. So let’s go back to our email spam detection example.
If the majority of emails start with “Hi,” “Hello,” “Hey,” and “Dear,” then emails that start with “Hii,” “Helloer,” “Heye,” and “Der” could be thought of as suspect. Now, just because an email has a spelling error in the introductory line does not mean it is spam. However, this is a variable we can take into consideration when weighting our model.
Spelling errors are an initial giveaway for a spam email. With that said, if we ran our algorithm against tens of thousands of emails, bucketing by spelling errors in the intro line, this could be the first test in a series of clusters that inform the propensity of an email to be spam. We can also look at the common words and phrases within an email that are associated with spam, and use those to classify the emails as well.
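As a rough sketch of the greeting check, the Python snippet below flags an email whose first word is a near miss of a common greeting. It relies on the standard library's difflib module; the greeting list, the function name, and the 0.75 similarity cutoff are assumptions chosen for illustration, and a real filter would treat the result as one signal among many.

```python
import difflib

# Greetings taken from the example in the text.
COMMON_GREETINGS = ["hi", "hello", "hey", "dear"]

def greeting_is_suspect(email_body, cutoff=0.75):
    """Return True when the first word looks like a misspelled common greeting.

    The cutoff is an illustrative assumption, not a tuned value.
    """
    words = email_body.strip().split()
    if not words:
        return False
    first = words[0].strip(",.!:;").lower()
    if first in COMMON_GREETINGS:
        return False  # correctly spelled greeting
    # A close but inexact match suggests a misspelled greeting.
    matches = difflib.get_close_matches(first, COMMON_GREETINGS, n=1, cutoff=cutoff)
    return bool(matches)

for body in ["Hi team, see attached.", "Hii friend, claim your prize", "Der customer, act today"]:
    print(body, "->", greeting_is_suspect(body))
```

An email that opens with an ordinary word that is not a greeting at all is left alone, which matches the point above: a suspicious greeting raises a flag but does not decide the outcome on its own.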
Phrases that surround urgent money-making opportunities, discounts and link loading can all be good indicators of a spam email. Additionally, when you group phrasing detection with spelling and grammar error filters, you get an algorithm well suited for detecting spam emails at a high confidence level.
Unsupervised learning is a bit more complex to wrap your head around than supervised learning. However, given the landscape and data ecosystem of most companies, it is of the utmost importance to have unsupervised learning know-how in your data science toolbox. Business data is fragmented, discontinuous and messy, but unsupervised learning methods such as clustering algorithms help to clean that up.
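Here is a minimal sketch of how phrase and link signals could feed a clustering step. It turns each email into two numbers (promotional phrase hits and link count) and splits the emails into two groups with a tiny k-means written in plain Python. The phrase list, the sample emails, and the function names are all hypothetical, and spelling or grammar counts would simply be additional numbers per email.

```python
import re

# Hypothetical phrase list, invented for this example.
PROMO_PHRASES = ["act now", "make money", "discount", "limited time", "free"]

def features(email):
    """Turn an email into (promotional phrase hits, link count)."""
    text = email.lower()
    phrase_hits = sum(text.count(phrase) for phrase in PROMO_PHRASES)
    link_count = len(re.findall(r"https?://", text))
    return (float(phrase_hits), float(link_count))

def squared_distance(p, q):
    return sum((x - y) ** 2 for x, y in zip(p, q))

def two_means(points, iterations=10):
    """Tiny k-means with k=2; returns a 0 or 1 cluster label per point.

    Centroids start at the points with the smallest and largest feature
    totals, so cluster 1 begins as the "most promotional" end of the data.
    """
    centroids = [min(points, key=sum), max(points, key=sum)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        labels = [
            0 if squared_distance(p, centroids[0]) <= squared_distance(p, centroids[1]) else 1
            for p in points
        ]
        # Move each centroid to the mean of its assigned points.
        for c in (0, 1):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

# Hypothetical emails, invented for this example.
emails = [
    "Hi team, the quarterly report is attached. Thanks!",
    "Act now! Make money fast with this limited time discount: http://example.com http://example.org",
    "Hey, are we still on for lunch tomorrow?",
    "FREE discount!!! Act now at http://example.net",
]

points = [features(e) for e in emails]
labels = two_means(points)
print(list(zip(points, labels)))
```

The algorithm only says which emails belong together. Deciding that the group heavy on promotional phrases and links is the "spam" group is still an interpretation a person makes after looking at the clusters.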
We hope you enjoyed this post and will continue on in the series to part 4, reinforcement learning. In the next section, you’ll learn about the style of ML that produced a computer that taught itself the game “Go” and went on to beat the best Go player in the world. If you are interested in starting on a Machine Learning project today or would like to learn more about how Untitled can assist your company with data analytics strategies, please reach out to us through the contact form.
Check out part four of this series.