Wine Classifier
An exploration into Ensemble Classifiers

Over the course of a year at Make School, I’ve learned a whole lot about coding, web design, and data science, but the most important topic I’ve covered this past year wasn’t programming prowess; it was product and business design, and the experience of building something. Enter the Wine Classifier, which uses a scikit-learn dataset to build a number of classifier models that distinguish among three types of wine based on the results of a chemical analysis of wines grown in the same region of Italy by three different cultivators.
The Motivation
Throughout the past month that I’ve been working on this project, I was also going in depth into my concentration (data science) in one of my classes, so I knew from the beginning I wanted to work on something data science related. From logistic regression, ROC curves, cross-validation, and linear regression to dimensionality reduction and unsupervised learning, I had a lot to choose from; however, one topic stood out from the rest: ensemble modeling.

According to the bibliographic database ScienceDirect, “ensemble modeling is a process where multiple diverse models are created to predict an outcome, either by using many different modeling algorithms or using different training data sets.”¹ The idea of combining Random Forest models (which are themselves ensembles made up of a multitude of Decision Tree models) with Support Vector Machine (SVM) models through a Voting model really piqued my interest while I was learning about them. Of the two kinds of model I could have built, classification (predicting a label) or regression (predicting a quantity), the wine dataset is a natural fit for classification.
The Classification Models
Decision Trees

The first model I explored in the project was the Decision Tree classifier. In short, “A Decision Tree is a simple representation for classifying examples. It is a Supervised Machine Learning [algorithm] where the data is continuously split according to a certain parameter.”² Much like a Binary Tree in general computer science, a Decision Tree contains nodes, edges/branches, and leaf/terminal nodes. Unlike in Binary Trees, however, each of these elements is used in a somewhat different way. Internal nodes ask questions that test the value of a certain attribute; for example, one of my project’s nodes might ask whether a wine in the dataset has an alcohol value of 0.557 or less. Edges/branches are organized by the outcome of the test question at each node: referring back to the last example, the wines with an alcohol value of 0.557 or less would go one direction while the wines with an alcohol value greater than 0.557 would go the other. Leaf nodes, instead of simply holding a data value, predict the outcome of the classifier, such as class_0, class_1, or class_2; these terminal nodes essentially summarize the conclusions of the Decision Tree model.
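To make this concrete, here is a minimal sketch of how such a tree can be trained on the wine dataset with scikit-learn. The split ratio and random_state values are illustrative assumptions rather than the exact settings from my project:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the wine dataset: 178 wines, 13 chemical features, 3 classes.
X, y = load_wine(return_X_y=True)

# Hold out roughly 20% of the wines for testing (the split is an
# assumption, though it matches the 36 test samples counted in the
# confusion matrix below).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit a single decision tree and score it on the held-out wines.
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)
print(f"Decision tree accuracy: {tree.score(X_test, y_test):.2f}")
```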

To complement the Decision Tree classifier, I also created a confusion matrix to help analyze the model’s performance. A confusion matrix visually describes how many data entries were predicted correctly and how many were predicted incorrectly during the model’s testing phase. For instance, the confusion matrix to the left tells us that the model correctly predicted 14 out of 14 wines in class_0, 12 out of 14 wines in class_1, and 5 out of 8 wines in class_2. At the end of the analysis, the final accuracy rating of the model was only 86%.
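For reference, a confusion matrix like the one described above takes only a couple of lines, reusing the tree and test split from the previous sketch:

```python
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

# Rows are the true classes and columns are the predicted classes;
# the diagonal counts the correctly classified wines in each class.
y_pred = tree.predict(X_test)
print(confusion_matrix(y_test, y_pred))

# Or render it as a plot like the one shown in the article.
ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
```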
Random Forest
Although the accuracy rating was a relatively average 86%, the main reason my first model was a Decision Tree classifier was to compare its performance with the much better Random Forest model. Random Forest is one of the most widely used machine learning algorithms for regression and classification analysis; it works by growing a “forest” of Decision Tree models. According to Niklas Donges, a blogger for the ‘Towards Data Science’ publication, “Put simply: random forest builds multiple decision trees and merges them together to get a more accurate and stable prediction.”³ Training a large collection of learning models and aggregating their predictions tends to increase the overall accuracy of the final model, much like how advice gathered from multiple people can end up being more accurate than advice from a single individual.

From the Random Forest classifier’s report, we can see that the model’s accuracy was 96%, much higher than the previous, single Decision Tree classifier. The other important numbers in the report are precision and recall: precision is the percentage of the model’s predictions for a given class that are correct, while recall is the percentage of the true members of a class that the model correctly identifies.
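As a rough sketch of how this looks in code (the hyperparameters here are assumptions; n_estimators=100 is simply scikit-learn’s default), the forest and its report can be produced like so:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# A "forest" of decision trees, each trained on a bootstrap sample
# of the training data, with predictions merged by majority vote.
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

# classification_report prints the per-class precision, recall, and
# F1-score, along with the overall accuracy discussed above.
print(classification_report(y_test, forest.predict(X_test)))
```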
Support Vector Machine (SVM)
The next model I created, which turned out to be even more accurate than the Random Forest model, was the Support Vector Machine. In the SVM algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features we have), with the value of each feature being the value of a particular coordinate. Then, we perform classification by finding the hyper-plane that best differentiates the classes.⁴

Above is an example of the Support Vector Machine algorithm at work on a data set with two classes. The algorithm selects the hyper-plane that best segregates the two classes by maximizing the distance to the nearest data point in each class; this distance is called the margin. The margins are shown in the example by the double-pointed arrows, and the search stops at margin #3 because the distance to the nearest data point in each class is at its maximum there.

Going back to my project, the SVM classifier I created did essentially the same thing as the example, but with three classes. The report shows that the model reached a near-perfect accuracy of 99%, an even better result than the Random Forest model!
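A sketch of an equivalent SVM classifier follows. The standardization step and the linear kernel are my assumptions here, not confirmed details of the project (SVMs are sensitive to feature scale, and scaling would also explain feature values like an alcohol reading of 0.557 rather than a raw percentage):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Standardize the features, then fit a support vector classifier.
# Scikit-learn's SVC handles the three wine classes automatically
# via a one-vs-one scheme under the hood.
svm = make_pipeline(StandardScaler(), SVC(kernel="linear"))
svm.fit(X_train, y_train)
print(f"SVM accuracy: {svm.score(X_test, y_test):.2f}")
```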
Voting
The last step of the project, after creating all of these different models, was to combine them into a hard voting ensemble, which sums the votes for each class label across the individual models and predicts the class with the most votes.
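A minimal sketch of such an ensemble with scikit-learn’s VotingClassifier is shown below; the exact lineup of base estimators is an assumption, reusing the forest and svm objects from the earlier sketches:

```python
from sklearn.ensemble import VotingClassifier

# Hard voting: each base model casts one vote per wine, and the
# ensemble predicts whichever class receives the most votes.
ensemble = VotingClassifier(
    estimators=[("forest", forest), ("svm", svm)],
    voting="hard",
)
ensemble.fit(X_train, y_train)
print(f"Voting ensemble accuracy: {ensemble.score(X_test, y_test):.2f}")
```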

The voting classifier results shown above reveal that the ensemble, surprisingly, performed about the same as the Random Forest model, with an accuracy rating of ~96%, while the Support Vector Classifier (SVC) achieved the highest accuracy at ~99%. This outcome showcases that a voting ensemble is not guaranteed to outperform every single model it contains; if any individual model performs better than the voting ensemble, as the SVC did here, then that model should probably be used instead.
The Challenges
During my time working on this project, I was still in the middle of learning the concepts I eventually implemented. The fact that the SVC model outperformed the voting ensemble was an unforeseen, but welcome, result, and researching the situation afterward helped me understand nuances of these models that I hadn’t known about before completing the project. There were times when I felt a bit lost and unsure of my decision to take on this project, but it helped that I had created a list of deadlines and stuck to it, which guided me back on track. Lastly, the push and support from my instructors at Make School led me to complete this project; although I did all of the work individually, I received a lot of help from them.
Future Plans
Although this project’s main objective, analyzing scikit-learn’s wine dataset with classification models, has been completed, there are still many more things I could do to further my understanding, such as analyzing a different scikit-learn dataset or going back and applying the other topics I learned over the term and listed before: logistic regression, ROC curves, linear regression, dimensionality reduction, and unsupervised learning. Even for this project alone, I plan on creating a web application that would make the analysis provided by these models easier to read, or an application that visualizes the inner workings of each model as it runs, to further my understanding of how each one works. There is no end to the ways these sets of data could be analyzed with the power of data science… but for now I’ll stick with what I’m learning.
[1]: Vijay Kotu. (2015). Data Mining Process. ScienceDirect. https://www.sciencedirect.com/topics/computer-science/ensemble-modeling
[2]: Afroz Chakure. (July 5, 2019). Medium. https://medium.com/swlh/decision-tree-classification-de64fc4d5aac
[3]: Niklas Donges. (July 13, 2021). Built In. https://builtin.com/data-science/random-forest-algorithm
[4]: Sunil Ray. (September 13, 2017). Analytics Vidhya. https://www.analyticsvidhya.com/blog/2017/09/understaing-support-vector-machine-example-code/