We previously discussed a statistical method to optimize the performance of an ensemble of decision trees: bagging and random forests. These algorithms train CART models in parallel and then aggregate the results in the end.
However, another subfamily of tree algorithms called boosting takes an entirely different approach by training several trees sequentially. These trees are weak learners - trees with a single split. The knowledge aggregated from these weak learners is used to tweak the model's weights, forcing the model to learn better over time.
We’ll discuss what boosting is and the different flavors of boosting in this tutorial.
Boosting starts from a naive tree model and iteratively corrects its weights, improving the model with each iteration. In simple terms, boosting is a method, or collection of algorithms, that makes a weak learner stronger.
In other words, boosting builds a strong classifier tree from a number of weak classifier trees by training the weak trees in series. As with any tree method, the process starts with the training dataset.
From the training set, boosting creates the first tree or model, which may make many errors. It then builds a second model that focuses on correcting the first model's errors. Likewise, a third model is added, and so on, until the predictions are error-free on the whole dataset or the maximum number of models or trees has been reached.
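The sequential error-correction loop described above can be sketched in a few lines. This is a minimal, illustrative implementation (the dataset, number of rounds, and learning rate are made-up values), where each new weak tree is fit to the residuals left by the ensemble so far:

```python
# Minimal sketch of sequential boosting for regression: each new stump
# (a one-split tree, i.e. a weak learner) is fit to the residual errors
# of the current ensemble. Values here are illustrative.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel()

n_rounds = 50
learning_rate = 0.1
prediction = np.zeros_like(y)  # start from a naive model (all zeros)
trees = []

for _ in range(n_rounds):
    residuals = y - prediction                   # errors of the ensemble so far
    stump = DecisionTreeRegressor(max_depth=1)   # weak learner: a single split
    stump.fit(X, residuals)
    prediction += learning_rate * stump.predict(X)
    trees.append(stump)

# The training error shrinks as trees are added sequentially
mse = np.mean((y - prediction) ** 2)
print(round(mse, 4))
```

The key point is that each tree is trained *after* the previous ones, on what they got wrong, rather than in parallel as in bagging.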
Now let us consider an example to understand the boosting concept clearly: checking whether a message is spam. To classify a message, we might define a few rules, for example:
Now we have five conditions for checking whether a message is spam. Is any one of these conditions, on its own, enough to decide? The answer is no, so we call these conditions weak learners.
This is where boosting comes into the picture: it combines the weak learners so that together they give the right output. Boosting combines these individual spam-checking conditions into one strong condition, typically by taking a weighted vote of the weak learners.
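The idea of combining weak rules into one strong rule can be illustrated with a weighted vote. The rules and weights below are invented for demonstration only; a real boosting algorithm would learn the weights from data:

```python
# Illustrative (made-up) weak rules for spam detection. Individually,
# each rule is unreliable; combined by a weighted vote, they form a
# stronger classifier. Weights here are hand-picked for demonstration.
def rule_has_link(msg):
    return "http://" in msg

def rule_all_caps(msg):
    return msg.isupper()

def rule_mentions_money(msg):
    return "$" in msg or "prize" in msg.lower()

# (rule, weight) pairs: the weight reflects how much we trust each rule
weak_learners = [
    (rule_has_link, 0.4),
    (rule_all_caps, 0.3),
    (rule_mentions_money, 0.5),
]

def is_spam(msg, threshold=0.5):
    # Weighted vote: sum the weights of the rules that fire
    score = sum(w for rule, w in weak_learners if rule(msg))
    return score >= threshold

print(is_spam("WIN A PRIZE http://spam.example"))  # two rules fire -> True
print(is_spam("See you at lunch tomorrow"))        # no rules fire -> False
```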
There are different variants of boosting. Before looking at the specific algorithms, keep in mind that boosting is not a single algorithm; it is a generic method for improving a base model. In boosting, we first choose which kind of model we want to improve, such as a regression model or a decision tree.
We will go over two traditional methods for classification and regression and a modern boosting approach optimized for performance.
Adaptive boosting, or AdaBoost, was the original boosted classifier that selects features that improve the model’s predictive power. AdaBoost works by constructing decision stumps - individual decision trees with a single split, like each of the individual trees shown above.
As you can see, the individual stumps are not great classifiers. However, in the AdaBoost scheme, incorrectly classified data points are given higher weights in the next round. After constructing several of these trees, the weak learners' predictions are combined using their weights. The final model built from all of these weak learners can find complex decision boundaries that accurately classify the data points.
In the picture above, look at tree 1: every data point (each plus and minus) starts with an equal weight, and a decision stump, represented by a straight line, classifies the points.
From the picture, we can see that the classification is not perfect, since not all of the pluses fall on the correct side of the stump. So we assign higher weights to the misclassified pluses and build another decision stump.
In picture 2, the pluses carry more weight and a new decision stump is drawn, but the prediction is still imperfect: some minuses now fall on the wrong side of the stump, causing classification errors. So we increase the weights of those minuses and build yet another decision stump.
Likewise, after picture 3, we arrive at the correct classification shown in tree 4 by combining the three previous individual trees. AdaBoost works on the principle of combining weak individual learners into a strong learner.
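In practice, this whole reweighting scheme is available off the shelf. Here is a short sketch using scikit-learn's `AdaBoostClassifier`, which uses decision stumps (depth-1 trees) as its default base learner; the dataset and parameter values are illustrative:

```python
# AdaBoost on a synthetic classification problem. The number of
# estimators and the dataset parameters are illustrative choices.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = AdaBoostClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

acc = clf.score(X_test, y_test)
print(f"test accuracy: {acc:.3f}")
```

Each of the 100 stumps is trained on a reweighted version of the data, exactly as in the picture walkthrough above.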
Gradient boosting is another boosting variant. The main difference between AdaBoost and gradient boosting is how errors are corrected: AdaBoost increases the weights (penalties) of incorrectly classified data points, while gradient boosting uses gradient descent to correct the model.
Another point of difference is the loss function. AdaBoost minimizes the exponential loss function, while gradient boosting can use any differentiable loss function. The exponential loss function is sensitive to outliers; with a more robust loss, gradient boosting is therefore well suited to noisy data.
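The flexibility in loss functions is directly exposed in scikit-learn's `GradientBoostingRegressor` via its `loss` parameter. This sketch (with made-up data and parameter values) uses the Huber loss, which is less sensitive to outliers than squared error:

```python
# Gradient boosting with a robust loss. We fit y = 2x plus noise, with
# a few large injected outliers; the 'huber' loss limits their influence.
# Dataset and hyperparameters are illustrative.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = 2.0 * X.ravel() + rng.normal(0, 0.5, size=300)
y[:10] += 50.0  # inject a few large outliers

model = GradientBoostingRegressor(loss="huber", n_estimators=200, random_state=0)
model.fit(X, y)

# Evaluate against the true underlying function on outlier-free inputs
grid = np.linspace(1, 9, 50).reshape(-1, 1)
mean_abs_error = np.abs(model.predict(grid) - 2.0 * grid.ravel()).mean()
print(round(mean_abs_error, 3))
```

Swapping `loss="huber"` for `loss="squared_error"` would let the outliers pull the fit further from the true trend.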
While gradient boosting is a powerful method, its original implementation was not optimized for real-world applications. Thus, eXtreme Gradient Boosting (XGBoost) was created to make gradient boosting accessible to a wide variety of problems.
XGBoost is simply gradient boosting with several improvements that allow it to run faster and more accurately.
Several of these improvements include:
XGBoost is a robust out-of-the-box algorithm to get started with many complex data science problems.