Boosting in Machine Learning

Boosting

We previously discussed a statistical method to optimize the performance of an ensemble of decision trees: bagging and random forests. These algorithms train CART models in parallel and then aggregate the results in the end. 

However, another subfamily of tree algorithms called boosting takes an entirely different approach by training several trees sequentially. These trees are weak learners - they are trees with single splits. However, the knowledge aggregated from these weak learners will be used to tweak model weights to force the model to learn better over time.

We’ll discuss what boosting is and the different flavors of boosting in this tutorial. 

What is boosting?

Boosting

Boosting optimizes a naive tree model, iteratively correcting the weights, and improves the model over each iteration that can be explained in simple words like boosting is a method or collection of algorithms that help to make a weak learner stronger.

Which means the boosting method can able to create a strong classifier tree from a number of weak classifier trees by building the weak trees in series. In boosting the making of a tree starts with the training dataset. 

From the available training set, the boosting method creates the first tree or model, which may have many errors. Secondly, the boosting technique makes the second model which helps to clear the errors of the first model. Likewise, the third model is added and it goes on until the prediction is perfect without any errors for all dataset or maximum number of models or trees are added.

Now let us consider an example for understanding the boosting concept clearly. Let us take an example of checking if a message is spam or not spam. For checking an message is spam or not we make some rules for sort a message for example

  1.  Check the message has only a picture, then it is likely to be spam
  2.  Check the message has only a link, then it is likely to be a Spam
  3. Check for some words present like prize winner, then it must be a Spam
  4. Check if the message is from a correct number or domain then it is not a spam
  5. Check the message is from a genuine number then it is not a spam

Now we have five conditions for checking if a message is SPAM or not. Did you believe these conditions individually only enough to check for a message is spam or not. The answer is a NO, so we can call these conditions WEAK LEARNERS. 
So boosting is coming to the picture in these scenarios where it combines the weak learners to make it perfect for giving the right output. Boosting uses these individual conditions for checking spam combines to form a strong condition for message spam check. It uses the methods like

  1. Using average
  2. Prediction has higher vote.

Steps to implement the boosting:

  1. A weak learner makes a prediction and assigns equal valued weights to each data point.
  2. The model is evaluated. If there are errors in the model, the weights are altered to pay more attention to the data points with prediction errors.
  3. This learning process repeats until the model reaches a maximum accuracy level (or we run out of iterations).

Boosting algorithms 

There are different variants of the boosting algorithms. Before going to check the boosting algorithms we have to keep in mind that boosting is not a specific algorithm, it acts as a generic method that helps to improve a specific model. In boosting, we have to prefer which model we are going to use like regression or decision trees which boosting needs to improve. 
We will go over two traditional methods for classification and regression and a modern boosting approach optimized for performance.

AdaBoost or Adaptive Boosting

Boosting

Adaptive boosting or AdaBoost was the original boosted classifier that selects features that improve the model’s predictive power. AdaBoost works by constructing decision stumps - individual decision trees with a single split, as shown above with each of the individual trees.  

As you can see, the individual stumps are not great classifiers. However, incorrectly classified data points are weighted in the AdaBoost scheme. At the end of constructing several of these trees, the weights are added from these weak learners. The final model built from all of these weak learners can find complex decision boundaries that can accurately classify data points.

From the picture above check the tree 1 where there are equally assigned weights for each data point like plus and minus and we can also see a decision stump represented by a straight line that classifies the data points. 

From the picture we can understand that the classification is not proper as not all the plus are inside the decision stump. Now we add higher weights to the plus and make another decision stump.

Note check the picture 2 where we add more weights for the plus and draw another decision stump but it is also not proper prediction as some minus are also inside the stump it may cause classification errors. So we add some weights to these minus and draw another decision stump.

Likewise after picture 3, we will get the perfect classification shown in tree 4 by combining the previous three individual trees. Adaboost works on the principle of combining the weak individual learners to make a strong learner.

Gradient Boosting

Boosting

Gradient boosting is another boosting variant. However, the main difference between AdaBoost and Gradient Boosting is that AdaBoost identifies model errors by adding a penalty value to the incorrectly classified data points. However, gradient boosting uses gradient descent to correct the model weights.

Another minor point about the difference between AdaBoost and gradient boosting is the loss function. AdaBoost minimizes the exponential loss function, while gradient boosting can use any differentiable loss function. The exponential loss function is prone to outliers; thus gradient boosting is an ideal boosting algorithm for noisy data.

XGBoost

Boosting

While gradient boosting is a powerful method, its original implementation was not optimized for real-world applications. Thus, eXtreme Gradient Boosting (XGBoost) was created to make gradient boosting accessible towards a wide variety of problems.

XGBoost is simply gradient boosting with several improvements that allow it to run faster and more accurately. 

Several of these improvements are listed below, including:

  1.  Parallelized tree building with gradient boosting
  2.  Tree pruning
  3.  Efficient use of hardware
  4.  Regularization to avoid overfitting
  5.  Handling missing data
  6.  Built-In Cross-Validation

XGBoost is a robust out-of-the-box algorithm to get started with many complex data science problems. 

Advantages of boosting

  1.   Easy implementation: It is very easy to implement as it does not need a specific preprocessing of data. Also it has many methods to handle    any of the data which is missing.
  2.  Very low bias: Boosting is done by combining many trees in sequential order which needs a lot of observations and as a result it will reduce the bias.
  3.  High efficiency in Computation: It takes only the needed features, which increase the power to provide proper efficiency that Increases the computational efficiency of the model.

Disadvantages of boosting

  1. Overfitting: There will be chances for overfitting the model.
  2. Intensive computation: Boosting is done by combining many trees to make a strong algorithm that increase the computation.

Applications of Boosting

  1. Healthcare: Boosting helps in the medical field by reducing the chances for an error in medical predictions by combining many weak observations like cardiovascular or cancer survival rate etc.
  2.  IT: It helps in many ways in a wide range in the IT industry like the search engines to properly evaluate a low rank page.
  3.  Finance:  Boosting method helps in the financial field to reduce the errors in critical tasks like pricing, fraud detection etc.