Bagging & Random Forests

The Issue with Classification and Regression Trees (CART)

While CART is an intuitive and powerful algorithm, it has several weaknesses:

  1. Overfitting: Decision trees are prone to overfitting, depending on the maximum depth selected and on whether the data contains skewed or highly variable features.
  2. Model instability: Even a small perturbation to the training dataset (like adding a single data point or having a slightly different training set) can have a dramatic impact on the structure of a tree.
  3. Sensitivity to imbalanced data: The way decision trees choose branch splits depends on the number of samples in each class, so the majority class can dominate the splits.

To address these issues, ensemble methods have been proposed which improve the accuracy of decision trees. We’ll discuss two related modifications:

  1. Bagging
  2. Random forests

Bootstrapping

To understand bagging, we first need to understand a statistical method called bootstrap sampling. The goal of the bootstrap is to estimate properties of an unknown data distribution, given a dataset of limited size.

Bootstrapping estimates a population parameter (such as the mean) by repeatedly forming random samples from the dataset, drawn with replacement.

The basic idea is that, instead of collecting new data, we re-sample the data we already have: we repeatedly draw samples, with replacement, from the original dataset. This process is repeated a large number of times (1,000-10,000 times), and by computing the mean or any other statistic on each resample, we gain a sense of how that statistic is distributed.
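
As a rough illustration of this resampling loop, here is a minimal Python sketch using only the standard library. The function name `bootstrap_statistic`, the toy data, and the choice of a 95% percentile interval are assumptions made for the example, not part of the method description above.

```python
import random
import statistics

def bootstrap_statistic(data, statistic=statistics.mean, n_resamples=1000, seed=0):
    """Compute `statistic` on `n_resamples` bootstrap resamples of `data`."""
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(n_resamples):
        # Draw n points from the original data, with replacement.
        resample = rng.choices(data, k=n)
        estimates.append(statistic(resample))
    return estimates

# Hypothetical toy data, made up for this example.
data = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9, 3.1, 4.4]

estimates = sorted(bootstrap_statistic(data))
# Approximate 95% percentile interval for the mean.
lower = estimates[int(0.025 * len(estimates))]
upper = estimates[int(0.975 * len(estimates))]

print(f"sample mean: {statistics.mean(data):.2f}")
print(f"bootstrap 95% interval for the mean: [{lower:.2f}, {upper:.2f}]")
```

Passing a different function as `statistic` (for example `statistics.median`) gives the bootstrap distribution of that statistic instead.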

Bagging

Bagging, also called bootstrap aggregation, is an ensemble learning method used to improve accuracy and performance. It does so by reducing the variance of the model's predictions.
Bagging is commonly used with decision tree algorithms. It can be used for both classification and regression, and it helps to reduce overfitting.
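
One standard way to see the variance reduction, given here as an aside: suppose the ensemble contains $B$ models, each with prediction variance $\sigma^2$ and pairwise correlation $\rho$ (these symbols are assumptions introduced for illustration). The variance of the averaged prediction is then

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B}\hat{f}_b(x)\right)
  = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
```

Adding more models shrinks the second term, while the first term remains unless the models become less correlated with each other.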

 

Bagging aims to address the model instability problem for CART using the bootstrapping method. The bagging algorithm is as follows:

  1. Create a large number of random training subsamples by sampling with replacement.
  2. For each subsample, train a CART model.
  3. Given the test set, obtain a prediction from each model.
  4. For a classification task, the majority vote across all models is the final predicted class.
  5. For a regression task, the mean across all models is the final predicted value.
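
The steps above can be sketched for a classification task as follows. This is a minimal illustration, not a reference implementation: to keep it short, the base model is a depth-1 tree (a "stump") standing in for a full CART model, and the names `fit_stump`, `bagging_fit`, `bagging_predict`, and the toy data are all assumptions made for the example.

```python
import random
from collections import Counter

def fit_stump(X, y):
    """Depth-1 tree: pick the (feature, threshold) with the fewest training errors."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[j] <= t]
            right = [lab for row, lab in zip(X, y) if row[j] > t]
            if not left or not right:
                continue  # not a real split
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0]
            errors = (sum(lab != left_label for lab in left)
                      + sum(lab != right_label for lab in right))
            if best is None or errors < best[0]:
                best = (errors, j, t, left_label, right_label)
    if best is None:
        # No split possible (all rows identical): predict the majority class.
        majority = Counter(y).most_common(1)[0][0]
        return lambda row: majority
    _, j, t, left_label, right_label = best
    return lambda row: left_label if row[j] <= t else right_label

def bagging_fit(X, y, n_models=25, seed=0):
    """Train one stump per bootstrap sample of the training set."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        models.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return models

def bagging_predict(models, row):
    """Majority vote across all models (use the mean instead for regression)."""
    votes = Counter(model(row) for model in models)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: the first feature separates the two classes.
X = [[1.0, 5.0], [2.0, 4.0], [3.0, 6.0], [6.0, 5.0], [7.0, 4.0], [8.0, 6.0]]
y = [0, 0, 0, 1, 1, 1]

models = bagging_fit(X, y)
print(bagging_predict(models, [0.5, 5.0]))
print(bagging_predict(models, [9.0, 5.0]))
```

For a regression task, the only change to the aggregation step would be to return the mean of the individual predictions rather than the most common vote.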

This procedure stabilizes the final model answer because it favors the paths that appear in a majority of trees. However, bagging does not solve another key issue with tree algorithms: high-variance and highly correlated features.

The issue is simple: when constructing each tree, we always use all of the features. Some features naturally have higher variance or are highly correlated with each other. These features tend to show up as important in every tree of a CART ensemble, and they may mask the true relationships within the dataset.

How Bagging Works

  1. Suppose we have a dataset with many features and observations. First, draw multiple random samples from this dataset with replacement, creating several bootstrap subsets.
  2. Choose a base model (for example, a decision tree) for each subset.
  3. Train each model on its own subset, in parallel and independently of the others.
  4. Combine the predictions of all sub-models to get the final result.

Advantages of Bagging

  • It improves the model by reducing overfitting.
  • It increases the performance and accuracy of the model.
  • It can work with high-dimensional data.

Disadvantages of Bagging

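  • It can result in some loss of interpretability of the model.
  • If the base models are poorly chosen (for example, trees that are too shallow), the ensemble can still have high bias, because bagging reduces variance rather than bias.
  • Bagging is computationally more expensive than a single model, since many models must be trained, even though it usually improves accuracy.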

Random Forests

To address the issue with high-variance or collinear features, a better method would be to periodically omit highly variable features, or some of the correlated features, during the bagging process. This would reveal important tree structures that are typically hidden by these features.

Random forests address this issue by making one minor tweak to the bagging process: instead of considering all features, each split in each tree considers only a random, smaller subset of features. Dominant features are therefore left out of many splits, and hidden tree structures can begin to emerge after repeating this process many times.

The number of features considered at each branch split is a hyperparameter introduced by random forests, and it needs to be specified.

A random forest applies the bootstrap to CART models. Suppose we have 1,000 observations and 10 variables. The random forest builds many different CART models, each from a random sample of the observations and random subsets of the variables. The final prediction combines the predictions of all the trees: their mean for regression, or the majority vote for classification.

In simple words, a random forest is a collection of randomized decision trees. It relies on two concepts to produce the final prediction from the multiple trees:

  1. A random sampling of the dataset while building trees
  2. A random subset of features while splitting the nodes

How the Random Forest Algorithm Works

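  1. Draw a random sample of M records from the dataset, with replacement.
  2. Build a decision tree based on these records, considering a random subset of features at each split.
  3. Select the number of trees you need and repeat steps 1 and 2.
  4. Combine the predictions of all trees (majority vote or mean) to get the final prediction.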
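
The procedure can be sketched in plain Python as below. This is a simplified illustration under several assumptions: splits are scored with Gini impurity, the default number of features per split is the square root of the total (a common choice for classification), class labels are not tuples, and all function names and the toy data are made up for the example.

```python
import math
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build_tree(X, y, max_features, rng, depth=0, max_depth=5):
    # Stop when the node is pure or the depth limit is reached.
    if len(set(y)) == 1 or depth == max_depth:
        return Counter(y).most_common(1)[0][0]
    # Random forest tweak: only a random subset of features is tried at this split.
    features = rng.sample(range(len(X[0])), max_features)
    best = None
    for j in features:
        # Dropping the largest value guarantees both sides of the split are non-empty.
        for t in sorted({row[j] for row in X})[:-1]:
            left = [i for i, row in enumerate(X) if row[j] <= t]
            right = [i for i, row in enumerate(X) if row[j] > t]
            score = (len(left) * gini([y[i] for i in left])
                     + len(right) * gini([y[i] for i in right])) / len(y)
            if best is None or score < best[0]:
                best = (score, j, t, left, right)
    if best is None:
        # The sampled features are constant here: make a leaf.
        return Counter(y).most_common(1)[0][0]
    _, j, t, left, right = best
    left_child = build_tree([X[i] for i in left], [y[i] for i in left],
                            max_features, rng, depth + 1, max_depth)
    right_child = build_tree([X[i] for i in right], [y[i] for i in right],
                             max_features, rng, depth + 1, max_depth)
    return (j, t, left_child, right_child)

def predict_tree(node, row):
    # Internal nodes are tuples; anything else is a leaf label.
    while isinstance(node, tuple):
        j, t, left_child, right_child = node
        node = left_child if row[j] <= t else right_child
    return node

def random_forest_fit(X, y, n_trees=50, max_features=None, seed=0):
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    if max_features is None:
        max_features = max(1, int(math.sqrt(p)))
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample of the rows
        forest.append(build_tree([X[i] for i in idx], [y[i] for i in idx],
                                 max_features, rng))
    return forest

def random_forest_predict(forest, row):
    votes = Counter(predict_tree(tree, row) for tree in forest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data with two well-separated classes.
X = [[1, 1, 1], [2, 1, 2], [1, 2, 2], [2, 2, 1],
     [8, 9, 8], [9, 8, 9], [8, 8, 9], [9, 9, 8]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

forest = random_forest_fit(X, y)
print(random_forest_predict(forest, [0, 0, 0]))
print(random_forest_predict(forest, [10, 10, 10]))
```

The only difference from plain bagging is the `rng.sample(...)` line inside `build_tree`: setting `max_features` to the total number of features would turn this sketch back into bagged trees.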

Out-of-bag estimated performance

In each bootstrap sample, some observations are not drawn and are therefore not used to train that tree. These are called out-of-bag samples, and they can be used as a validation set to evaluate the performance of that individual tree.

By taking the average of the out-of-bag performance, we get an estimated accuracy of the bagged models that is similar to a cross-validation procedure. This is known as out-of-bag accuracy. 

This is another reason why bootstrap is a powerful statistical method for CART and machine learning models: it introduces a robust statistical method to assess model accuracy. 
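
A minimal sketch of the out-of-bag calculation is shown below. It follows the description above by scoring each model on its own out-of-bag samples and averaging those scores. The base model is a one-feature threshold rule standing in for a tree, and the function names and toy data are assumptions made for the example.

```python
import random
from collections import Counter

def fit_threshold_model(xs, ys):
    """Stand-in for a tree: a 1-D stump using the threshold with the fewest errors."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        left_label = Counter(left).most_common(1)[0][0]
        # If nothing falls to the right, reuse the left label.
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = (sum(y != left_label for y in left)
                  + sum(y != right_label for y in right))
        if best is None or errors < best[0]:
            best = (errors, t, left_label, right_label)
    _, t, left_label, right_label = best
    return lambda x: left_label if x <= t else right_label

def oob_accuracy(xs, ys, n_models=200, seed=0):
    """Average, over all models, of each model's accuracy on its out-of-bag samples."""
    rng = random.Random(seed)
    n = len(xs)
    scores = []
    for _ in range(n_models):
        in_bag = [rng.randrange(n) for _ in range(n)]  # bootstrap indices
        out_of_bag = set(range(n)) - set(in_bag)       # rows this model never saw
        if not out_of_bag:
            continue
        model = fit_threshold_model([xs[i] for i in in_bag],
                                    [ys[i] for i in in_bag])
        correct = sum(model(xs[i]) == ys[i] for i in out_of_bag)
        scores.append(correct / len(out_of_bag))
    return sum(scores) / len(scores)

# Hypothetical toy data: class 0 for small x, class 1 for large x.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

accuracy = oob_accuracy(xs, ys)
print(f"estimated out-of-bag accuracy: {accuracy:.2f}")
```

No separate validation split is needed here, because every model is evaluated only on rows that were left out of its own bootstrap sample.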

Advantages of Random Forests

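  • A random forest does not depend heavily on any single tree: each tree is trained on a different bootstrap sample and feature subsets, so the errors of individual trees tend to average out.
  • It is very stable: adding new data points affects only some of the trees, so the overall model changes little.
  • It works well for both classification (categorical) and regression (numerical) problems.
  • It can still perform well when the dataset has missing values, although some implementations require these to be imputed first.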

Disadvantages of Random Forests

  • The model is structurally complex: it consists of many decision trees, which makes it harder to interpret than a single tree.
  • Training takes longer, because many trees have to be built.