Bagging & Random Forests

January 3, 2022, Learn eTutorial


Classification and Regression Trees (CART) have several known weaknesses. Bagging and random forests are ensemble learning methods that were introduced to address these weaknesses and improve the accuracy of decision trees.

The Issue with Classification and Regression Trees (CART)

While CART is an intuitive and powerful algorithm, it has several weaknesses:

Overfitting: Decision trees are prone to overfitting, particularly when the maximum depth is large or when the data contains skewed, highly variable features.
Model instability: Even a small perturbation to the training dataset (like adding a single data point or having a slightly different training set) can have a dramatic impact on the structure of a tree.
Sensitivity to imbalanced data: The way decision trees choose branch splits depends on the number of samples in each class, so splits tend to favor the majority class.

To address these issues, ensemble methods have been proposed that improve the accuracy of decision trees. We’ll discuss two related modifications:
1. Bagging
2. Random forests.

Bootstrapping

To understand bagging, we first need to understand a statistical method called bootstrap sampling. The goal of the bootstrap is to estimate the distribution of a statistic when only a limited dataset is available.

Bootstrapping estimates a population parameter by repeatedly drawing random samples from the dataset with replacement.

The basic idea is that, in addition to drawing a single random sample from a population, we can re-sample the data we already have: we draw repeated samples, with replacement, from the original dataset. This process is repeated a large number of times (typically 1,000 to 10,000), and by computing the mean or another statistic on each resample, we gain a sense of how that statistic is distributed.
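The resampling loop described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the data values, the function name `bootstrap_means`, and the choice of 1,000 resamples are assumptions made for the example.

```python
import random
import statistics

def bootstrap_means(data, n_resamples=1000, seed=0):
    """Return the mean of each bootstrap resample of `data`."""
    rng = random.Random(seed)
    n = len(data)
    means = []
    for _ in range(n_resamples):
        # Draw n values from the original data WITH replacement.
        resample = rng.choices(data, k=n)
        means.append(statistics.fmean(resample))
    return means

# Hypothetical sample of 10 measurements.
sample = [2.1, 3.4, 1.9, 5.6, 4.2, 3.3, 2.8, 4.9, 3.7, 3.1]
means = bootstrap_means(sample)

# The spread of the resampled means approximates the standard error of the mean.
std_error = statistics.stdev(means)

# A rough 95% percentile interval for the mean.
ordered = sorted(means)
low = ordered[int(0.025 * len(ordered))]
high = ordered[int(0.975 * len(ordered)) - 1]

print(f"sample mean = {statistics.fmean(sample):.2f}")
print(f"bootstrap standard error ~ {std_error:.2f}")
print(f"95% interval ~ ({low:.2f}, {high:.2f})")
```

The same loop works for any statistic: replacing `statistics.fmean` with `statistics.median` would give a bootstrap distribution of the median instead.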

Bagging

Bagging, also called bootstrap aggregation, is an ensemble learning method used to increase the accuracy and stability of a model. It improves performance by reducing the variance of the model's predictions.
Bagging is commonly used with decision tree algorithms. It applies to both classification and regression, and it helps to reduce overfitting.

Bagging aims to address the model instability problem for CART using the bootstrapping method. The bagging algorithm is as follows:

Create a large number of random training subsamples, drawn with replacement
For each subsample, train a CART model
Given the test set, combine the predictions from all models
For a classification task, the majority vote across all models is the final predicted class.
For a regression task, the mean across all models is the final predicted value.
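The final aggregation step can be sketched as follows. The function names and the example predictions are hypothetical; the sketch assumes each model has already produced one prediction for a single test point.

```python
from collections import Counter
from statistics import fmean

def aggregate_classification(predictions):
    """Majority vote over the class labels predicted by each model."""
    return Counter(predictions).most_common(1)[0][0]

def aggregate_regression(predictions):
    """Mean of the numeric values predicted by each model."""
    return fmean(predictions)

# Hypothetical predictions from five bagged trees for one test point.
class_votes = ["spam", "ham", "spam", "spam", "ham"]
value_preds = [10.0, 12.0, 11.0, 9.0, 13.0]

print(aggregate_classification(class_votes))  # spam (3 votes against 2)
print(aggregate_regression(value_preds))      # 11.0
```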

This procedure stabilizes the final prediction because it favors the outcome reached by a majority of trees. However, bagging does not solve another key issue with tree algorithms: features that have high variance or are highly correlated with each other.

The issue is simple: when constructing each tree, we always use all of the features. Some features naturally have higher variance or are highly correlated with each other. These features will always show up as an important feature in an ensemble of CARTs but may mask the true relationships within the dataset.

How Bagging Works

Suppose we have a dataset with many features and observations. First, select a random sample from this dataset with replacement, and repeat this to create multiple sample subsets.
Create a base model for each subset
Train each model on its own subset, in parallel and independently of the others
Combine the predictions of all sub-models to get the final result
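The steps above can be put together in a small end-to-end sketch. To keep it short, the base model here is a decision stump (a tree with a single split) rather than a full CART, and the toy dataset and all function names are assumptions made for illustration.

```python
import random
from collections import Counter

def majority(labels):
    """Most common label in a list."""
    return Counter(labels).most_common(1)[0][0]

def fit_stump(X, y):
    """Fit a one-split tree: pick the feature and threshold with the fewest errors."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[j] <= t]
            right = [lab for row, lab in zip(X, y) if row[j] > t]
            if not left or not right:
                continue  # not a real split
            left_label, right_label = majority(left), majority(right)
            errors = (sum(v != left_label for v in left)
                      + sum(v != right_label for v in right))
            if best is None or errors < best[0]:
                best = (errors, j, t, left_label, right_label)
    if best is None:  # no valid split: always predict the majority class
        m = majority(y)
        return (0, float("inf"), m, m)
    return best[1:]

def predict_stump(stump, row):
    j, t, left_label, right_label = stump
    return left_label if row[j] <= t else right_label

def fit_bagging(X, y, n_models=25, seed=0):
    """Train one stump per bootstrap sample of the rows."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        models.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return models

def predict_bagging(models, row):
    """Combine the sub-model predictions by majority vote."""
    return majority([predict_stump(m, row) for m in models])

# Toy data: the class depends on the first feature; the second is noise.
X = [[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [4.0, 1.0],
     [6.0, 7.0], [7.0, 2.0], [8.0, 9.0], [9.0, 4.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

models = fit_bagging(X, y)
print(predict_bagging(models, [1.5, 6.0]))
print(predict_bagging(models, [8.5, 6.0]))
```

Because each stump is trained independently of the others, the loop in `fit_bagging` could be run in parallel without changing the result.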

Advantages of Bagging

Improves the model by reducing overfitting
Increases the performance and accuracy of the model
Works well on high-dimensional data

Disadvantages of Bagging

It can reduce the interpretability of the model
If not done properly, bagging can result in high bias
Bagging is computationally more expensive than a single model, even though it improves accuracy

Random forests

To address the issue of high-variance or collinear features, one option would be to periodically omit highly variable features or remove some correlated features during the bagging process. This would reveal important tree structures that these features typically hide.

Random forests address this issue by making one minor tweak to the bagging process: instead of using all features while training each tree, a smaller random subset of features is considered. This means problematic features are left out some of the time, and hidden tree structures can begin to emerge after repeating this process many times.

The number of features to be used for branch splitting is a hyperparameter that is introduced in a random forest, and needs to be specified.
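As a rough illustration of this hyperparameter, the sketch below draws a random subset of feature indices for a few splits. The name `max_features` and the square-root rule are assumptions for the example: using about the square root of the number of features is a common default for classification, but the right value depends on the data and is usually tuned.

```python
import math
import random

def sample_features(n_features, max_features, rng):
    """Pick `max_features` distinct feature indices at random."""
    return rng.sample(range(n_features), max_features)

rng = random.Random(0)
n_features = 10

# Assumed heuristic: about sqrt(p) candidate features per split.
max_features = max(1, int(math.sqrt(n_features)))

for split in range(3):
    candidates = sample_features(n_features, max_features, rng)
    print(f"split {split}: candidate features {sorted(candidates)}")
```

Smaller values of `max_features` make the trees less similar to each other, while larger values make the procedure behave more like plain bagging.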

A random forest combines bootstrap sampling with CART models. Suppose we have 1,000 observations and 10 variables. A random forest builds many different CART models from this data: each one is trained on a random sample of the observations using a random subset of the variables. This process is repeated many times, and the final result combines the predictions of all trees: the mean for regression, or the majority vote for classification.

In simple words, a random forest is a collection of randomized decision trees. It relies on two concepts to build the trees whose predictions are combined into the final prediction:

A random sampling of the dataset while building trees
A random subset of features while splitting the nodes

How the Random Forest algorithm works

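Select M random records from the dataset, with replacement
Build a decision tree based on these records, considering only a random subset of features when splitting nodes
Choose the number of trees you need and repeat steps 1 and 2
For a new record, combine the predictions of all trees: majority vote for classification, mean for regression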
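A minimal sketch of this procedure is shown below. It is not a full random forest: each "tree" is a single-split stump, so drawing a feature subset per tree is the same as drawing one per split. The dataset, the function names, and the parameters `n_trees` and `max_features` are assumptions made for the example.

```python
import random
from collections import Counter

def majority(labels):
    """Most common label in a list."""
    return Counter(labels).most_common(1)[0][0]

def fit_stump(X, y, features):
    """Best single split, searching only the given feature indices."""
    best = None
    for j in features:
        for t in sorted({row[j] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[j] <= t]
            right = [lab for row, lab in zip(X, y) if row[j] > t]
            if not left or not right:
                continue  # not a real split
            left_label, right_label = majority(left), majority(right)
            errors = (sum(v != left_label for v in left)
                      + sum(v != right_label for v in right))
            if best is None or errors < best[0]:
                best = (errors, j, t, left_label, right_label)
    if best is None:  # no valid split: always predict the majority class
        m = majority(y)
        return (features[0], float("inf"), m, m)
    return best[1:]

def fit_forest(X, y, n_trees=25, max_features=2, seed=0):
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):                              # step 3: repeat
        rows = [rng.randrange(n) for _ in range(n)]       # step 1: random records
        feats = rng.sample(range(p), max_features)        # random feature subset
        forest.append(fit_stump([X[i] for i in rows],     # step 2: build a tree
                                [y[i] for i in rows], feats))
    return forest

def predict_forest(forest, row):
    """Majority vote over all trees."""
    votes = [(left_label if row[j] <= t else right_label)
             for j, t, left_label, right_label in forest]
    return majority(votes)

# Toy data: features 0 and 1 are correlated and informative; feature 2 is noise.
X = [[1, 10, 5], [2, 11, 3], [3, 12, 8], [4, 13, 1],
     [6, 20, 7], [7, 21, 2], [8, 22, 9], [9, 23, 4]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

forest = fit_forest(X, y)
print(predict_forest(forest, [0.5, 9.0, 5.0]))
print(predict_forest(forest, [9.5, 24.0, 5.0]))
print(Counter(j for j, _, _, _ in forest))  # how often each feature was chosen
```

The last line shows how the split feature varies from tree to tree, which is the effect the random feature subset is meant to produce.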

Out-of-bag estimated performance

During each bootstrap, some samples are not included in the training set for that model. These are called out-of-bag samples, and they can be used as a validation set to evaluate the models that did not see them during training.

By taking the average of the out-of-bag performance, we get an estimated accuracy of the bagged models that is similar to a cross-validation procedure. This is known as out-of-bag accuracy.
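The bookkeeping behind an out-of-bag estimate can be sketched as follows. To keep the example short, the base learner is a simple stand-in (it classifies a point by the nearest class mean) rather than a CART model, and the data and function names are hypothetical. The sketch also reports the average fraction of samples left out of each bootstrap, which for a bootstrap of size n is about (1 - 1/n)^n, or roughly 37% for large n.

```python
import random
from collections import Counter, defaultdict
from statistics import fmean

def fit_centroids(xs, ys):
    """Stand-in base learner: store the mean of x for each class."""
    by_class = defaultdict(list)
    for x, label in zip(xs, ys):
        by_class[label].append(x)
    return {label: fmean(vals) for label, vals in by_class.items()}

def predict_centroids(model, x):
    """Predict the class whose stored mean is closest to x."""
    return min(model, key=lambda label: abs(x - model[label]))

def oob_accuracy(xs, ys, n_models=50, seed=0):
    rng = random.Random(seed)
    n = len(xs)
    votes = defaultdict(list)  # sample index -> votes from models that never saw it
    oob_fractions = []
    for _ in range(n_models):
        drawn = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        model = fit_centroids([xs[i] for i in drawn], [ys[i] for i in drawn])
        out_of_bag = set(range(n)) - set(drawn)        # samples left out
        oob_fractions.append(len(out_of_bag) / n)
        for i in out_of_bag:
            votes[i].append(predict_centroids(model, xs[i]))
    scored = [i for i in range(n) if votes[i]]
    if not scored:
        raise ValueError("no sample was ever out of bag")
    correct = sum(Counter(votes[i]).most_common(1)[0][0] == ys[i] for i in scored)
    return correct / len(scored), fmean(oob_fractions)

# Toy one-dimensional data with two well-separated classes.
xs = [float(v) for v in range(10)] + [float(v) for v in range(20, 30)]
ys = [0] * 10 + [1] * 10

acc, frac = oob_accuracy(xs, ys)
print(f"out-of-bag accuracy: {acc:.2f}")
print(f"average out-of-bag fraction: {frac:.2f}")
```

Each sample is scored only by the models whose bootstrap did not contain it, which is why no separate validation set is needed.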

This is another reason why bootstrap is a powerful statistical method for CART and machine learning models: it introduces a robust statistical method to assess model accuracy.

Advantages

The random forest algorithm has low bias, and averaging many trees reduces variance
It is very stable: adding new data points has little effect on the overall model
Random forests work well for both classification (categorical) and regression (numerical) problems
It can still work effectively when some values are missing from the dataset

Disadvantages

High complexity in structure and process, since the model consists of multiple decision trees
Training takes longer because of this complexity
