Regression Analysis in Machine Learning


July 11, 2022, Learn eTutorial

Regression analysis is a simple method for modeling the relationship between a single predicted value, which we call the target, and one or more predictor values.
Here the target variable depends on the input predictors, while the inputs themselves are independent. We use regression analysis when the target is a continuous value such as salary, age, or weight. Typical applications include

  • Stock predictions
  • Prediction of weather
  • House price predictions

For a better understanding, think of regression analysis as showing how the value of the target changes when any one input value changes while the other input values stay the same.
Let us consider an example: an IT company's data over five consecutive years on the number of employees and the total turnover of the company.

Number of employees    Total turnover
100                    $20000
200                    $60000
300                    $100000
400                    $140000
500                    $190000
700                    ?

The above table contains the number of employees and the corresponding company turnover. As it is an IT company, the employees are the resources, and an increase in the number of resources increases the company's turnover.

Now the company is planning to increase its headcount to 700 employees and wants to know the total turnover for that year. In such cases, we can rely on regression analysis for the best prediction.

Here, regression analyzes the input variable (in our case, the number of employees), establishes a relationship between the variables, and predicts an output, which will be continuous and real, based on these inputs.
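
To make this concrete, below is a minimal sketch of such a prediction using numpy and scikit-learn (the library choice is our assumption; the article itself does not prescribe any tool). It fits a regression line to the table above and predicts the turnover for 700 employees.

  import numpy as np
  from sklearn.linear_model import LinearRegression

  # Employee counts and turnover figures from the table above
  X = np.array([[100], [200], [300], [400], [500]])     # number of employees
  y = np.array([20000, 60000, 100000, 140000, 190000])  # turnover in dollars

  model = LinearRegression()
  model.fit(X, y)

  # Predict the turnover when the headcount grows to 700
  print(model.predict(np.array([[700]])))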

[Figure: regression line fitted through the data points on an input-output graph]

We use a graph to show the regression, with the input and output variables plotted as data points. The regression is a line or a simple curve that passes through the data points on this input-output graph with minimum distance to them. The distance between the regression line and the data points tells us how well the model has captured the relation between the input and output variables.

What is the significance of Regression Analysis?

Now that we understand regression analysis, a natural question arises: why do we need it? In the modern world, many real applications need an accurate continuous prediction that depends on the input variables, and they must be able to tell how a change in an input variable affects the output prediction.
Take the stock market: regression lets us easily identify a stock's trend, which helps with investing. A company can use regression analysis to predict future market sales from previous data and its changes. Similarly, weather forecasting, economic forecasting, and many other applications use regression analysis.

Benefits

  1. It can predict accurate results from independent input variables.
  2. We can understand which input factors drive the prediction strongly and which weakly.
  3. It finds trends in the market.

Common Terms used in Regression Analysis

  1. Target or output: The target variable is the output, or predicted, variable of the regression analysis. It depends on the input variables.
  2. Input or predictor: As the name suggests, this is the input we provide to the regression analysis for prediction. There may be one or many predictor variables, and they are independent.
  3. Outliers: Outliers are input data points that are inaccurate or far from the rest of the data. As the name suggests, such erroneous data can produce erroneous predictions, so we must avoid outliers in regression analysis.
  4. Underfitting: In supervised learning, we use training data to train the algorithm. If the algorithm does not work properly even on the training data (poor results for both training and test data), we call it underfitting.
  5. Overfitting: This is a condition where our algorithm works well on the training data but produces erroneous output on the test data. We call such a situation overfitting (a small sketch illustrating this follows the list).
  6. Collinearity: We expect our input (predictor) variables to be independent. When, in some cases, there is a relation between the input variables, we call it collinearity.
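
To make underfitting and overfitting concrete, here is a hedged sketch (the synthetic dataset and the model choice are illustrative assumptions, not from the article) that compares training and test scores: low scores on both suggest underfitting, while a high training score paired with a much lower test score suggests overfitting.

  import numpy as np
  from sklearn.model_selection import train_test_split
  from sklearn.tree import DecisionTreeRegressor

  # Illustrative synthetic data: y depends on x plus random noise
  rng = np.random.RandomState(0)
  X = rng.uniform(0, 10, size=(200, 1))
  y = 3 * X.ravel() + rng.normal(0, 2, size=200)

  X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

  # An unrestricted tree tends to memorize the training data
  model = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

  # A large gap between these two scores is a sign of overfitting
  print("train R^2:", model.score(X_train, y_train))
  print("test R^2:", model.score(X_test, y_test))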

Types of Regression

There are various types of regression used in machine learning, each with different characteristics and importance. We have to select one depending on the data and our needs. They are:

  1. Linear Regression
  2. Logistic Regression
  3. Polynomial Regression
  4. Decision Tree Regression
  5. Random Forest Regression
  6. Ridge Regression
  7. Lasso Regression
  8. Support Vector Regression

Linear Regression

Linear regression is the most basic type of regression in machine learning. It uses a statistical method for prediction. Linear regression has an input variable, represented on the X-axis of the regression graph, and a target variable, represented on the Y-axis. Linear regression fits a straight line in the regression graph.

In linear regression, if more than one independent variable is present, we call it multiple linear regression.

[Figure: linear regression line fitted to the data points]

Linear regression is expressed by the equation

Y = a + bX + e

  • Y is the target variable
  • X is the input variable
  • b is the slope of the line
  • a is the intercept
  • e is the error
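
The slope and intercept can also be estimated directly with ordinary least squares. The sketch below does this by hand with numpy (an assumed helper library), reusing the employee/turnover data from the table above.

  import numpy as np

  x = np.array([100, 200, 300, 400, 500], dtype=float)
  y = np.array([20000, 60000, 100000, 140000, 190000], dtype=float)

  # Ordinary least squares:
  #   b = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
  #   a = mean(y) - b * mean(x)
  b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
  a = y.mean() - b * x.mean()

  print(f"Y = {a:.1f} + {b:.1f} * X")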

Applications of linear regression include

  1. Home price prediction
  2. Salary forecasting
  3. Traffic prediction
  4. Sales prediction, etc.

Logistic Regression

Logistic regression is another type of regression analysis; it works on the concept of probability and is used when we need to solve a classification problem. This means the output variable of logistic regression is a binary value, ‘0’ or ‘1’.

Logistic regression handles problems that call for classification, such as true or false, yes or no, or spam or not spam.

In logistic regression, we use a sigmoid curve to represent the relation between the input (independent) and output (target) variables. We represent logistic regression as

f(x) = 1 / (1 + e^(-x))

Where,

  • f(x) is the output variable
  • x is the input variable
  • e is the base of the natural logarithm

Finally, plotting the outputs for the input variables produces an S-shaped curve.
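
The sigmoid itself is a one-liner. The sketch below (plain numpy; the 0.5 threshold is the usual convention, assumed here) evaluates it at a few points to show how any real input is squashed into the (0, 1) range and then thresholded into class ‘0’ or ‘1’.

  import numpy as np

  def sigmoid(x):
      # f(x) = 1 / (1 + e^(-x)); the output always lies between 0 and 1
      return 1 / (1 + np.exp(-x))

  for x in [-5, -1, 0, 1, 5]:
      p = sigmoid(x)
      label = 1 if p >= 0.5 else 0  # classify by thresholding the probability at 0.5
      print(f"x={x:>2}  f(x)={p:.3f}  class={label}")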

[Figure: S-shaped sigmoid curve]

We can divide logistic regression into 3 types:

  • Binary, like 0 or 1
  • Multiclass, like categories of fruits
  • Ordinal, like low, medium, high

Polynomial Regression

Polynomial regression is very similar to multiple linear regression, with some modification. In polynomial regression, the relationship between the input and output variables is modeled by an nth-degree polynomial, which means polynomial regression is represented by a nonlinear curve between the X and Y values.

Consider a dataset whose data points, when plotted on a graph, fall in a nonlinear fashion; in such a case, the linear regression method will not work properly. We then need a nonlinear curve connecting the data points, and that is polynomial regression.

In polynomial regression, the real features are converted into polynomial features of some degree, the nth degree, and fitted with a polynomial line.
 

[Figure: polynomial regression curve fitted through nonlinear data points]

We represent the polynomial equation as

Y = θ0 + θ1x + θ2x^2 + ... + θnx^n

  • Y is the output we expect
  • θ0, θ1, ..., θn are the coefficients
  • x is the input variable
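
Here is a hedged sketch of that feature-conversion step using scikit-learn (the toy data and the choice of degree 2 are illustrative assumptions): the input is expanded into polynomial features, and an ordinary linear model is fitted on top.

  import numpy as np
  from sklearn.preprocessing import PolynomialFeatures
  from sklearn.linear_model import LinearRegression

  # Illustrative nonlinear data: y roughly follows x squared
  X = np.array([[1], [2], [3], [4], [5]], dtype=float)
  y = np.array([1.2, 4.1, 8.9, 16.3, 24.8])

  # Convert the real feature into polynomial features of degree 2: (1, x, x^2)
  poly = PolynomialFeatures(degree=2)
  X_poly = poly.fit_transform(X)

  model = LinearRegression().fit(X_poly, y)
  print(model.predict(poly.transform([[6]])))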
     

Decision Tree Regression

Decision tree regression uses a tree structure and can be used for both classification and regression problems. As we know, a decision tree has internal nodes, branches, and leaves, which together handle both categorical and numerical data.

Decision tree structure

  1. A node represents a test on an attribute
  2. A branch represents the outcome of the test
  3. A leaf represents the output or predicted value
  4. The root represents the parent dataset

As we all know, it is a tree structure that starts with the root dataset and splits into left and right child nodes, each representing a subset of the parent dataset. These nodes split again into children of their own, becoming parents in turn. A decision tree is drawn below for a clear understanding.
 

[Figure: decision tree splitting from the root dataset into child nodes and leaves]
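
A minimal scikit-learn sketch of decision tree regression follows (the toy data and the max_depth value are illustrative assumptions); limiting the depth keeps the tree small and easy to read.

  import numpy as np
  from sklearn.tree import DecisionTreeRegressor

  X = np.array([[1], [2], [3], [4], [5], [6]], dtype=float)
  y = np.array([1.5, 2.1, 6.8, 7.2, 12.9, 13.4])

  # A shallow tree: each internal node tests the attribute, each leaf predicts a value
  tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
  print(tree.predict([[3.5]]))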

Random Forest Regression

It is a more complex regression method that combines more than one decision tree. Random forest regression is a very powerful algorithm that can be used for both classification and regression jobs.

Random forest regression predicts the output by combining decision trees and taking the average of each tree's result. The decision trees used in random forest regression are termed base models. A random forest can be represented by the formula

g(x) = f0(x) + f1(x) + f2(x) + ...

Random forest regression helps prevent the problem called overfitting in the model.
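
In code, the averaging over base trees is handled by the library. Here is a hedged scikit-learn sketch (the toy data and the n_estimators value are assumptions):

  import numpy as np
  from sklearn.ensemble import RandomForestRegressor

  X = np.array([[1], [2], [3], [4], [5], [6]], dtype=float)
  y = np.array([1.5, 2.1, 6.8, 7.2, 12.9, 13.4])

  # 100 base decision trees; the forest averages their individual predictions
  forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
  print(forest.predict([[3.5]]))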

[Figure: random forest combining the outputs of multiple decision trees]

Ridge Regression

Ridge regression is a flexible and powerful regression method used when there is high correlation between the input variables. If the collinearity is very high, we add some bias in the ridge regression method. The amount of bias we add is called the penalty in ridge regression. Ridge regression is less susceptible to the overfitting problem.

We can represent Ridge regression using the formula

Loss = Σ(yi − ŷi)^2 + λ Σ βj^2

where λ sets the amount of penalty added to the squared coefficients βj.

Ridge helps solve problems that have a large number of parameters with high correlation between them. Ridge is also used to reduce the complexity of a model, which we call L2 regularization.
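
In scikit-learn, the penalty amount is controlled by the alpha parameter. The sketch below (the correlated toy data and the alpha value are illustrative assumptions) fits ridge regression with an L2 penalty:

  import numpy as np
  from sklearn.linear_model import Ridge

  # Two highly correlated input columns (the second is nearly a copy of the first)
  X = np.array([[1, 1.1], [2, 2.0], [3, 3.1], [4, 3.9], [5, 5.1]])
  y = np.array([2.0, 4.1, 6.2, 7.9, 10.1])

  # alpha is the penalty: larger values add more bias and shrink the coefficients
  ridge = Ridge(alpha=1.0).fit(X, y)
  print(ridge.coef_, ridge.intercept_)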

Lasso Regression

Like ridge regression, lasso regression is also used to reduce the complexity of the model by adding a penalty. The only difference is that in lasso we add the actual (absolute) amount as the penalty, whereas in ridge we use the square of the amount.

Lasso regression can shrink a coefficient all the way to absolute zero. Lasso regression is also called L1 regularization, which is represented by

Loss = Σ(yi − ŷi)^2 + λ Σ |βj|

where the penalty uses the absolute value of the coefficients rather than their squares.
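
A hedged scikit-learn sketch of lasso (the synthetic data and the alpha value are assumptions) shows the L1 penalty driving the coefficients of irrelevant inputs to exactly zero:

  import numpy as np
  from sklearn.linear_model import Lasso

  # Three inputs, but only the first one actually drives the output
  rng = np.random.RandomState(0)
  X = rng.normal(size=(100, 3))
  y = 5 * X[:, 0] + rng.normal(0, 0.1, size=100)

  lasso = Lasso(alpha=0.5).fit(X, y)
  print(lasso.coef_)  # coefficients of the irrelevant inputs shrink to 0.0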

Support Vector Regression

The support vector method can be used for both regression and classification. When we use it for regression, we call it support vector regression (SVR). Support vector regression works with continuous input variables.

In support vector regression, we try to find a line, called the hyperplane, that comes close to as many data points as possible with maximum margin and predicts the continuous variable.

The aim of support vector regression is to place boundary lines around the hyperplane so that they cover the maximum number of data points.

[Figure: support vector regression showing the hyperplane and boundary lines]

In this graph, the green line represents the hyperplane, the dotted lines represent the boundary lines with respect to the hyperplane, and the red dots are data points.
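
A minimal scikit-learn SVR sketch rounds this off (the kernel choice, the epsilon margin width, and the toy data are illustrative assumptions):

  import numpy as np
  from sklearn.svm import SVR

  X = np.array([[1], [2], [3], [4], [5], [6]], dtype=float)
  y = np.array([1.2, 1.9, 3.2, 3.8, 5.1, 6.0])

  # epsilon sets the width of the boundary lines around the hyperplane
  svr = SVR(kernel="linear", epsilon=0.5).fit(X, y)
  print(svr.predict([[7]]))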