In the previous section, we reviewed several ways to extend the use of linear models to be less prone to overfitting. However, if the relationship between the data and the output is truly nonlinear, we need to adopt more complex models.

In polynomial regression, we model the relationship between the independent variable and the predicted output with an n^{th} degree polynomial, which can capture more complex relationships than linear regression.

The equation for polynomial regression is stated below:

y = β_0 + β_1x + β_2x^2 + ... + β_nx^n + ε

We can view polynomial regression as a special case of linear regression: we add n^{th} degree polynomial terms as features to a multiple linear regression model. In simple words, polynomial regression is linear regression with some modifications that improve accuracy on nonlinear data.

Polynomial regression keeps the machinery of linear regression but modifies the features to include more complicated nonlinear functions. In this case, we are working with a dataset that is not linear.
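To make this concrete, here is a minimal sketch of fitting a quadratic polynomial. The data is synthetic (not the dataset used later in this tutorial), generated from hypothetical coefficients chosen for illustration:

```python
import numpy as np

# Synthetic data from a known quadratic: y = 2 + 3x + 0.5x^2 + noise.
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 100)
y = 2 + 3 * x + 0.5 * x**2 + rng.normal(scale=0.1, size=x.size)

# np.polyfit returns the coefficients from the highest degree down.
coefs = np.polyfit(x, y, deg=2)
print(coefs)  # approximately [0.5, 3.0, 2.0]
```

Because the noise is small, the fitted coefficients land close to the true values used to generate the data.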

Importance of polynomial regression

First, a linear regression model is a poor fit for a nonlinear dataset: forcing one through the data produces large errors and very low accuracy. For such nonlinear datasets, we use polynomial regression, which can draw a curve through the maximum number of data points.

For a nonlinear dataset, the data points follow a nonlinear pattern, so we cannot connect them with a straight line. This is easy to see in a graph.

So from the comparison graph, we can see that if the dataset is nonlinear, we need a nonlinear model such as polynomial regression for good accuracy and better results.
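As a quick numeric illustration (a sketch on synthetic data, not the comparison graph described above), we can fit both a straight line and a cubic polynomial to the same nonlinear data and compare the R^2 scores:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Nonlinear synthetic data: a sine curve with a little noise.
rng = np.random.default_rng(1)
X = np.linspace(0, 4, 80).reshape(-1, 1)
y = np.sin(1.5 * X).ravel() + rng.normal(scale=0.05, size=80)

# A straight line cannot follow the curve...
linear = LinearRegression().fit(X, y)

# ...but the same linear-regression machinery on cubic features can.
X_cubic = PolynomialFeatures(degree=3).fit_transform(X)
cubic = LinearRegression().fit(X_cubic, y)

print(f"linear R^2: {linear.score(X, y):.3f}")
print(f"cubic  R^2: {cubic.score(X_cubic, y):.3f}")
```

The cubic model scores markedly higher, which is the comparison the graph above is meant to convey.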

We can use polynomial regression in areas where the input data is not linear, that is, for complex outcomes such as:

- Progress of a pandemic disease
- Tissue growth rate
- Carbon isotope distribution

This tutorial will explore how we can capture more nonlinear trends with polynomial regression.

Recall from the linear regression tutorial that a model is considered linear if the output is linear in the regression coefficients β, even when the predictor x itself has been nonlinearly transformed.

The following equation, for example, is still a linear regression problem, despite being more complex:

y = β_0 + β_1x + β_2x^2 + β_3 log(x)

The equation below, by contrast, is nonlinear, because the output is not linear in the coefficient β_1:

y = β_0 + e^{β_1x}

While we could make our models nonlinear like the equation above, we'll stick to cases where we make the features more complex with higher-degree polynomial terms. The model thus remains linear: each term scales with its coefficient β, and the contributions of the features are additive.
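To see why the squared feature keeps the problem linear, here is a small sketch on synthetic data with hypothetical coefficients: the design matrix contains nonlinear transformations of x, but solving for β is still an ordinary least-squares problem.

```python
import numpy as np

# Synthetic data from y = 1 - 2x + 0.75x^2 + noise (illustrative values).
rng = np.random.default_rng(2)
x = rng.uniform(-2, 2, 60)
y = 1.0 - 2.0 * x + 0.75 * x**2 + rng.normal(scale=0.05, size=60)

# The design matrix [1, x, x^2] holds nonlinear transformations of x,
# but the model y = A @ beta is linear in beta, so least squares solves it.
A = np.column_stack([np.ones_like(x), x, x**2])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
print(beta)  # close to [1.0, -2.0, 0.75]
```

No iterative nonlinear optimization is needed; the closed-form least-squares solution recovers the coefficients directly.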

Advantages of polynomial regression

- The model becomes more accurate when the chosen nonlinear function captures the true relationship between inputs and outputs.
- Knowing the form of the underlying function can describe a natural mechanistic process.
- Can be applied to a large range of functions.

Disadvantages of polynomial regression

- The model can have many parameters, depending on how complex you want the model to be.
- The more complex the model, the more prone the model is to overfitting.
- Sensitive to outliers: their presence makes the fitted curve inaccurate and error-prone.
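The outlier sensitivity is easy to demonstrate with a sketch on synthetic data: corrupting a single point noticeably bends a higher-degree polynomial fit.

```python
import numpy as np

# Nearly linear synthetic data with very little noise.
rng = np.random.default_rng(3)
x = np.linspace(0, 1, 30)
y = 1 + 2 * x + rng.normal(scale=0.02, size=30)

clean = np.polyfit(x, y, deg=5)   # fit without the outlier
y_out = y.copy()
y_out[15] += 10.0                 # corrupt one point
dirty = np.polyfit(x, y_out, deg=5)

# Maximum change in the fitted curve caused by the single outlier.
grid = np.linspace(0, 1, 200)
diff = np.max(np.abs(np.polyval(clean, grid) - np.polyval(dirty, grid)))
print(f"max change in fitted curve: {diff:.2f}")
```

One corrupted observation shifts the fitted curve by a large margin, far beyond the original noise level.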

Let’s consider the Boston dataset, which contains data associated with Boston housing prices. There is a variable called LSTAT, which is the percentage of individuals belonging to a lower status of the population.

Let’s visualize the relationship between LSTAT and the target variable, MDEV, the median housing price in Boston.

```
import matplotlib.pyplot as plt

# df (the Boston features) and target (MDEV) are assumed to be loaded already.
plt.figure(dpi=200)
plt.scatter(df["LSTAT"], target)
plt.xlabel("LSTAT")
plt.ylabel("MDEV")
plt.title("Relationship between LSTAT and MDEV")
plt.show()
```

The relationship between MDEV and LSTAT is not linear. If we try to fit a line through this dataset, we won’t get great results, as the nonlinear data will skew the line.

Let’s try to create a simple polynomial model through these two variables. Specifically, we can use the following relationship to construct the model:

MDEV = β_0 + β_1 LSTAT + β_2 LSTAT^2

```
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Create the polynomial dataset with the squared term.
poly = PolynomialFeatures(degree=2)
lstat_pol = poly.fit_transform(pd.DataFrame(Xtrain["LSTAT"]))

# Fit a linear model on the polynomial features.
pol_model = LinearRegression().fit(lstat_pol, Ytrain)

# Plot the training data (red) against the model's predictions (blue).
plt.figure(dpi=200)
plt.scatter(Xtrain["LSTAT"], Ytrain, color='red')
plt.scatter(Xtrain["LSTAT"], pol_model.predict(lstat_pol), color='blue')
plt.title("Polynomial regression prediction")
plt.xlabel('LSTAT')
plt.ylabel('MDEV')
plt.show()
```

Even though we chose a relatively simple polynomial, just by eyeballing the relationship between the training dataset (red) and the prediction (blue), we can already tell that the model fits the data much better than a straight line.

If we choose to select a more complex model, we can potentially get an even better fit for the data. However, as we continue to add more polynomial terms, we run the risk of overfitting. This is something we need to systematically evaluate, using a method like cross-validation or prior knowledge about the dataset distribution.
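For example, a cross-validation sweep over candidate degrees might look like the sketch below. The data here is synthetic (generated from a known quadratic with illustrative coefficients), not the Boston dataset:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data from y = 0.5x^2 - x + noise.
rng = np.random.default_rng(4)
X = rng.uniform(-3, 3, 120).reshape(-1, 1)
y = 0.5 * X.ravel()**2 - X.ravel() + rng.normal(scale=1.0, size=120)

# Score each candidate degree with 5-fold cross-validation.
for degree in (1, 2, 5, 10):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"degree {degree:2d}: mean CV R^2 = {scores.mean():.3f}")
```

On data like this, degree 2 clearly beats degree 1, while the cross-validated scores flag when extra degrees stop helping: picking the smallest degree with a competitive score guards against overfitting.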

Unfortunately, polynomial regression can take a lot of time to train, depending on the computational complexity of the problem - in other words, the complexity of the equation we try to fit will impact the computation time. Additionally, as the model becomes more complex, it becomes more prone to overfitting. Thus, polynomial regression can be used to model simple non-linear relationships but may take some time to fine-tune and train in real-world situations.