In the previous lecture, we discussed linear regression, which fits a straight line relating the dependent variable to the independent variables. A straight line, however, cannot capture every relationship, so we introduced polynomial regression to model nonlinear functions. We also saw that the more polynomial terms we add, the more prone the model becomes to overfitting.
To fit the complex shapes that describe real data, we need a way to design flexible functions without overfitting. One solution combines linear and nonlinear pieces into a single model fit to the data points; this method is called regression splines.
Regression splines overcome the disadvantages of both linear and polynomial regression. In linear regression, the dataset is treated as a single whole; with splines, we split the data into several parts, each called a bin. The points at which we divide the data are called knots, and we fit a different function within each bin. These separate functions are called piecewise functions; in the simplest case, where each function is a constant, they are piecewise step functions.
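As a minimal sketch of this idea (the data and knot positions here are invented for illustration), a piecewise step function fits a constant in each bin:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 9, 90)
y = np.sin(x) + 0.2 * rng.standard_normal(90)

knots = [3.0, 6.0]            # assumed knot positions dividing x into 3 bins
bins = np.digitize(x, knots)  # bin index (0, 1, or 2) for each point

# Fit the simplest model per bin: a constant equal to the bin's mean of y
step_fit = np.array([y[bins == b].mean() for b in range(3)])[bins]
```

Each point's prediction is just the average of its bin, so the fitted curve is a staircase with jumps at the knots.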
Splines are a way to fit a high-degree polynomial function by breaking it up into smaller piecewise polynomial functions. For each polynomial, we fit a separate model and connect them all together.
As noted above, linear regression gives only a straight line, and polynomial regression risks overfitting. Spline regression was designed to combine the good properties of both. While this sounds complicated, breaking the curve into smaller polynomials over separate sections decreases the risk of overfitting.
Because a spline breaks up a polynomial into smaller pieces, we need to determine where to break up the polynomial. The point where this division occurs is called a knot.
In the example above, each P_x represents a knot. The knots at the ends of the curve are known as boundary knots, while the knots within the curve are known as internal knots.
While we can visually inspect where to place these knots, we need to devise systematic methods to select knots.
Some strategies include:
- Placing knots at uniform intervals across the range of the predictor.
- Placing knots at quantiles of the data, so each bin contains roughly the same number of observations.
- Treating the number of knots as a tuning parameter and choosing it by cross-validation.
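As a quick sketch of the quantile strategy (the feature x here is synthetic), knot locations can be computed directly with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=500)  # a skewed, synthetic feature

# Three internal knots at the quartiles: each bin holds ~25% of the data
knots = np.quantile(x, [0.25, 0.50, 0.75])

# Uniform alternative: equally spaced knots ignore where the data is dense
uniform_knots = np.linspace(x.min(), x.max(), 5)[1:-1]
```

For skewed data like this, quantile placement puts more knots where observations are concentrated, while uniform placement wastes knots in the sparse tail.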
The mathematics for splines can seem complicated without knowing some calculus and properties of piecewise functions. We’ll discuss the intuition beneath these algorithms.
If you’re interested in the specific mathematics underpinning splines, we refer you to The Elements of Statistical Learning, 2nd Edition by Trevor Hastie, Robert Tibshirani, and Jerome Friedman. This intermediate to advanced textbook is an essential read for aspiring data scientists.
Cubic splines require that we connect these different polynomial functions smoothly. This means that the first and second derivatives of these functions must be continuous. The plot below shows a cubic spline and how the first derivative is a continuous function.
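This smoothness condition can be checked numerically. In the sketch below (with invented data points), we evaluate the first and second derivatives of a SciPy cubic spline just to the left and right of an internal knot; the values match, confirming continuity:

```python
import numpy as np
from scipy.interpolate import CubicSpline

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.0, 0.8, 0.9, 0.1, -0.8])

spl = CubicSpline(x, y)

# Evaluate derivatives just left and right of the internal knot at x = 2.
eps = 1e-6
d1_left = spl(2.0 - eps, 1)   # second argument is the derivative order
d1_right = spl(2.0 + eps, 1)
d2_left = spl(2.0 - eps, 2)
d2_right = spl(2.0 + eps, 2)
```

The separate cubic pieces on either side of the knot meet with matching slope and curvature, which is exactly the constraint that makes the joined curve look like one smooth function.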
Polynomial functions and other kinds of splines tend to have bad fits near the ends of the functions. This variability can have huge consequences, particularly in forecasting. Natural splines resolve this issue by forcing the function to be linear after the boundary knots.
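SciPy's interpolating CubicSpline can illustrate the natural boundary condition: passing bc_type='natural' forces the second derivative to zero at the boundary knots, so the curve straightens out toward the ends rather than swinging wildly (the data here are synthetic):

```python
import numpy as np
from scipy.interpolate import CubicSpline

x = np.linspace(0, 10, 8)
y = np.sin(x)

# bc_type='natural' zeroes the second derivative at the boundary knots,
# flattening the curve toward a straight line at the ends
nat = CubicSpline(x, y, bc_type='natural')

curv_left = nat(x[0], 2)    # second derivative at the left boundary
curv_right = nat(x[-1], 2)  # ... and at the right boundary
```

Zero curvature at the boundaries is what tames the erratic end behavior that plain polynomials and unconstrained splines exhibit.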
Finally, we can consider the regularized version of a spline: the smoothing spline. The cost function is penalized if the variability of the coefficient is high. Below is a plot that shows a situation where smoothing splines are needed to get an adequate model fit.
To implement splines in Python, you can use the SciPy library. A useful example can be found here.
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng()
x = np.linspace(-3, 3, 50)
y = np.exp(-x**2) + 0.1 * rng.standard_normal(50)
plt.plot(x, y, 'ro', ms=5)

spl = UnivariateSpline(x, y)
xs = np.linspace(-3, 3, 1000)
plt.plot(xs, spl(xs), 'g', lw=3)

spl.set_smoothing_factor(0.5)
plt.plot(xs, spl(xs), 'b', lw=3)
plt.show()
We have covered many non-linear regression models that are commonly used. In each case, we found the functions tended to be variants of linear models, but we stacked different layers of complexity. Generalized additive models (GAMs) can be considered to be a generalization of the methods covered so far.
For each regression method described so far, we added up the contribution of each feature x_p to predict some outcome y_i.
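In symbols, this linear combination can be written as:

$$ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_P x_{iP} + \epsilon_i $$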
In all of the cases described so far, we forced the relationship between each coefficient β_p and its feature x_p to be linear. The same went for polynomial regression - we just raised x_p to whatever power best mapped to y.
With GAMs, we assert that we can add whatever function we want to the model and predict y by adding these functions up.
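Concretely, the model replaces each linear term β_p x_p with a function of that feature:

$$ y_i = \beta_0 + f_1(x_{i1}) + f_2(x_{i2}) + \dots + f_P(x_{iP}) + \epsilon_i $$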
Here each function f_p can be any linear or nonlinear function relating y to x_p; in the GAM literature these are usually called smooth functions. (The term link function, strictly speaking, refers to the function that connects the expected value of y to the sum of the f_p, as in generalized linear models.)
GAMs are incredibly powerful and are easy to interpret due to the additive nature of the model and the flexibility built into the framework. Additionally, the method is regularized to avoid overfitting, adding to the appeal of GAMs for complex regression tasks and forecasting.
GAMs can be implemented using the statsmodels library.
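As a minimal sketch (the data here are synthetic, and the penalty weight alpha=1.0 is an assumed value rather than a tuned one), a GAM with a B-spline smoother can be fit with statsmodels as follows:

```python
import numpy as np
from statsmodels.gam.api import GLMGam, BSplines

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = np.sin(x) + 0.3 * rng.standard_normal(200)

# B-spline basis for the single feature; df and degree are given per column
bs = BSplines(x[:, None], df=[8], degree=[3])

# alpha weights the smoothing penalty (assumed value; in practice, tune it
# with cross-validation or the model's penalty-selection utilities)
gam = GLMGam(y, smoother=bs, alpha=1.0)
res = gam.fit()
y_hat = res.fittedvalues
```

The fitted values track the underlying sine curve while the penalty keeps the spline from chasing the noise.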