Linear Models and OLS: using cross-validation in Python

AKSHAY
4 min read · Feb 10, 2020


Hello friends, today I am going to explain the use of cross-validation in Python with a simple example. Please go through cross-validation theory first.

Regression refers to the prediction of a continuous variable (income, age, height, etc.) using a dataset’s features. A linear model is a model of the form:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_K x_K + \epsilon$$

Here $\epsilon$ is an error term; the predicted value for $y$ is given by

$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_K x_K$$

so the prediction misses the true value by $\epsilon = y - \hat{y}$.

$\epsilon$ is almost never zero, so for regression we must measure “accuracy” differently. The sum of squared errors (SSE) is the sum

$$\text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

(letting $y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \ldots + \beta_K x_{K,i} + \epsilon_i$ and $\hat{y}_i$ defined analogously).

We might define the “most accurate” regression model as the model that minimizes the SSE. However, when measuring performance, the mean squared error (MSE) is often used. The MSE is given by

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
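
As a quick sanity check, the MSE can also be computed by hand with NumPy and compared against scikit-learn’s mean_squared_error(); the numbers below are made up purely for illustration.

import numpy as np
from sklearn.metrics import mean_squared_error

# Toy values, just for illustration
y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.5, 7.0, 11.0])

sse = np.sum((y_true - y_pred) ** 2)            # sum of squared errors
mse = sse / len(y_true)                         # mean squared error
print(mse, mean_squared_error(y_true, y_pred))  # the two values should match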

Ordinary least squares (OLS) is a procedure for finding a linear model that minimizes the SSE on a dataset, and it is the simplest procedure for fitting a linear model. To evaluate the model’s performance we may split a dataset into a training set and a test set, and evaluate the trained model by computing the MSE of its predictions on the test set. If the model has a high MSE on both the training and test sets, it is under-fitting. If it has a small MSE on the training set and a high MSE on the test set, it is over-fitting.

With OLS the most important decision is which features to use in prediction and how to use them. “Linear” means linear in coefficients only; these models can handle many kinds of functions.

For example, $y = \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon$ and $y = \beta_0 + \beta_1 \log(x) + \epsilon$ are linear models (linear in the coefficients), while a model such as $y = \beta_0 + e^{\beta_1 x} + \epsilon$ is non-linear.
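
As a small sketch (the synthetic data and coefficients here are made up for illustration), LinearRegression can capture a curved relationship simply by being handed a squared feature:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(200, 1))
# True relationship is quadratic in x, but still linear in the coefficients
y = 1.0 + 2.0 * x[:, 0] - 0.5 * x[:, 0] ** 2 + rng.normal(scale=0.3, size=200)

X = np.hstack([x, x ** 2])                # add x^2 as an extra feature
quad_model = LinearRegression().fit(X, y)
quad_model.intercept_, quad_model.coef_   # roughly recovers 1.0 and (2.0, -0.5)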

Many approaches exist for deciding which features to include. For now we will only use cross-validation.

Fitting a Linear Model with OLS

OLS is supported by the LinearRegression object in scikit-learn, while the function mean_squared_error() computes the MSE.

I will be using OLS to find a linear model for predicting home prices in the Boston house price dataset, loaded below.

from sklearn.datasets import load_boston   # note: removed in newer scikit-learn releases
from sklearn.model_selection import train_test_split

boston_obj = load_boston()
data, price = boston_obj.data, boston_obj.target
data[:5, :]
price[:5]

# Split into training and test sets
data_train, data_test, price_train, price_test = train_test_split(data, price)
data_train[:5, :]
price_train[:5]

We will go ahead and use all features for prediction in our first linear model. (In general this does not necessarily produce better models; some features may introduce only noise that makes prediction more difficult, not less.)

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np
ols1 = LinearRegression()
ols1.fit(data_train, price_train) # Fitting a linear model
ols1.predict([[ # An example prediction
1, # Per capita crime rate
25, # Proportion of land zoned for large homes
5, # Proportion of land zoned for non-retail business
1, # Tract bounds the Charles River
0.3, # NOX concentration
10, # Average number of rooms per dwelling
2, # Proportion of owner-occupied units built prior to 1940
10, # Weighted distance to employment centers
3, # Index for highway accessibility
400, # Tax rate
15, # Pupil/teacher ratio
200, # Index for number of blacks
5 # % lower status of population
]])
OUTPUT:
array([ 38.4845554])

Now predicting on the training data and computing the in-sample MSE:

predprice = ols1.predict(data_train)
predprice[:5]
OUTPUT:
array([ 22.93387135, 23.01140472, 29.89639538, 18.66387952, 27.20777345])

mean_squared_error(price_train, predprice)
OUTPUT:
22.682766677837442

np.sqrt(mean_squared_error(price_train, predprice))
OUTPUT:
4.7626428249279247

The square root of the mean squared error can be interpreted as the average amount of error; in this case, the average difference between homes’ actual and predicted prices. (This is almost the standard deviation of the error.)
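
One way to see this, sketched below on the training predictions computed above: when the residuals average out to roughly zero (as they do for OLS with an intercept), their standard deviation essentially coincides with the RMSE.

residuals = price_train - predprice
np.sqrt(np.mean(residuals ** 2))   # RMSE, same value as above
residuals.std()                    # nearly identical when the mean residual is ~0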

For cross-validation, I will use cross_val_score(), which performs the entire cross-validation process.

from sklearn.model_selection import cross_val_score

ols2 = LinearRegression()
ols_cv_mse = cross_val_score(ols2, data_train, price_train, scoring='neg_mean_squared_error', cv=10)
ols_cv_mse.mean()
OUTPUT:
-25.52170955017451

The above number is the negative of the average cross-validation MSE; scikit-learn negates it because its scoring convention treats higher scores as better (minimizing the MSE is equivalent to maximizing the negative MSE). This is close to our in-sample MSE.
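
Flipping the sign recovers the cross-validated MSE directly, and taking square roots gives a per-fold RMSE; a minimal follow-up to the call above:

-ols_cv_mse.mean()            # average cross-validated MSE
np.sqrt(-ols_cv_mse).mean()   # average cross-validated RMSE across the 10 folds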

Let’s now see the MSE for the fitted model on the test set.

testpredprice = ols1.predict(data_test)
mean_squared_error(price_test, testpredprice)
OUTPUT:
20.972466191523765
np.sqrt(mean_squared_error(price_test, testpredprice))
OUTPUT:
4.5795705247898262

Over-fitting seems to be minimal: the test-set MSE is close to both the training MSE and the cross-validated MSE. Starting from this basic idea, we can apply cross-validation to other datasets and other candidate models to detect over-fitting, as in the comparison sketched below.
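
For instance, a rough sketch of using cross_val_score() to compare the full feature set with a smaller subset (the column indices are chosen here only for illustration):

# Hypothetical comparison: all features vs. three hand-picked columns
subset_cols = [5, 10, 12]   # RM, PTRATIO and LSTAT in the Boston feature order

full_scores = cross_val_score(LinearRegression(), data_train, price_train,
                              scoring='neg_mean_squared_error', cv=10)
subset_scores = cross_val_score(LinearRegression(), data_train[:, subset_cols], price_train,
                                scoring='neg_mean_squared_error', cv=10)

# The higher (less negative) mean score has the lower cross-validated MSE
full_scores.mean(), subset_scores.mean()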

Thanks for reading.
