# Linear Models and OLS: Using Cross-Validation in Python

Hello friends, today I am going to explain the use of cross-validation in Python with a simple example. Please go through cross-validation theory first.

Regression refers to the prediction of a continuous variable (income, age, height, etc.) using a dataset’s features. A linear model is a model of the form:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_K x_K + \epsilon$$

Here $\epsilon$ is an error term; the predicted value for $y$ is given by

$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_K x_K$$

so,

$$y = \hat{y} + \epsilon$$

$\epsilon$ is almost never zero, so for regression we must measure “accuracy” differently. The sum of squared errors (SSE) is the sum

$$\text{SSE} = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$

(letting $y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \dots + \beta_K x_{K,i} + \epsilon_i$ and $\hat{y}_i$ defined analogously).

We might define the “most accurate” regression model as the one that minimizes the SSE. However, when measuring performance, the mean squared error (MSE) is often used instead. The MSE is given by

$$\text{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 = \frac{\text{SSE}}{N}$$
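As an illustration, the SSE and MSE can be computed directly with NumPy. The actual and predicted values below are made up purely for the example:

```python
import numpy as np

# Hypothetical actual and predicted values for five observations
y = np.array([3.0, 5.0, 2.5, 7.0, 4.5])
y_hat = np.array([2.8, 5.4, 2.0, 7.1, 4.0])

residuals = y - y_hat          # the error for each observation
sse = np.sum(residuals ** 2)   # sum of squared errors
mse = sse / len(y)             # mean squared error = SSE / N

print(sse)  # SSE = 0.71 (up to floating-point rounding)
print(mse)  # MSE = 0.142
```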

Ordinary least squares (OLS) is a procedure for finding a linear model that minimizes the SSE on a dataset; it is the simplest procedure for fitting a linear model. To evaluate the model’s performance, we may split the dataset into training and test sets and compute the MSE of the model’s predictions on the test set. If the model has a high MSE on both the training and test sets, it is under-fitting. If it has a small MSE on the training set but a high MSE on the test set, it is over-fitting.

With OLS, the most important decision is which features to use in prediction and how to use them. “Linear” means linear in the coefficients only; with appropriately transformed features, these models can capture many kinds of functional relationships.
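For example, a quadratic relationship can still be fit with a linear model by adding x² as a feature. This is a minimal sketch using synthetic, noise-free data invented for the illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data following y = 1 + 2x + 3x^2 (no noise), purely illustrative
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=50)
y = 1 + 2 * x + 3 * x ** 2

# The model is still linear in its coefficients even though the
# features include x^2; we simply add x^2 as a second column.
X = np.column_stack([x, x ** 2])
model = LinearRegression().fit(X, y)

print(model.intercept_)  # approximately 1
print(model.coef_)       # approximately [2, 3]
```

Since the model remains linear in the coefficients, OLS recovers them exactly here.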

Many approaches exist for deciding which features to include. For now we will only use cross-validation.

## Fitting a Linear Model with OLS

OLS is supported by the `LinearRegression` object in scikit-learn, while the function `mean_squared_error()` computes the MSE.

I will be using OLS to find a linear model for predicting home prices in the Boston house price dataset, loaded below.

```python
from sklearn.datasets import load_boston   # note: removed in scikit-learn 1.2
from sklearn.model_selection import train_test_split

boston_obj = load_boston()
data, price = boston_obj.data, boston_obj.target
data[:5, :]
price[:5]

data_train, data_test, price_train, price_test = train_test_split(data, price)
data_train[:5, :]
price_train[:5]
```

We will go ahead and use all features for prediction in our first linear model. (In general this does not necessarily produce better models; some features may introduce only noise that makes prediction more difficult, not less.)

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

ols1 = LinearRegression()
ols1.fit(data_train, price_train)    # Fitting a linear model

ols1.predict([[    # An example prediction
    1,      # Per capita crime rate
    25,     # Proportion of land zoned for large homes
    5,      # Proportion of land zoned for non-retail business
    1,      # Tract bounds the Charles River
    0.3,    # NOX concentration
    10,     # Average number of rooms per dwelling
    2,      # Proportion of owner-occupied units built prior to 1940
    10,     # Weighted distance to employment centers
    3,      # Index for highway accessibility
    400,    # Tax rate
    15,     # Pupil/teacher ratio
    200,    # Index for number of blacks
    5       # % lower status of population
]])
# OUTPUT: array([ 38.4845554])
```

Predicting on the training set:

```python
predprice = ols1.predict(data_train)
predprice[:5]
# OUTPUT: array([ 22.93387135,  23.01140472,  29.89639538,  18.66387952,  27.20777345])

mean_squared_error(price_train, predprice)
# OUTPUT: 22.682766677837442

np.sqrt(mean_squared_error(price_train, predprice))
# OUTPUT: 4.7626428249279247
```

The square root of the mean squared error can be interpreted as the average amount of error; in this case, the average difference between homes’ actual and predicted prices. (This is almost the standard deviation of the error.)

For cross-validation, I will use `cross_val_score()`, which performs the entire cross-validation process.

```python
from sklearn.model_selection import cross_val_score

ols2 = LinearRegression()
ols_cv_mse = cross_val_score(ols2, data_train, price_train,
                             scoring='neg_mean_squared_error', cv=10)
ols_cv_mse.mean()
# OUTPUT: -25.52170955017451
```

The above number is the negative average MSE for cross-validation (minimizing MSE is equivalent to maximizing the negative MSE). This is close to our in-sample MSE.
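To recover interpretable per-fold error values, we can flip the sign of the scores and take the square root. The sketch below is self-contained, using synthetic data from `make_regression` in place of the Boston dataset:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for the Boston features
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

scores = cross_val_score(LinearRegression(), X, y,
                         scoring='neg_mean_squared_error', cv=10)

fold_rmse = np.sqrt(-scores)   # negate, then take the square root
print(fold_rmse.mean())        # average RMSE across the ten folds
```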

Let’s now see the MSE for the fitted model on the test set.

```python
testpredprice = ols1.predict(data_test)
mean_squared_error(price_test, testpredprice)
# OUTPUT: 20.972466191523765

np.sqrt(mean_squared_error(price_test, testpredprice))
# OUTPUT: 4.5795705247898262
```

Over-fitting appears to be minimal. Starting from this basic idea, we can apply cross-validation to other datasets to check whether a model is over-fitting.