Hello friends! Today I am going to explain the use of cross-validation with a simple example in Python. Please go through cross-validation theory before reading on.

**Regression** refers to the prediction of a continuous variable (income, age, height, etc.) using a dataset’s features. A **linear model** is a model of the form:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_K x_K + \epsilon$$

Here $\epsilon$ is an **error term**; the predicted value for $y$ is given by

$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_K x_K$$

so,

$$\epsilon = y - \hat{y}$$

$\epsilon$ is almost never zero, so for regression we must measure “accuracy” differently. The **sum of squared errors (SSE)** is the sum

$$\mathrm{SSE} = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$

(letting $y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \dots + \beta_K x_{K,i} + \epsilon_i$ and $\hat{y}_i$ defined analogously).

We might define the “most accurate” regression model as the model that minimizes the SSE. However, when measuring performance, the **mean squared error (MSE)** is often used. The MSE is given by

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N} (y_i - \hat{y}_i)^2 = \frac{\mathrm{SSE}}{N}$$
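As a quick sanity check, SSE and MSE can be computed directly from the definitions above. This is a minimal sketch with made-up numbers, not data from the Boston dataset:

```python
import numpy as np

y = np.array([3.0, 5.0, 7.0])      # actual values (made up)
y_hat = np.array([2.5, 5.0, 8.0])  # predicted values (made up)

errors = y - y_hat
sse = np.sum(errors ** 2)  # sum of squared errors: 0.25 + 0 + 1.0
mse = sse / len(y)         # mean squared error: SSE / N

print(sse)  # 1.25
print(mse)  # ~0.4167
```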

**Ordinary least squares (OLS)** is a procedure for finding a linear model that minimizes the SSE on a dataset. This is the simplest procedure for fitting a linear model on a dataset. To evaluate the model’s performance we may split a dataset into training and test set, and evaluate the trained model’s performance by computing the MSE of the model’s predictions on the test set. If the model has a high MSE on both the training and test set, it’s under-fitting. If it has a small MSE on the training set and a high MSE on the test set, it is over-fitting.

With OLS the most important decision is which features to use in prediction and how to use them. “Linear” means linear in coefficients only; these models can handle many kinds of functions.
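To illustrate that last point, here is a small sketch (with synthetic data I made up, not the Boston dataset): a model quadratic in $x$ is still *linear in its coefficients*, so OLS can fit it once we treat $x$ and $x^2$ as two separate features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
# True relationship: y = 1 + 2x - 0.5x^2 plus a little noise
y = 1.0 + 2.0 * x - 0.5 * x ** 2 + rng.normal(scale=0.1, size=200)

# Nonlinear in x, but linear in the coefficients b0, b1, b2,
# so ordinary least squares applies directly.
X = np.column_stack([x, x ** 2])
model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)  # approximately 1.0 and [2.0, -0.5]
```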

Many approaches exist for deciding which features to include. For now we will only use cross-validation.

**Fitting a Linear Model with OLS**

OLS is supported by the `LinearRegression` object in **scikit-learn**, while the function `mean_squared_error()` computes the MSE.

I will be using OLS to find a linear model for predicting home prices in the Boston house price dataset, loaded below.

```python
from sklearn.datasets import load_boston  # removed in scikit-learn >= 1.2
from sklearn.model_selection import train_test_split

boston_obj = load_boston()
data, price = boston_obj.data, boston_obj.target
data[:5, :]
price[:5]

data_train, data_test, price_train, price_test = train_test_split(data, price)
data_train[:5, :]
price_train[:5]
```

We will go ahead and use all features for prediction in our first linear model. (In general this does *not* necessarily produce better models; some features may introduce only noise that makes prediction *more* difficult, not less.)

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

ols1 = LinearRegression()
ols1.fit(data_train, price_train)  # Fitting a linear model

ols1.predict([[  # An example prediction
    1,    # Per capita crime rate
    25,   # Proportion of land zoned for large homes
    5,    # Proportion of land zoned for non-retail business
    1,    # Tract bounds the Charles River
    0.3,  # NOX concentration
    10,   # Average number of rooms per dwelling
    2,    # Proportion of owner-occupied units built prior to 1940
    10,   # Weighted distance to employment centers
    3,    # Index for highway accessibility
    400,  # Tax rate
    15,   # Pupil/teacher ratio
    200,  # Index related to Black population proportion (B)
    5     # % lower status of the population
]])
```

OUTPUT:

```
array([ 38.4845554])
```

Predicting on the training set and computing the in-sample MSE:

```python
predprice = ols1.predict(data_train)
predprice[:5]
```

OUTPUT:

```
array([ 22.93387135, 23.01140472, 29.89639538, 18.66387952, 27.20777345])
```

```python
mean_squared_error(price_train, predprice)
```

OUTPUT:

```
22.682766677837442
```

```python
np.sqrt(mean_squared_error(price_train, predprice))
```

OUTPUT:

```
4.7626428249279247
```

The square root of the mean squared error can be interpreted as the average amount of error; in this case, the average difference between homes’ actual and predicted prices. (This is almost the standard deviation of the error.)
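To see why the RMSE is only *almost* the standard deviation of the error, here is a toy sketch with made-up residuals: the standard deviation measures spread around the *mean* error, so the two agree exactly only when the mean error is zero.

```python
import numpy as np

errors = np.array([1.0, -2.0, 0.5, 1.5])  # made-up residuals y - y_hat

rmse = np.sqrt(np.mean(errors ** 2))  # root mean *squared* error
std = errors.std()                    # spread around the mean error (0.25 here)

print(rmse)  # ~1.3693
print(std)   # ~1.3463, slightly smaller because the mean error is nonzero
```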

For cross-validation, I will use `cross_val_score()`, which performs the entire cross-validation process.

```python
from sklearn.model_selection import cross_val_score

ols2 = LinearRegression()
ols_cv_mse = cross_val_score(ols2, data_train, price_train, scoring='neg_mean_squared_error', cv=10)
ols_cv_mse.mean()
```

OUTPUT:

```
-25.52170955017451
```

The above number is the negative average MSE for cross-validation (minimizing MSE is equivalent to maximizing the negative MSE). This is close to our in-sample MSE.

Let’s now see the MSE for the fitted model on the test set.

```python
testpredprice = ols1.predict(data_test)
mean_squared_error(price_test, testpredprice)
```

OUTPUT:

```
20.972466191523765
```

```python
np.sqrt(mean_squared_error(price_test, testpredprice))
```

OUTPUT:

```
4.5795705247898262
```

Over-fitting appears to be minimal: the test-set MSE is close to the training-set MSE. Using this same basic procedure, we can apply cross-validation to other datasets to check whether a model is over-fitting.

Thanks for reading.