Hello! This article explains the ideas behind descriptive statistics, covering both the theory and the code. I hope readers enjoy it.

Often in data analysis projects we begin with descriptive statistics to get a sense of a dataset’s properties. Fortunately it is easy to get these statistics from Pandas `DataFrame`s.

I illustrate by computing various descriptive statistics for the classic [iris dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set).

The following code loads the packages we will need and also the `iris` dataset.

import pandas as pd

from pandas import DataFrame

# sklearn.datasets includes common example datasets

from sklearn.datasets import load_iris

# load_iris() is a function that loads the iris dataset

iris_obj = load_iris()

# Dataset preview

iris_obj.data

# Names of the columns

iris_obj.feature_names

# Target variable

iris_obj.target

# Names of the target classes

iris_obj.target_names

`load_iris()` loads an object containing the iris dataset, which I stored in `iris_obj`. I now turn this into a `DataFrame`.

# Combine the measurements and the species codes into one DataFrame

iris = DataFrame(iris_obj.data, columns=iris_obj.feature_names).join(DataFrame(iris_obj.target, columns=["species"]))

iris # displays the iris data

# Replace the numeric species codes with the species names

iris["species"] = iris["species"].replace({0: 'setosa', 1: 'versicolor', 2: 'virginica'})

iris # displays the labeled data

For this particular dataset, the grouping by species suggests that descriptive statistics should be done on groups. We create the groups like so.

iris_grps = iris.groupby("species")

# Print the four measurement columns for each species

for name, data in iris_grps:

    print(name)

    print("---------------------\n\n")

    print(data.iloc[:, 0:4])

    print("\n\n\n")

A lot of the methods for getting summary statistics for a *`DataFrame`* also work for group objects.
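As a rough illustration of what a group object offers, the sketch below builds a small made-up frame (the `toy` data and its column names are hypothetical, not the iris data) and uses `size()` and `get_group()` to inspect the groups.

```python
import pandas as pd

# Hypothetical toy data standing in for a grouped dataset
toy = pd.DataFrame({
    "species": ["a", "a", "b"],
    "length": [1.0, 2.0, 3.0],
})

groups = toy.groupby("species")

sizes = groups.size()            # number of rows in each group
group_a = groups.get_group("a")  # rows belonging to group "a"

print(sizes)
print(group_a)
```

The same pattern should carry over to `iris_grps`, with the species names as group keys.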

# Getting the Basics

Let’s compute some basic statistics.

The **count** is the number of non-missing values in each column. The iris dataset has no missing values, so here it equals the number of rows. It can be obtained via `count()`.

iris.count()

output:

sepal length (cm) 150

sepal width (cm) 150

petal length (cm) 150

petal width (cm) 150

species 150

dtype: int64
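To see how `count()` relates to the number of rows, here is a minimal sketch on made-up data (the `toy` frame is hypothetical). A column with a missing value reports a smaller count than `len()` does.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in column "b"
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, np.nan, 6.0]})

n_rows = len(toy)          # number of rows, missing values included
non_missing = toy.count()  # non-missing values per column

print(n_rows)
print(non_missing)
```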

The **sample mean** is the arithmetic mean of the dataset.

# numeric_only=True skips the non-numeric species column

iris.mean(numeric_only=True) # Sample mean for every numeric column

output:

sepal length (cm) 5.843333

sepal width (cm) 3.054000

petal length (cm) 3.758667

petal width (cm) 1.198667

dtype: float64
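For a single column, the sample mean is just the sum of the values divided by the sample size. A small sketch with made-up numbers (the values in `s` are hypothetical, not taken from iris) compares the manual calculation with `mean()`.

```python
import pandas as pd

# Hypothetical measurements
s = pd.Series([5.1, 4.9, 4.7, 4.6, 5.0])

manual_mean = s.sum() / len(s)  # sum of the values divided by the sample size
pandas_mean = s.mean()

print(manual_mean, pandas_mean)
```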

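The **sample median** is the “middle” data point after ordering the dataset. Let x(i) denote the ordered data, so that x(1) is the smallest value and x(n) the largest. For odd n the median is x((n+1)/2); for even n it is the average of x(n/2) and x(n/2+1).

iris.median(numeric_only=True) # Sample median for every numeric column

output: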

sepal length (cm) 5.80

sepal width (cm) 3.0

petal length (cm) 4.35

petal width (cm) 1.30

dtype: float64
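The ordering-based definition of the median can be sketched directly. The helper `manual_median` and the sample values below are hypothetical and only illustrate the odd and even cases.

```python
import pandas as pd

def manual_median(values):
    """Median via sorting: the middle value, or the mean of the two middle values."""
    x = sorted(values)
    n = len(x)
    mid = n // 2
    if n % 2 == 1:
        return x[mid]                 # odd n: single middle value
    return (x[mid - 1] + x[mid]) / 2  # even n: average of the two middle values

odd_sample = [5.1, 4.9, 4.7, 4.6, 5.0]
even_sample = [5.1, 4.9, 4.7, 4.6]

odd_median = manual_median(odd_sample)
even_median = manual_median(even_sample)

print(odd_median, pd.Series(odd_sample).median())
print(even_median, pd.Series(even_sample).median())
```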

The **sample variance** is a measure of dispersion, roughly the “average” squared distance of a data point from the mean. The **standard deviation** is the square root of the variance and interpreted as the “average” distance a data point is from the mean.

iris.var(numeric_only=True) # Sample variance for every numeric column

output:

sepal length (cm) 0.685694

sepal width (cm) 0.188004

petal length (cm) 3.113179

petal width (cm) 0.582414

dtype: float64

iris.std(numeric_only=True) # Sample standard deviation for every numeric column

output:

sepal length (cm) 0.828066

sepal width (cm) 0.433594

petal length (cm) 1.764420

petal width (cm) 0.763161

dtype: float64
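The word “average” is in quotes because the sample variance divides by n − 1 rather than n, which is also what `var()` and `std()` do by default (`ddof=1`). A minimal sketch with made-up values (the series `s` is hypothetical):

```python
import math
import pandas as pd

# Hypothetical values with mean 5
s = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(s)

squared_dev = (s - s.mean()) ** 2            # squared distance from the mean
sample_var = squared_dev.sum() / (n - 1)     # divides by n - 1 (pandas default)
population_var = squared_dev.sum() / n       # divides by n (ddof=0)
sample_std = math.sqrt(sample_var)           # standard deviation is the square root

print(sample_var, s.var())
print(population_var, s.var(ddof=0))
print(sample_std, s.std())
```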

The 𝑝th percentile is a number such that roughly 𝑝% of the data is less than it. Percentiles are also referred to as quantiles, with the level written as a fraction, so the 10th percentile is the 0.1 quantile.

iris.quantile(.1, numeric_only=True) # The 10th percentile

output:

sepal length (cm) 4.8

sepal width (cm) 2.5

petal length (cm) 1.4

petal width (cm) 0.2

Name: 0.1, dtype: float64

iris.quantile(.95, numeric_only=True) # The 95th percentile

output:

sepal length (cm) 7.255

sepal width (cm) 3.800

petal length (cm) 6.100

petal width (cm) 2.300

Name: 0.95, dtype: float64

iris.quantile(.75, numeric_only=True) # Commonly known as the third quartile

output:

sepal length (cm) 6.4

sepal width (cm) 3.3

petal length (cm) 5.1

petal width (cm) 1.8

Name: 0.75, dtype: float64

iris.quantile(.25, numeric_only=True) # Commonly known as the first quartile

output:

sepal length (cm) 5.1

sepal width (cm) 2.8

petal length (cm) 1.6

petal width (cm) 0.3

Name: 0.25, dtype: float64
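A quantile does not have to be one of the observed values: by default `quantile()` interpolates linearly between neighbouring ordered data points. The sketch below reproduces that default on made-up data. The helper `linear_quantile` and the series `s` are hypothetical names introduced only for illustration.

```python
import math
import pandas as pd

def linear_quantile(values, q):
    """Quantile with linear interpolation between the two nearest ordered values."""
    x = sorted(values)
    pos = q * (len(x) - 1)  # fractional position in the ordered data
    lo = math.floor(pos)
    hi = math.ceil(pos)
    frac = pos - lo
    return x[lo] + frac * (x[hi] - x[lo])

# Hypothetical values
s = pd.Series([1.0, 2.0, 3.0, 4.0, 10.0])

q25 = linear_quantile(s, 0.25)
q90 = linear_quantile(s, 0.9)

print(q25, s.quantile(0.25))
print(q90, s.quantile(0.9))
```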

If 𝑄𝑖 denotes the 𝑖th quartile, the **interquartile range** (**IQR**) is the difference between the third quartile and the first quartile, 𝑄3 − 𝑄1.

# pandas has no dedicated method for computing the IQR, but it is nevertheless easy to obtain

iris.quantile(.75, numeric_only=True) - iris.quantile(.25, numeric_only=True)

output:

sepal length (cm) 1.3

sepal width (cm) 0.5

petal length (cm) 3.5

petal width (cm) 1.5

dtype: float64

Other interesting quantities include the maximum and minimum values.

iris.max()

output:

sepal length (cm) 7.9

sepal width (cm) 4.4

petal length (cm) 6.9

petal width (cm) 2.5

species virginica

dtype: object

iris.min()

output:

sepal length (cm) 4.3

sepal width (cm) 2

petal length (cm) 1

petal width (cm) 0.1

species setosa

dtype: object

Many of these summaries work for grouped data as well.

iris_grps.mean()

output:

| species | sepal length | sepal width | petal length | petal width |
|---|---|---|---|---|
| setosa | 5.006 | 3.418 | 1.464 | 0.244 |
| versicolor | 5.936 | 2.770 | 4.260 | 1.326 |
| virginica | 6.588 | 2.974 | 5.552 | 2.026 |

# Third quartile for each species

iris_grps.quantile(.75)

# IQR for each species

iris_grps.quantile(.75) - iris_grps.quantile(.25)
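Several built-in summaries can also be requested for each group in one call by passing a list of names to `agg()`. A minimal sketch on made-up data (the `toy` frame and its columns are hypothetical):

```python
import pandas as pd

# Hypothetical toy data with two groups
toy = pd.DataFrame({
    "species": ["a", "a", "a", "b", "b", "b"],
    "length": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

# One row per group, one column per (variable, statistic) pair
summary = toy.groupby("species").agg(["mean", "std"])

print(summary)
```

The result has two-level column labels, so a single value is selected with a tuple such as `summary.loc["a", ("length", "mean")]`.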

# Other Useful Methods

The method `describe()` gets a number of useful summaries for a dataset.

iris.describe()

# This also works well for grouped data.

iris_grps.describe()
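By default `describe()` reports the quartiles, but the `percentiles` argument lets you choose other levels. A small sketch on made-up data (the `toy` frame is hypothetical):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"length": [1.0, 2.0, 3.0, 4.0, 5.0]})

# Ask for the 10th, 50th and 90th percentiles instead of the default quartiles
summary = toy.describe(percentiles=[0.1, 0.5, 0.9])

print(summary)
```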

If we want custom numerical summaries, we can write functions to compute them for Pandas `Series`, then apply them to the columns of a `DataFrame`.

I demonstrate by writing a function that computes the **range**, which is the difference between the maximum and minimum of a dataset.

# Compute the range of a dataset

def range_stat(s):

    return s.max() - s.min()

iris.iloc[:, 0:4].apply(range_stat)

output:

sepal length (cm) 3.6

sepal width (cm) 2.4

petal length (cm) 5.9

petal width (cm) 2.4

dtype: float64

# Use aggregate() for groups

iris_grps.aggregate(range_stat)
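Several custom summaries can be computed per group at once by passing a list of functions. The sketch below uses made-up data, and `iqr_stat` is a hypothetical helper added only for illustration alongside a range function like the one above.

```python
import pandas as pd

def range_stat(s):
    """Difference between the largest and smallest value."""
    return s.max() - s.min()

def iqr_stat(s):
    """Difference between the third and first quartile."""
    return s.quantile(0.75) - s.quantile(0.25)

# Hypothetical toy data with two groups
toy = pd.DataFrame({
    "species": ["a"] * 4 + ["b"] * 4,
    "length": [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0],
})

# One row per group, one column per custom function
custom = toy.groupby("species")["length"].agg([range_stat, iqr_stat])

print(custom)
```

The function names become the column labels of the result.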

Thanks for reading. If this article was helpful, please subscribe to this Medium blog.
