Descriptive Statistics with Pandas on Iris Data (Beginner)

5 min read · Feb 4, 2020

Hello. This article introduces the ideas behind descriptive statistics, covering both the theory and the code. I hope readers enjoy it.

Often in data analysis projects we begin with descriptive statistics to get a sense of a dataset’s properties. Fortunately it is easy to get these statistics from Pandas `DataFrame`s.

I illustrate by computing various descriptive statistics for the classic [iris dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set).

The following code loads in the packages we will need and also the `iris` dataset.

import pandas as pd
from pandas import DataFrame
from sklearn.datasets import load_iris   # sklearn.datasets includes common example datasets

# load_iris() is a function that loads the iris dataset
iris_obj = load_iris()

# Dataset preview
iris_obj.data              # Measurements (one row per flower)
iris_obj.feature_names     # Names of the columns
iris_obj.target            # Target variable (species encoded as 0, 1, 2)
iris_obj.target_names      # Names of the target classes

`load_iris()` loads in an object containing the iris dataset, which I stored in `iris_obj`. I now turn this into a `DataFrame`.

# Build the feature columns, then join the target column (both use the default 0..n-1 index)
iris = DataFrame(iris_obj.data, columns=iris_obj.feature_names).join(
    DataFrame(iris_obj.target, columns=["species"])
)
iris    # prints iris data

# Replace the integer codes with the species names
iris["species"] = iris["species"].replace({0: 'setosa', 1: 'versicolor', 2: 'virginica'})
iris    # prints labeled data

Because every flower in this dataset belongs to one of three species, it makes sense to compute descriptive statistics per species as well as for the dataset as a whole. We create the groups like so.

iris_grps = iris.groupby("species")

# Print the four measurement columns for each species
for name, data in iris_grps:
    print(name)
    print("---------------------\n\n")
    print(data.iloc[:, 0:4])
    print("\n\n\n")
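To see what a group object holds without printing every row, a small sketch can help. The frame `toy` below is a made-up stand-in for `iris` (its name and values are illustrative, not taken from the dataset); `size()` and `get_group()` are standard pandas groupby methods.

```python
import pandas as pd

# Toy stand-in for the iris frame (values are made up for illustration)
toy = pd.DataFrame({
    "petal length (cm)": [1.4, 1.5, 4.7, 4.5, 6.0, 5.1],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

toy_grps = toy.groupby("species")

# Number of rows in each group
group_sizes = toy_grps.size()

# Pull out the rows belonging to a single group
setosa_rows = toy_grps.get_group("setosa")

print(group_sizes)
print(setosa_rows)
```

On the real data, the same two calls on `iris_grps` give a quick check that the grouping worked before computing any statistics.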

A lot of the methods for getting summary statistics for a `DataFrame` also work for group objects.

Getting the Basics

Let’s compute some basic statistics.

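The count is the number of non-missing values in each column, and can be obtained via `count()`. The iris data has no missing values, so every column's count equals the number of rows in the dataset.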

iris.count()
output:
sepal length (cm)    150
sepal width (cm)     150
petal length (cm)    150
petal width (cm)     150
species              150
dtype: int64
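The difference between the count and the number of rows only shows up when values are missing. Here is a minimal sketch on a made-up frame (the name `toy` and its values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing value in column "a"
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

non_missing = toy.count()   # counts non-missing values per column
n_rows = len(toy)           # counts rows, missing or not

print(non_missing)
print(n_rows)
```

Column `a` has a count of 2 while the frame has 3 rows, so `count()` doubles as a quick check for missing data.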

The sample mean is the arithmetic mean of the dataset.

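# numeric_only=True skips the text column "species" (needed in recent pandas versions)
iris.mean(numeric_only=True)    # Sample mean for every numeric column
output: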
sepal length (cm)    5.843333 
sepal width (cm)     3.054000 
petal length (cm)    3.758667 
petal width (cm)     1.198667 
dtype: float64

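The sample median is the “middle” data point after ordering the dataset. Let x(i) denote the ordered data (x(1) is the smallest, x(n) the largest). The median is x((n+1)/2) when n is odd, and the average of x(n/2) and x(n/2+1) when n is even.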

iris.median(numeric_only=True)    # Sample median for every numeric column
output:
sepal length (cm)    5.80 
sepal width (cm)     3.0 
petal length (cm)    4.35 
petal width (cm)     1.30
dtype: float64
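The "order the data, then take the middle" idea can be sketched by hand. The helper `median_by_sorting` below is a hypothetical name written for this illustration, and the numbers are made up; the result is compared against pandas' own `median()`.

```python
import pandas as pd

def median_by_sorting(values):
    """Median computed by ordering the data and taking the middle."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the single middle value
        return ordered[mid]
    # Even n: average of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

odd = [5.1, 4.9, 4.7]          # made-up values
even = [5.1, 4.9, 4.7, 4.6]    # made-up values

print(median_by_sorting(odd), pd.Series(odd).median())
print(median_by_sorting(even), pd.Series(even).median())
```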

The sample variance is a measure of dispersion, roughly the “average” squared distance of a data point from the mean. The standard deviation is the square root of the variance and interpreted as the “average” distance a data point is from the mean.

iris.var(numeric_only=True)    # Sample variance for every numeric column
output:
sepal length (cm)    0.685694
sepal width (cm)     0.188004
petal length (cm)    3.113179
petal width (cm)     0.582414
dtype: float64

iris.std(numeric_only=True)    # Sample standard deviation for every numeric column
output:
sepal length (cm)    0.828066
sepal width (cm)     0.433594
petal length (cm)    1.764420
petal width (cm)     0.763161
dtype: float64
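The word "roughly" matters here: pandas divides the sum of squared deviations by n − 1 rather than n by default (`ddof=1`). The following sketch works through this by hand on made-up numbers; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # made-up data

n = len(s)
mean = s.mean()
squared_dev = ((s - mean) ** 2).sum()   # sum of squared distances from the mean

sample_var = squared_dev / (n - 1)      # what s.var() computes by default
population_var = squared_dev / n        # what s.var(ddof=0) computes

print(sample_var, s.var())
print(population_var, s.var(ddof=0))
print(s.std(ddof=0))                    # square root of the ddof=0 variance
```

With a dataset as large as iris (150 rows) the two versions differ only slightly, but on small samples the choice of `ddof` is noticeable.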

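The 𝑝th percentile is the number such that roughly 𝑝% of the data is less than it. This number is also referred to as a quantile. Note that `quantile()` takes a fraction between 0 and 1, so `.1` requests the 10th percentile, and that the result may be interpolated between two data points rather than being a value that appears in the dataset.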

iris.quantile(.1, numeric_only=True)   # The 10th percentile
output:
sepal length (cm)    4.8
sepal width (cm)     2.5
petal length (cm)    1.4
petal width (cm)     0.2
Name: 0.1, dtype: float64

iris.quantile(.95, numeric_only=True)    # The 95th percentile
output:
sepal length (cm)    7.255
sepal width (cm)     3.800
petal length (cm)    6.100
petal width (cm)     2.300
Name: 0.95, dtype: float64

iris.quantile(.75, numeric_only=True)    # Commonly known as the third quartile
output:
sepal length (cm)    6.4
sepal width (cm)     3.3
petal length (cm)    5.1
petal width (cm)     1.8
Name: 0.75, dtype: float64

iris.quantile(.25, numeric_only=True)    # Commonly known as the first quartile
output:
sepal length (cm)    5.1
sepal width (cm)     2.8
petal length (cm)    1.6
petal width (cm)     0.3
Name: 0.25, dtype: float64
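A value such as 7.255 above is not a measurement in the dataset, because pandas interpolates linearly between neighbouring data points by default. Here is a small sketch on made-up numbers that also shows how to request several quantiles in one call:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])  # made-up data

# Several quantiles at once: pass a list of fractions
quartiles = s.quantile([0.25, 0.5, 0.75])

# 10th percentile: position (n - 1) * 0.1 = 0.4,
# i.e. 40% of the way from 1.0 to 2.0
p10 = s.quantile(0.1)

# interpolation="lower" returns an actual data point instead
p10_lower = s.quantile(0.1, interpolation="lower")

print(quartiles)
print(p10, p10_lower)
```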

If 𝑄𝑖 denotes the 𝑖th quartile, the interquartile range (IQR) is 𝑄3 − 𝑄1, the difference between the third quartile and the first quartile.

# pandas has no built-in method for the IQR, but it is nevertheless easy to obtain
iris.quantile(.75, numeric_only=True) - iris.quantile(.25, numeric_only=True)
output:
sepal length (cm)    1.3 
sepal width (cm)     0.5 
petal length (cm)    3.5 
petal width (cm)     1.5 
dtype: float64
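If the IQR is needed in more than one place, the subtraction can be wrapped in a small helper. The function name `iqr` and the frame `toy` below are assumptions made for this sketch, not part of pandas or of the iris data.

```python
import pandas as pd

def iqr(s):
    """Interquartile range of a Series: third quartile minus first quartile."""
    return s.quantile(0.75) - s.quantile(0.25)

# Made-up numeric frame
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# apply() calls iqr once per column
iqr_per_column = toy.apply(iqr)
print(iqr_per_column)
```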

Other interesting quantities include the maximum and minimum values.

iris.max()
output:
sepal length (cm)          7.9
sepal width (cm)           4.4
petal length (cm)          6.9
petal width (cm)           2.5
species              virginica
dtype: object

iris.min()
output:
sepal length (cm)       4.3
sepal width (cm)          2
petal length (cm)         1
petal width (cm)        0.1
species              setosa
dtype: object

Many of these summaries work for grouped data as well.

iris_grps.mean()
output:
            sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
species
setosa                  5.006             3.418              1.464             0.244
versicolor              5.936             2.770              4.260             1.326
virginica               6.588             2.974              5.552             2.026

# Third quartile for each species
iris_grps.quantile(.75)

# IQR for each species
iris_grps.quantile(.75) - iris_grps.quantile(.25)
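Several grouped summaries can also be requested in a single call by passing a list of statistic names to `agg()`. The sketch below uses a made-up frame (`toy`, with illustrative values) rather than the real iris data.

```python
import pandas as pd

# Made-up stand-in for the iris frame
toy = pd.DataFrame({
    "petal length (cm)": [1.4, 1.6, 4.0, 5.0],
    "species": ["setosa", "setosa", "versicolor", "versicolor"],
})

# One row per species, one column per requested statistic
summary = toy.groupby("species")["petal length (cm)"].agg(["mean", "min", "max"])
print(summary)
```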

Other Useful Methods

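The method `describe()` gets a number of useful summaries for a dataset in one call: for each numeric column it reports the count, mean, standard deviation, minimum, quartiles, and maximum.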

iris.describe()

# This also works well for grouped data.
iris_grps.describe()
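The percentiles reported by `describe()` can be changed through its `percentiles` parameter. This is a minimal sketch on made-up numbers; the median is listed explicitly here so the output does not depend on whether a given pandas version adds it automatically.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])  # made-up data

# Ask for the 10th, 50th and 90th percentiles instead of the quartiles
summary = s.describe(percentiles=[0.1, 0.5, 0.9])
print(summary)
```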

If we want custom numerical summaries, we can write functions to compute them for Pandas Series then apply them to the columns of a DataFrame.

I demonstrate by writing a function that computes the range, which is the difference between the maximum and minimum of a dataset.

# Compute the range of a dataset
def range_stat(s):
    return s.max() - s.min()

iris.iloc[:, 0:4].apply(range_stat)
output:
sepal length (cm)    3.6
sepal width (cm)     2.4
petal length (cm)    5.9
petal width (cm)     2.4
dtype: float64

# Use aggregate() for groups
iris_grps.aggregate(range_stat)
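Custom functions can be mixed with built-in statistics when aggregating groups. One way to do this is pandas' named aggregation, sketched below on a made-up frame; the output column names `petal_range` and `petal_mean` are arbitrary choices for this example.

```python
import pandas as pd

def range_stat(s):
    """Difference between the maximum and minimum of a Series."""
    return s.max() - s.min()

# Made-up stand-in for the iris frame
toy = pd.DataFrame({
    "petal length (cm)": [1.4, 1.6, 4.0, 5.0],
    "species": ["setosa", "setosa", "versicolor", "versicolor"],
})

# Each keyword names an output column: (input column, function or statistic name)
per_species = toy.groupby("species").agg(
    petal_range=("petal length (cm)", range_stat),
    petal_mean=("petal length (cm)", "mean"),
)
print(per_species)
```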

Thanks for reading. If you found this article helpful, please subscribe to this Medium blog.

Written by A AKSHAY