Descriptive Statistics with Pandas on the Iris Data (Beginner)

A AKSHAY
5 min read · Feb 4, 2020

Hello! This article aims to convey the idea behind descriptive statistics. I explain both the theory and the code, and I hope readers will enjoy it.

Often in data analysis projects we begin with descriptive statistics to get a sense of a dataset’s properties. Fortunately it is easy to get these statistics from Pandas `DataFrame`s.

I illustrate by computing various descriptive statistics for the classic [iris dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set).

The following code loads in the packages we will need and also the `iris` dataset.

import pandas as pd
from pandas import DataFrame
from sklearn.datasets import load_iris  # sklearn.datasets includes common example datasets

# A function to load in the iris dataset
iris_obj = load_iris()

# Dataset preview
iris_obj.data           # Feature values
iris_obj.feature_names  # Names of the columns
iris_obj.target         # Target variable (encoded as 0, 1, 2)
iris_obj.target_names   # Names of the target classes

`load_iris()` loads in an object containing the iris dataset, which I stored in `iris_obj`. I now turn this into a `DataFrame`.

# Build the feature columns, then attach the target as a "species" column
iris = DataFrame(iris_obj.data, columns=iris_obj.feature_names)
iris["species"] = iris_obj.target
iris  # prints iris data

# Replace the numeric codes with species names (assign back rather than using inplace=True)
iris["species"] = iris["species"].replace({0: 'setosa', 1: 'versicolor', 2: 'virginica'})
iris  # prints labeled data

For this particular dataset, the grouping by species suggests that descriptive statistics should be computed per group. We create the groups like so.

iris_grps = iris.groupby("species")
for name, data in iris_grps:
    print(name)
    print("---------------------\n\n")
    print(data.iloc[:, 0:4])
    print("\n\n\n")
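
To make the group object a little more concrete, here is a minimal sketch on a tiny made-up DataFrame (the name `toy` and its values are illustrative assumptions, not the real iris data). It shows how one might check the size of each group and pull out a single group as an ordinary `DataFrame`.

```python
import pandas as pd

# Hypothetical miniature stand-in for the iris DataFrame (values are made up)
toy = pd.DataFrame({
    "petal length (cm)": [1.4, 1.3, 4.7, 4.5, 6.0, 5.1],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

toy_grps = toy.groupby("species")

# Number of rows in each group
group_sizes = toy_grps.size()
print(group_sizes)

# Rows belonging to one group, returned as a regular DataFrame
setosa_rows = toy_grps.get_group("setosa")
print(setosa_rows)
```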

A lot of the methods for getting summary statistics for a `DataFrame` also work for group objects.

Getting the Basics

Let’s compute some basic statistics.

The count is the number of non-missing values in each column; when no data is missing, it equals the number of rows in the dataset. It can be obtained via `count()`.

iris.count()
output:
sepal length (cm) 150
sepal width (cm) 150
petal length (cm) 150
petal width (cm) 150
species 150
dtype: int64
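
The iris data has no missing values, so every count is 150. As a small sketch of what happens otherwise, the made-up frame below (an assumption for illustration) contains missing entries, and `count()` then differs from the number of rows.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column
toy = pd.DataFrame({
    "sepal length (cm)": [5.1, 4.9, np.nan, 6.3],
    "species": ["setosa", "setosa", "versicolor", None],
})

n_rows = len(toy)          # total number of rows, missing or not
non_missing = toy.count()  # non-missing entries per column

print(n_rows)
print(non_missing)
```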

The sample mean is the arithmetic mean of the dataset.

iris.mean(numeric_only=True)    # Sample mean for every numeric column (skips the text column "species")
output:
sepal length (cm) 5.843333
sepal width (cm) 3.054000
petal length (cm) 3.758667
petal width (cm) 1.198667
dtype: float64

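The sample median is the “middle” data point, after ordering the dataset. Let x(i) denote the i-th ordered data point (x(1) is the smallest, x(n) the largest). When n is odd the median is x((n+1)/2); when n is even it is the average of x(n/2) and x(n/2 + 1).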

iris.median(numeric_only=True)    # Sample median for every numeric column
output:
sepal length (cm) 5.80
sepal width (cm) 3.0
petal length (cm) 4.35
petal width (cm) 1.30
dtype: float64
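
The definition of the median in terms of ordered data can be sketched by hand and compared with pandas. The helper `manual_median` and the sample values are hypothetical, chosen only for illustration.

```python
import pandas as pd

def manual_median(values):
    """Median from ordered data: the middle point, or the mean of the two middle points."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

odd = [5.1, 4.9, 6.3, 5.8, 7.0]   # n = 5, middle point is the 3rd ordered value
even = [5.1, 4.9, 6.3, 5.8]       # n = 4, average of the 2nd and 3rd ordered values

print(manual_median(odd), pd.Series(odd).median())
print(manual_median(even), pd.Series(even).median())
```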

The sample variance is a measure of dispersion, roughly the “average” squared distance of a data point from the mean. The standard deviation is the square root of the variance and can be interpreted as the “average” distance of a data point from the mean.

iris.var(numeric_only=True)    # Sample variance for every numeric column
output:
sepal length (cm) 0.685694
sepal width (cm) 0.188004
petal length (cm) 3.113179
petal width (cm) 0.582414
dtype: float64
iris.std(numeric_only=True)    # Sample standard deviation for every numeric column
output:
sepal length (cm) 0.828066
sepal width (cm) 0.433594
petal length (cm) 1.764420
petal width (cm) 0.763161
dtype: float64
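
The word “average” is in quotes because the sample variance divides by n - 1 rather than n. A minimal sketch on made-up numbers (the series `x` is an assumption) shows that pandas uses the n - 1 version by default, with `ddof=0` giving the divide-by-n version.

```python
import numpy as np
import pandas as pd

x = pd.Series([5.1, 4.9, 4.7, 4.6, 5.0])  # illustrative values

n = len(x)
squared_dev = (x - x.mean()) ** 2

sample_var = squared_dev.sum() / (n - 1)  # divide by n - 1
population_var = squared_dev.sum() / n    # divide by n

print(sample_var, x.var())                # pandas default is ddof=1
print(population_var, x.var(ddof=0))      # matches the divide-by-n version
print(np.sqrt(sample_var), x.std())       # standard deviation is the square root
```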

The 𝑝th percentile is the value such that roughly 𝑝% of the data is less than it. Expressed as a fraction between 0 and 1 instead of a percentage, the same value is referred to as a quantile, which is the form `quantile()` expects.

iris.quantile(.1, numeric_only=True)   # The 10th percentile
output:
sepal length (cm) 4.8
sepal width (cm) 2.5
petal length (cm) 1.4
petal width (cm) 0.2
Name: 0.1, dtype: float64
iris.quantile(.95, numeric_only=True)  # The 95th percentile
output:
sepal length (cm) 7.255
sepal width (cm) 3.800
petal length (cm) 6.100
petal width (cm) 2.300
Name: 0.95, dtype: float64
iris.quantile(.75, numeric_only=True)  # Commonly known as the third quartile
output:
sepal length (cm) 6.4
sepal width (cm) 3.3
petal length (cm) 5.1
petal width (cm) 1.8
Name: 0.75, dtype: float64
iris.quantile(.25, numeric_only=True)  # Commonly known as the first quartile
output:
sepal length (cm) 5.1
sepal width (cm) 2.8
petal length (cm) 1.6
petal width (cm) 0.3
Name: 0.25, dtype: float64
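
A quantile does not always coincide with an actual data point, because `quantile()` interpolates linearly between neighbours by default. The sketch below uses a made-up four-point series to show this, along with the `interpolation` parameter and the option of passing several quantiles at once.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])  # illustrative values

q25_linear = s.quantile(0.25)                         # default: interpolation="linear"
q25_lower = s.quantile(0.25, interpolation="lower")   # nearest data point below
q25_higher = s.quantile(0.25, interpolation="higher") # nearest data point above

# Several quantiles in one call; the result is indexed by the quantile
several = s.quantile([0.25, 0.5, 0.75])

print(q25_linear)  # 1.75
print(q25_lower, q25_higher)
print(several)
```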

If 𝑄𝑖 denotes the 𝑖th quartile, the interquartile range (IQR) is 𝑄3 − 𝑄1, the difference between the third quartile and the first quartile.

# There is no function for computing the IQR but it is nevertheless easy to obtain
iris.quantile(.75, numeric_only=True) - iris.quantile(.25, numeric_only=True)
output:
sepal length (cm) 1.3
sepal width (cm) 0.5
petal length (cm) 3.5
petal width (cm) 1.5
dtype: float64
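
If the IQR is needed more than once, one option is to wrap the subtraction in a small helper. The function `iqr` and the made-up frame below are assumptions for illustration; `select_dtypes` is used to keep only the numeric columns before applying it.

```python
import pandas as pd

def iqr(s):
    """Interquartile range of a Series: third quartile minus first quartile."""
    return s.quantile(0.75) - s.quantile(0.25)

# Hypothetical data with one non-numeric column
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [10.0, 20.0, 30.0, 40.0, 50.0],
    "label": ["x", "x", "y", "y", "y"],
})

numeric = toy.select_dtypes(include="number")  # drops the "label" column
iqr_per_column = numeric.apply(iqr)
print(iqr_per_column)
```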

Other interesting quantities include the maximum and minimum values.

iris.max()
output:
sepal length (cm) 7.9
sepal width (cm) 4.4
petal length (cm) 6.9
petal width (cm) 2.5
species virginica
dtype: object
iris.min()
output:
sepal length (cm) 4.3
sepal width (cm) 2
petal length (cm) 1
petal width (cm) 0.1
species setosa
dtype: object
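
Besides the extreme values themselves, it is sometimes useful to know which row they come from. A minimal sketch with `idxmax()` and `idxmin()` on made-up data (the frame `toy` is an assumption):

```python
import pandas as pd

# Hypothetical three-row sample
toy = pd.DataFrame({
    "sepal length (cm)": [5.1, 7.9, 4.3],
    "species": ["setosa", "virginica", "setosa"],
})

# Index labels of the rows holding the largest and smallest value
row_of_max = toy["sepal length (cm)"].idxmax()
row_of_min = toy["sepal length (cm)"].idxmin()

# Look up the species recorded in those rows
print(toy.loc[row_of_max, "species"])
print(toy.loc[row_of_min, "species"])
```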

Many of these summaries work for grouped data as well.

iris_grps.mean()
output:
species | sepal length | sepal width | petal length | petal width
setosa | 5.006 | 3.418 | 1.464 | 0.244
versicolor | 5.936 | 2.770 | 4.260 | 1.326
virginica | 6.588 | 2.974 | 5.552 | 2.026

iris_grps.quantile(.75)
iris_grps.quantile(.75) - iris_grps.quantile(.25)
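
Several grouped summaries can also be requested in one call by passing a list of method names to `agg()`. The sketch below assumes a small made-up frame rather than the real iris data.

```python
import pandas as pd

# Hypothetical miniature stand-in for the iris DataFrame
toy = pd.DataFrame({
    "petal length (cm)": [1.4, 1.6, 4.0, 5.0, 5.5, 6.5],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# One row per species, one column per requested statistic
summary = toy.groupby("species")["petal length (cm)"].agg(["mean", "min", "max"])
print(summary)
```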

Other Useful Methods

The method `describe()` gets a number of useful summaries for a dataset.

iris.describe()
# This also works well for grouped data.
iris_grps.describe()
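
By default `describe()` reports the quartiles and skips non-numeric columns. As a rough sketch on made-up data, the `percentiles` and `include` parameters can change both behaviours.

```python
import pandas as pd

# Hypothetical data with one numeric and one text column
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "label": ["a", "a", "b", "b", "b"],
})

# Ask for the 10th and 90th percentiles (the median is always included)
custom = toy["x"].describe(percentiles=[0.1, 0.9])
print(custom)

# Summarise every column, including the text one (adds unique/top/freq rows)
full = toy.describe(include="all")
print(full)
```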

If we want custom numerical summaries, we can write functions to compute them for Pandas Series then apply them to the columns of a DataFrame.

I demonstrate by writing a function that computes the range, which is the difference between the maximum and minimum of a dataset.

# Compute the range of a dataset
def range_stat(s):
    return s.max() - s.min()

iris.iloc[:, 0:4].apply(range_stat)
output:
sepal length (cm) 3.6
sepal width (cm) 2.4
petal length (cm) 5.9
petal width (cm) 2.4
dtype: float64
# Use aggregate() for groups
iris_grps.aggregate(range_stat)
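
A custom function can also sit alongside built-in summaries in a single grouped table. The sketch below uses named aggregation on a made-up frame; the output column names `petal_mean` and `petal_range` are arbitrary choices for illustration.

```python
import pandas as pd

def range_stat(s):
    """Difference between the maximum and minimum of a Series."""
    return s.max() - s.min()

# Hypothetical miniature stand-in for the iris DataFrame
toy = pd.DataFrame({
    "petal length (cm)": [1.4, 1.6, 4.0, 5.0, 5.5, 6.5],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# Each keyword names an output column: (input column, function or method name)
summary = toy.groupby("species").agg(
    petal_mean=("petal length (cm)", "mean"),
    petal_range=("petal length (cm)", range_stat),
)
print(summary)
```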

Thanks for reading. If you found this article helpful, please subscribe to this Medium blog.

