Descriptive Statistics with Pandas on the Iris Data (Beginner)
Hello! This article aims to explain the ideas behind descriptive statistics, covering both the theory and the code. I hope readers enjoy it.
Often in data analysis projects we begin with descriptive statistics to get a sense of a dataset’s properties. Fortunately it is easy to get these statistics from Pandas `DataFrame`s.
I illustrate by computing various descriptive statistics for the classic [iris dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set).
The following code loads in the packages we will need and also the `iris` dataset.
import pandas as pd
from pandas import DataFrame
# sklearn.datasets includes common example datasets
from sklearn.datasets import load_iris

# load_iris() is a function that loads the iris dataset
iris_obj = load_iris()

# Dataset preview
iris_obj.data
# Names of the columns
iris_obj.feature_names
# Target variable (integer class codes)
iris_obj.target
# Names of the target classes
iris_obj.target_names
`load_iris()` loads in an object containing the iris dataset, which I stored in `iris_obj`. I now turn this into a `DataFrame`.
# Feature columns; the default index already runs from 0 to n - 1
iris = DataFrame(iris_obj.data, columns=iris_obj.feature_names)
# Add the target as a "species" column (integer codes 0, 1, 2)
iris["species"] = iris_obj.target
iris # prints iris data

# Replace the integer codes with the species names.
# Assigning the result back avoids relying on inplace=True on a column accessor.
iris["species"] = iris["species"].replace({0: 'setosa', 1: 'versicolor', 2: 'virginica'})
iris # prints labeled data
For this particular dataset, the grouping by species suggests that descriptive statistics should be done on groups. We create the groups like so.
iris_grps = iris.groupby("species")

# Print the four measurement columns for each species
for name, data in iris_grps:
    print(name)
    print("---------------------\n\n")
    print(data.iloc[:, 0:4])
    print("\n\n\n")
A lot of the methods for getting summary statistics for a `DataFrame` also work for group objects.
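As a minimal sketch of that idea, the snippet below calls the same summary method on a whole column and on its groups. The table, its column names `group` and `value`, and the numbers are made up for illustration; they are not the iris data.

```python
import pandas as pd

# Toy data, invented for illustration (not the iris measurements)
toy = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "value": [1.0, 3.0, 2.0, 4.0, 6.0],
})

# Summary of the whole column
overall_mean = toy["value"].mean()

# The same method, applied once per group
toy_grps = toy.groupby("group")
group_means = toy_grps["value"].mean()
group_sizes = toy_grps.size()  # number of rows in each group

print(overall_mean)
print(group_means)
print(group_sizes)
```

The grouped call returns one value per group, indexed by the group label, rather than a single number.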
Getting the Basics
Let’s compute some basic statistics.
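The sample size is the number of rows in the dataset. `count()` returns the number of non-missing values in each column, which equals the number of rows here because the iris data has no missing values.
iris.count()
output: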
sepal length (cm) 150
sepal width (cm) 150
petal length (cm) 150
petal width (cm) 150
species 150
dtype: int64
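The iris data has no missing values, so every column reports 150. As a small illustration of how `count()` behaves when values are missing, here is a sketch on a made-up table (the columns `x` and `y` are assumptions for this example only).

```python
import pandas as pd

# Toy data with one missing value in column "x" (invented for illustration)
toy = pd.DataFrame({
    "x": [1.0, 2.0, float("nan"), 4.0],
    "y": [10.0, 20.0, 30.0, 40.0],
})

n_rows = len(toy)          # number of rows, missing values included
non_missing = toy.count()  # number of non-missing values per column

print(n_rows)
print(non_missing)
```

Here `len(toy)` counts all four rows, while `count()` reports one fewer for the column that contains the missing value.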
The sample mean is the arithmetic mean of the dataset.
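# numeric_only=True skips the text column "species" (required in recent pandas versions)
iris.mean(numeric_only=True) # Sample mean for every numeric column
output: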
sepal length (cm) 5.843333
sepal width (cm) 3.054000
petal length (cm) 3.758667
petal width (cm) 1.198667
dtype: float64
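The sample median is the “middle” data point after ordering the dataset. Let x(i) denote the ordered data (x(1) is the smallest, x(n) the largest); the median is x((n+1)/2) when n is odd and the average of x(n/2) and x(n/2+1) when n is even.
iris.median(numeric_only=True) # Sample median for every numeric column
output: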
sepal length (cm) 5.80
sepal width (cm) 3.0
petal length (cm) 4.35
petal width (cm) 1.30
dtype: float64
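To make the "middle of the ordered data" idea concrete, here is a minimal sketch that computes the median by hand and compares it with pandas. The helper name `median_by_hand` and the two short lists are assumptions made up for this example.

```python
import pandas as pd

def median_by_hand(values):
    """Median computed from the ordered data (hypothetical helper)."""
    x = sorted(values)
    n = len(x)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the single middle value
        return x[mid]
    # Even n: the average of the two middle values
    return (x[mid - 1] + x[mid]) / 2

odd = [5.1, 4.9, 6.3]         # invented values, n = 3
even = [5.1, 4.9, 6.3, 5.8]   # invented values, n = 4

print(median_by_hand(odd), pd.Series(odd).median())
print(median_by_hand(even), pd.Series(even).median())
```

For the odd-length list the median is one of the data points; for the even-length list it falls halfway between the two middle points.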
The sample variance is a measure of dispersion, roughly the “average” squared distance of a data point from the mean. The standard deviation is the square root of the variance and interpreted as the “average” distance a data point is from the mean.
iris.var(numeric_only=True) # Sample variance for every numeric column
output:
sepal length (cm) 0.685694
sepal width (cm) 0.188004
petal length (cm) 3.113179
petal width (cm) 0.582414
dtype: float64

iris.std(numeric_only=True) # Sample standard deviation for every numeric column
output:
sepal length (cm) 0.828066
sepal width (cm) 0.433594
petal length (cm) 1.764420
petal width (cm) 0.763161
dtype: float64
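The word “average” is in quotes above because the sample variance divides the sum of squared deviations by n - 1 rather than n. The sketch below works this out by hand on a short made-up list and checks it against pandas, whose `var()` and `std()` use the n - 1 denominator by default (`ddof=1`).

```python
import math
import pandas as pd

data = [2.0, 4.0, 4.0, 6.0]  # invented values for illustration
n = len(data)

mean = sum(data) / n
sum_sq = sum((x - mean) ** 2 for x in data)  # sum of squared deviations

sample_var = sum_sq / (n - 1)       # sample variance (divide by n - 1)
sample_std = math.sqrt(sample_var)  # sample standard deviation
population_var = sum_sq / n         # divide by n instead

s = pd.Series(data)
print(sample_var, s.var())            # pandas default matches the n - 1 version
print(sample_std, s.std())
print(population_var, s.var(ddof=0))  # ddof=0 switches to the n denominator
```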
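The 𝑝th percentile is a number such that roughly 𝑝% of the data is less than it. This number is also referred to as a quantile. The `quantile()` method takes the fraction 𝑝/100 and, by default, interpolates linearly between neighboring data points, so the result need not be a value that appears in the dataset.
iris.quantile(.1, numeric_only=True) # The 10th percentile
output: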
sepal length (cm) 4.8
sepal width (cm) 2.5
petal length (cm) 1.4
petal width (cm) 0.2
Name: 0.1, dtype: float64

iris.quantile(.95, numeric_only=True) # The 95th percentile
output:
sepal length (cm) 7.255
sepal width (cm) 3.800
petal length (cm) 6.100
petal width (cm) 2.300
Name: 0.95, dtype: float64

iris.quantile(.75, numeric_only=True) # Commonly known as the third quartile
output:
sepal length (cm) 6.4
sepal width (cm) 3.3
petal length (cm) 5.1
petal width (cm) 1.8
Name: 0.75, dtype: float64

iris.quantile(.25, numeric_only=True) # Commonly known as the first quartile
output:
sepal length (cm) 5.1
sepal width (cm) 2.8
petal length (cm) 1.6
petal width (cm) 0.3
Name: 0.25, dtype: float64
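Some of the values above, such as 7.255, do not appear in the data because pandas interpolates linearly between neighboring ordered points by default. The following sketch reproduces that rule by hand on a made-up list; the helper name `quantile_linear` is an assumption for this example.

```python
import math
import pandas as pd

def quantile_linear(values, q):
    """Quantile by linear interpolation between ordered points (hypothetical helper)."""
    x = sorted(values)
    pos = (len(x) - 1) * q   # fractional position in the ordered data (0-based)
    lo = math.floor(pos)
    hi = math.ceil(pos)
    frac = pos - lo
    return x[lo] + frac * (x[hi] - x[lo])

data = [1.0, 2.0, 3.0, 4.0, 10.0]  # invented values for illustration
s = pd.Series(data)

print(quantile_linear(data, 0.9), s.quantile(0.9))
# Several quantiles can be requested at once by passing a list
print(s.quantile([0.25, 0.5, 0.75]))
```

With five points, the 90th percentile sits 60% of the way from the fourth ordered value to the fifth, which is why it lands between them.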
If 𝑄𝑖 denotes the 𝑖th quartile, the interquartile range (IQR) is 𝑄3 − 𝑄1, the difference between the third quartile and the first quartile.
# There is no built-in method for computing the IQR, but it is nevertheless easy to obtain
iris.quantile(.75, numeric_only=True) - iris.quantile(.25, numeric_only=True)
output:
sepal length (cm) 1.3
sepal width (cm) 0.5
petal length (cm) 3.5
petal width (cm) 1.5
dtype: float64
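A possible variation, sketched here on a made-up table (the columns `a` and `b` are assumptions for this example), is to request both quartiles in a single `quantile()` call and subtract the resulting rows.

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# One call returns a DataFrame with one row per requested quantile
quartiles = toy.quantile([0.25, 0.75])

# IQR per column: third-quartile row minus first-quartile row
iqr = quartiles.loc[0.75] - quartiles.loc[0.25]

print(quartiles)
print(iqr)
```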
Other interesting quantities include the maximum and minimum values.
iris.max()
output:
sepal length (cm) 7.9
sepal width (cm) 4.4
petal length (cm) 6.9
petal width (cm) 2.5
species virginica
dtype: object

iris.min()
output:
sepal length (cm) 4.3
sepal width (cm) 2
petal length (cm) 1
petal width (cm) 0.1
species setosa
dtype: object
Many of these summaries work for grouped data as well.
iris_grps.mean()
output:
species | sepal length | sepal width | petal length | petal width
setosa | 5.006 | 3.418 | 1.464 | 0.244
versicolor | 5.936 | 2.770 | 4.260 | 1.326
virginica | 6.588 | 2.974 | 5.552 | 2.026

iris_grps.quantile(.75) # Third quartile for each species
iris_grps.quantile(.75) - iris_grps.quantile(.25) # IQR for each species
Other Useful Methods
The method `describe()` gets a number of useful summaries for a dataset.
iris.describe()
# This also works well for grouped data.
iris_grps.describe()
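`describe()` also accepts a `percentiles` argument for choosing which percentiles are reported. The sketch below tries it on a small made-up column (the name `value` and the numbers are assumptions for this example).

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({"value": [1.0, 2.0, 3.0, 4.0, 5.0]})

# Ask for the 10th and 90th percentiles in addition to the usual summaries
summary = toy.describe(percentiles=[0.1, 0.9])

print(summary)
```

The result is itself a `DataFrame`, so individual entries can be looked up by row label, for example `summary.loc["mean", "value"]`.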
If we want custom numerical summaries, we can write functions to compute them for Pandas `Series`, then apply them to the columns of a `DataFrame`.
I demonstrate by writing a function that computes the range, which is the difference between the maximum and minimum of a dataset.
# Compute the range of a dataset
def range_stat(s):
    return s.max() - s.min()

# Apply it to the four numeric columns
iris.iloc[:, 0:4].apply(range_stat)
output:
sepal length (cm) 3.6
sepal width (cm) 2.4
petal length (cm) 5.9
petal width (cm) 2.4
dtype: float64

# Use aggregate() for groups
iris_grps.aggregate(range_stat)
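`aggregate()` (also available under the shorter name `agg()`) accepts a list as well, so a custom function can sit next to built-in summaries. Here is a minimal sketch on a made-up table; the column names `group` and `value` are assumptions, and `range_stat` is redefined so the snippet runs on its own.

```python
import pandas as pd

def range_stat(s):
    """Range of a Series: maximum minus minimum."""
    return s.max() - s.min()

# Toy data, invented for illustration (not the iris measurements)
toy = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "value": [1.0, 3.0, 2.0, 4.0, 6.0],
})

# Built-in summaries are named by string; the custom function is passed directly
summary = toy.groupby("group")["value"].agg(["min", "max", range_stat])

print(summary)
```

Each entry in the list becomes one column of the result, with the custom column named after the function.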
Thanks for reading. If this article was helpful, please subscribe to this Medium blog.