# Descriptive Statistics with Pandas on Iris Data (Beginner)

Often in data analysis projects we begin with descriptive statistics to get a sense of a dataset’s properties. Fortunately it is easy to get these statistics from Pandas `DataFrame`s.

I illustrate by computing various descriptive statistics for the classic [iris dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set).

The following code loads the packages we will need, along with the `iris` dataset.

```python
import pandas as pd
from pandas import DataFrame
from sklearn.datasets import load_iris   # sklearn.datasets includes common example datasets

# A function to load in the iris dataset
iris_obj = load_iris()

# Dataset preview
iris_obj.data
# Names of the columns
iris_obj.feature_names
# Target variable
iris_obj.target
# Target names (the labels of the target variable)
iris_obj.target_names
```

`load_iris()` loads in an object containing the iris dataset, which I stored in `iris_obj`. I now turn this into a `DataFrame`.

```python
# Both frames get the default 0..n-1 row index, so they join row by row
iris = DataFrame(iris_obj.data, columns=iris_obj.feature_names).join(
    DataFrame(iris_obj.target, columns=["species"])
)
iris  # prints iris data

# Assign the result back instead of calling replace(..., inplace=True) on a column
iris["species"] = iris["species"].replace({0: 'setosa', 1: 'versicolor', 2: 'virginica'})
iris  # prints labeled data
```
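As an aside, the integer codes can also be turned into class names at the moment the column is built. The sketch below is only an illustration: `target` and `target_names` are small hypothetical arrays standing in for `iris_obj.target` and `iris_obj.target_names`, and `pd.Categorical.from_codes` is a different technique from the `replace()` call used above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for iris_obj.target and iris_obj.target_names
target = np.array([0, 0, 1, 2, 1])
target_names = np.array(["setosa", "versicolor", "virginica"])

# Each integer code is looked up in the list of categories
species = pd.Series(
    pd.Categorical.from_codes(target, categories=list(target_names)),
    name="species",
)
print(species.tolist())
```

A categorical column stores the labels once and the codes per row, which can be convenient when the column is used as a grouping key.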

For this particular dataset, the grouping by species suggests that descriptive statistics should be done on groups. We create the groups like so.

```python
iris_grps = iris.groupby("species")

for name, data in iris_grps:
    print(name)
    print("---------------------\n\n")
    print(data.iloc[:, 0:4])
    print("\n\n\n")
```

A lot of the methods for getting summary statistics for a `DataFrame` also work for group objects.
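To make that concrete, here is a minimal sketch on a made-up table (the `toy` frame and its column names are hypothetical, not part of the iris data). The same `mean()` call gives one number for a whole column and one number per group for a group object.

```python
import pandas as pd

# Hypothetical toy data: two groups, one numeric column
toy = pd.DataFrame({
    "species": ["a", "a", "b", "b", "b"],
    "petal_length": [1.0, 3.0, 4.0, 5.0, 6.0],
})
toy_grps = toy.groupby("species")

overall_mean = toy["petal_length"].mean()       # one value for the whole column
group_means = toy_grps["petal_length"].mean()   # one value per group
group_sizes = toy_grps.size()                   # number of rows in each group

print(overall_mean)
print(group_means)
print(group_sizes)
```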

# Getting the Basics

Let’s compute some basic statistics.

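The sample size is the number of rows in the dataset. `count()` reports it column by column, counting only non-missing values; the iris data has no missing values, so every column shows 150.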

```python
iris.count()
```

output:

```
sepal length (cm)    150
sepal width (cm)     150
petal length (cm)    150
petal width (cm)     150
species              150
dtype: int64
```
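Because `count()` skips missing values, it can differ from the number of rows. A small sketch on a hypothetical frame with one missing entry shows the difference (the `toy` frame is made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data: column "x" has one missing value
toy = pd.DataFrame({"x": [1.0, 2.0, np.nan, 4.0], "y": [1.0, 2.0, 3.0, 4.0]})

n_rows = len(toy)          # number of rows, missing or not
non_missing = toy.count()  # non-missing values in each column

print(n_rows)
print(non_missing)
```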

The sample mean is the arithmetic mean of the dataset.

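```python
# numeric_only=True skips the string column "species"
# (pandas 2.0+ raises an error on it otherwise)
iris.mean(numeric_only=True)    # Sample mean for every numeric column
```

output:

```
sepal length (cm)    5.843333
sepal width (cm)     3.054000
petal length (cm)    3.758667
petal width (cm)     1.198667
dtype: float64
```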

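The sample median is the “middle” data point after ordering the dataset. Let x(i) denote the ordered data, so that x(1) is the smallest value and x(n) the largest. When n is odd, the median is x((n+1)/2); when n is even, it is the average of x(n/2) and x(n/2+1).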

```python
# numeric_only=True skips the string column "species"
iris.median(numeric_only=True)    # Sample median for every numeric column
```

output:

```
sepal length (cm)    5.80
sepal width (cm)     3.0
petal length (cm)    4.35
petal width (cm)     1.30
dtype: float64
```
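To see what "middle of the ordered data" means in practice, here is a sketch that computes the median by hand and compares it with `median()`. The helper `median_by_hand` and the two small series are hypothetical, and the helper assumes there are no missing values.

```python
import pandas as pd

def median_by_hand(s):
    """Median of a Series from its sorted values (assumes no missing data)."""
    x = s.sort_values().to_list()
    n = len(x)
    mid = n // 2
    if n % 2 == 1:
        return x[mid]                  # odd n: the single middle value
    return (x[mid - 1] + x[mid]) / 2   # even n: average of the two middle values

odd = pd.Series([5.0, 1.0, 3.0])
even = pd.Series([4.0, 1.0, 3.0, 2.0])

print(median_by_hand(odd), odd.median())
print(median_by_hand(even), even.median())
```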

The sample variance is a measure of dispersion, roughly the “average” squared distance of a data point from the mean. The standard deviation is the square root of the variance and interpreted as the “average” distance a data point is from the mean.

```python
# numeric_only=True skips the string column "species"
iris.var(numeric_only=True)    # Sample variance for every numeric column
```

output:

```
sepal length (cm)    0.685694
sepal width (cm)     0.188004
petal length (cm)    3.113179
petal width (cm)     0.582414
dtype: float64
```

```python
iris.std(numeric_only=True)    # Sample standard deviation for every numeric column
```

output:

```
sepal length (cm)    0.828066
sepal width (cm)     0.433594
petal length (cm)    1.764420
petal width (cm)     0.763161
dtype: float64
```
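The word "roughly" matters here: by default, pandas divides the sum of squared deviations by n - 1 rather than n (the `ddof=1` default of `var()` and `std()`). The following sketch, on a made-up series, reproduces the calculation by hand and shows how `ddof=0` switches to dividing by n.

```python
import pandas as pd

# Hypothetical data chosen so the arithmetic is easy to follow
s = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

n = len(s)
mean = s.sum() / n
squared_dev = (s - mean) ** 2

sample_var = squared_dev.sum() / (n - 1)   # what s.var() computes by default
population_var = squared_dev.sum() / n     # what s.var(ddof=0) computes

print(sample_var, s.var())
print(population_var, s.var(ddof=0))
print(sample_var ** 0.5, s.std())
```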

The 𝑝th percentile is the value such that roughly 𝑝% of the data falls below it. Such values are also referred to as quantiles, and `quantile()` takes the fraction 𝑝/100 as its argument.

```python
# numeric_only=True skips the string column "species"
iris.quantile(.1, numeric_only=True)   # The 10th percentile
```

output:

```
sepal length (cm)    4.8
sepal width (cm)     2.5
petal length (cm)    1.4
petal width (cm)     0.2
Name: 0.1, dtype: float64
```

```python
iris.quantile(.95, numeric_only=True)    # The 95th percentile
```

output:

```
sepal length (cm)    7.255
sepal width (cm)     3.800
petal length (cm)    6.100
petal width (cm)     2.300
Name: 0.95, dtype: float64
```

```python
iris.quantile(.75, numeric_only=True)    # Commonly known as the third quartile
```

output:

```
sepal length (cm)    6.4
sepal width (cm)     3.3
petal length (cm)    5.1
petal width (cm)     1.8
Name: 0.75, dtype: float64
```

```python
iris.quantile(.25, numeric_only=True)    # Commonly known as the first quartile
```

output:

```
sepal length (cm)    5.1
sepal width (cm)     2.8
petal length (cm)    1.6
petal width (cm)     0.3
Name: 0.25, dtype: float64
```
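Two details of `quantile()` are worth a quick sketch on made-up data: it accepts a list of fractions, and by default it interpolates linearly between neighbouring data points, so the result need not be a value that occurs in the data. The series below are hypothetical.

```python
import pandas as pd

five = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
four = pd.Series([1.0, 2.0, 3.0, 4.0])

# Several quantiles at once: the result is indexed by the requested fractions
quartiles = five.quantile([0.25, 0.5, 0.75])

# Default linear interpolation lands halfway between 2 and 3
median_linear = four.quantile(0.5)
# interpolation="lower" returns the lower of the two neighbouring data points
median_lower = four.quantile(0.5, interpolation="lower")

print(quartiles)
print(median_linear, median_lower)
```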

If 𝑄𝑖 denotes the 𝑖th quartile, the interquartile range (IQR) is 𝑄3 − 𝑄1, the difference between the third quartile and the first quartile.

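```python
# There is no function for computing the IQR but it is nevertheless easy to obtain
# (numeric_only=True skips the string column "species")
iris.quantile(.75, numeric_only=True) - iris.quantile(.25, numeric_only=True)
```

output:

```
sepal length (cm)    1.3
sepal width (cm)     0.5
petal length (cm)    3.5
petal width (cm)     1.5
dtype: float64
```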

Other interesting quantities include the maximum and minimum values.

```python
iris.max()
```

output:

```
sepal length (cm)          7.9
sepal width (cm)           4.4
petal length (cm)          6.9
petal width (cm)           2.5
species              virginica
dtype: object
```

```python
iris.min()
```

output:

```
sepal length (cm)       4.3
sepal width (cm)          2
petal length (cm)         1
petal width (cm)        0.1
species              setosa
dtype: object
```

Many of these summaries work for grouped data as well.

```python
iris_grps.mean()
```

output:

```
species      sepal length  sepal width  petal length  petal width
setosa              5.006        3.418         1.464        0.244
versicolor          5.936        2.770         4.260        1.326
virginica           6.588        2.974         5.552        2.026
```

```python
iris_grps.quantile(.75)
iris_grps.quantile(.75) - iris_grps.quantile(.25)
```

# Other Useful Methods

The method `describe()` gets a number of useful summaries for a dataset.

```python
iris.describe()

# This also works well for grouped data.
iris_grps.describe()
```
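`describe()` also takes a `percentiles` argument for choosing which percentiles to report. The sketch below uses a hypothetical five-point series; the 50th percentile is requested explicitly so the set of rows does not depend on defaults.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Report the 10th, 50th, and 90th percentiles alongside the usual summaries
summary = s.describe(percentiles=[0.1, 0.5, 0.9])
print(summary)
```

The result is itself a `Series`, so individual entries can be pulled out by label, for example `summary["90%"]`.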

If we want custom numerical summaries, we can write functions to compute them for Pandas `Series` then apply them to the columns of a `DataFrame`.

I demonstrate by writing a function that computes the range, which is the difference between the maximum and minimum of a dataset.

```python
# Compute the range of a dataset
def range_stat(s):
    return s.max() - s.min()

iris.iloc[:, 0:4].apply(range_stat)
```

output:

```
sepal length (cm)    3.6
sepal width (cm)     2.4
petal length (cm)    5.9
petal width (cm)     2.4
dtype: float64
```

```python
# Use aggregate() for groups
iris_grps.aggregate(range_stat)
```
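Several summaries, built-in and custom, can be collected in one call using named aggregation, where each keyword becomes an output column. This is a sketch under assumptions: the `toy` frame, the `iqr` helper, and the output column names are hypothetical, and `range_stat` is redefined so the block runs on its own.

```python
import pandas as pd

def range_stat(s):
    """Difference between the largest and smallest value."""
    return s.max() - s.min()

def iqr(s):
    """Interquartile range: third quartile minus first quartile."""
    return s.quantile(0.75) - s.quantile(0.25)

# Hypothetical toy data with two groups
toy = pd.DataFrame({
    "species": ["a", "a", "b", "b", "b"],
    "x": [1.0, 3.0, 4.0, 5.0, 8.0],
})

# Each keyword is (input column, function or function name)
summary = toy.groupby("species").agg(
    x_mean=("x", "mean"),
    x_range=("x", range_stat),
    x_iqr=("x", iqr),
)
print(summary)
```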