# Part 9: AGGREGATION AND GROUPING

**Aggregation:** We will study aggregations like `sum()`, `mean()`, `median()`, `min()`, and `max()`, in which a single number gives insight into the nature of a potentially large dataset.

**Grouping:** When dealing with large datasets, it is useful to group data by common categories (the values of a particular column). To accomplish this, we use the `groupby()` method of a pandas DataFrame.

🛳 Titanic Ship Data for Demonstration

• We will use the Titanic survivors data for our analysis
```python
import numpy as np
import pandas as pd
```
```python
titanic = pd.read_csv('data/titanic.csv')

# fetching column names using the .keys() method
print(titanic.keys())
```
```
Index(['survived', 'pclass', 'sex', 'age', 'fare', 'embarked', 'who',
       'embark_town', 'alive', 'alone'],
      dtype='object')
```
```python
print(titanic.head())  # show the first five rows
```
```
   survived  pclass     sex   age     fare embarked    who  embark_town alive  \
0         0       3    male  22.0   7.2500        S    man  Southampton    no
1         1       1  female  38.0  71.2833        C  woman    Cherbourg   yes
2         1       3  female  26.0   7.9250        S  woman  Southampton   yes
3         1       1  female  35.0  53.1000        S  woman  Southampton   yes
4         0       3    male  35.0   8.0500        S    man  Southampton    no

   alone
0  False
1  False
2   True
3  False
4   True
```


## 1. SIMPLE AGGREGATION

**Features of a Dataset:** The columns of a DataFrame are referred to as the features of a dataset. These features can be:

• Quantitative feature: values represented by numbers. We apply numerical aggregates (sum, mean, std) to them
• Categorical feature: values are categories that can be used to group the dataset. `GroupBy` is applied on categorical features
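One way to separate the two kinds of feature programmatically is by dtype. The sketch below uses a small made-up frame (not the Titanic data) and `select_dtypes`, treating string (`object`) columns as categorical:

```python
import pandas as pd

# a small illustrative frame (made-up values, not the Titanic data)
df = pd.DataFrame({'age': [22.0, 38.0, 26.0],
                   'fare': [7.25, 71.28, 7.93],
                   'sex': ['male', 'female', 'female']})

# quantitative features: columns with a numeric dtype
quantitative = df.select_dtypes(include='number').columns.tolist()
# categorical features: string-valued (object) columns in this frame
categorical = df.select_dtypes(include='object').columns.tolist()

print(quantitative)  # ['age', 'fare']
print(categorical)   # ['sex']
```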

### 1.1. describe()

Computes several common aggregates for each numerical column (integers and floats); i.e., `.describe()` applies to quantitative features:

```python
print(titanic.describe())
```
```
         survived      pclass         age        fare
count  891.000000  891.000000  714.000000  891.000000
mean     0.383838    2.308642   29.699118   32.204208
std      0.486592    0.836071   14.526497   49.693429
min      0.000000    1.000000    0.420000    0.000000
25%      0.000000    2.000000   20.125000    7.910400
50%      0.000000    3.000000   28.000000   14.454200
75%      1.000000    3.000000   38.000000   31.000000
max      1.000000    3.000000   80.000000  512.329200
```
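`describe()` can also summarize categorical columns if we ask it to. A minimal sketch on a tiny stand-in column (made-up values): `include='object'` reports the count, the number of unique values, the most frequent value, and its frequency.

```python
import pandas as pd

# a tiny stand-in for a categorical column (not the Titanic data)
df = pd.DataFrame({'sex': ['male', 'female', 'female', 'male', 'male']})

# include='object' summarizes string columns:
# count, unique, top (most frequent value), freq
desc = df.describe(include='object')
print(desc)
```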

### 1.2. Fetching Unique Values

For categorical features, we can use the `.unique()` method to obtain the unique instances of each categorical feature:

```python
# fetching unique values under column 'embark_town'
titanic['embark_town'].unique()
```
```
array(['Southampton', 'Cherbourg', 'Queenstown', nan], dtype=object)
```

### 1.3. Frequency of Unique Values

For categorical features, we can use the `.value_counts()` method to obtain the frequency count of each category in a column:

```python
# fetching all unique instances of column 'embark_town'
# along with the frequency count
titanic['embark_town'].value_counts()
```
```
Southampton    644
Cherbourg      168
Queenstown      77
Name: embark_town, dtype: int64
```

We can use the `normalize=True` argument to get the frequency counts as proportions:

```python
# the sum of all values is equal to 1
titanic['embark_town'].value_counts(normalize=True)
```
```
Southampton    0.724409
Cherbourg      0.188976
Queenstown     0.086614
Name: embark_town, dtype: float64
```
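Note that `.unique()` above reported a `nan` while `.value_counts()` did not list it: by default, missing values are excluded from the counts. A small sketch on a made-up Series showing the `dropna=False` argument, which counts them too:

```python
import numpy as np
import pandas as pd

# a small Series with a missing value, mirroring 'embark_town' above
towns = pd.Series(['Southampton', 'Cherbourg', np.nan, 'Southampton'])

# by default value_counts() ignores NaN; dropna=False counts it as well
counts = towns.value_counts(dropna=False)
print(counts)
```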

## 2. GROUPBY: SPLIT, APPLY, COMBINE

`GroupBy`, or split-apply-combine, is one of the most common and useful strategies in data analysis:

• The split step involves breaking up and grouping a DataFrame depending on the values of the specified column
• The apply step involves computing some function within the individual groups, usually one of:
  • Aggregation
  • Transformation
  • Filtering
• The combine step merges the results of these operations into an output Series or DataFrame
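The three steps above can be spelled out by hand on a tiny made-up frame, to show what `groupby` does internally (the names `pieces`, `sums`, and `result` are ours, for illustration):

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'A', 'B'],
                   'val': [1, 2, 3, 4]})

# split: one sub-frame per distinct key
pieces = {k: g for k, g in df.groupby('key')}
# apply: compute an aggregate on each piece
sums = {k: g['val'].sum() for k, g in pieces.items()}
# combine: merge the per-group results into a Series
result = pd.Series(sums)
print(result)
```

This is equivalent to the one-liner `df.groupby('key')['val'].sum()`, which pandas executes far more efficiently.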

### 2.1. Grouping by Single Column

#### a. Split

Split (creating a `DataFrameGroupBy` object): suppose we would like to group by the column `sex`. Doing this creates a GroupBy object, which stores the data of the individual groups as key-value pairs; we can fetch them through the `.groups` attribute of this object.

```python
# grouping by column 'sex'
group_by_sex = titanic.groupby('sex')

# the output is a DataFrameGroupBy object
group_by_sex
```
```
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f9778b203d0>
```
```python
# getting all groups as key-value pairs
group_by_sex.groups
```
```
{'female': Int64Index([  1,   2,   3,   8,   9,  10,  11,  14,  15,  18,
             ...
             866, 871, 874, 875, 879, 880, 882, 885, 887, 888],
            dtype='int64', length=314),
 'male': Int64Index([  0,   4,   5,   6,   7,  12,  13,  16,  17,  20,
             ...
             873, 876, 877, 878, 881, 883, 884, 886, 889, 890],
            dtype='int64', length=577)}
```

#### b. Apply-Combine

Suppose we are interested in the `.mean()` of all column values once the data is grouped by `sex`. This step applies the mean within each group and then combines the results into a Series (for a single selected column) or a DataFrame (for multiple columns).

```python
# numeric_only=True restricts the mean to numeric columns
titanic.groupby('sex').mean(numeric_only=True)
```
```
        survived    pclass        age       fare     alone
sex
female  0.742038  2.159236  27.915709  44.479818  0.401274
male    0.188908  2.389948  30.726645  25.523893  0.712305
```

➞ Suppose further that, instead of the mean of all columns, we are only interested in the mean of the values under the columns `survived` and `age`:

```python
# note the double brackets: we pass a list of column names
print(titanic.groupby('sex')[['survived', 'age']].mean())
```
```
        survived        age
sex
female  0.742038  27.915709
male    0.188908  30.726645
```

➞ In addition, we can fetch a specific group using `get_group('group_name_here')` and apply an aggregator to that object:

```python
group_by_sex.get_group('male').mean(numeric_only=True)
```
```
survived     0.188908
pclass       2.389948
age         30.726645
fare        25.523893
alone        0.712305
dtype: float64
```

### 2.2. Grouping by Two Columns

We are not limited to grouping by a single column. In this example, we group the DataFrame by the columns `sex` and `alive` and then apply the `mean` aggregator. Further, we only want the output for the `age` and `fare` columns:

```python
print(titanic.groupby(['sex', 'alive'])[['age', 'fare']].mean())
```
```
                    age       fare
sex    alive
female no     25.046875  23.024385
       yes    28.847716  51.938573
male   no     31.618056  21.960993
       yes    27.276022  40.821484
```
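A result indexed by two columns like this can be pivoted into a two-dimensional table with `unstack()`, which moves the inner index level into the columns. A sketch on a tiny made-up frame (the numbers are illustrative, not the Titanic values):

```python
import pandas as pd

# a tiny frame shaped like the grouped result above (made-up numbers)
df = pd.DataFrame({'sex': ['female', 'female', 'male', 'male'],
                   'alive': ['no', 'yes', 'no', 'yes'],
                   'age': [25.0, 28.0, 31.0, 27.0]})

# unstack() pivots the inner index level ('alive') into columns
table = df.groupby(['sex', 'alive'])['age'].mean().unstack()
print(table)
```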

### 2.3. Iteration over groups

We discussed earlier that grouping creates a GroupBy object, which stores the groups as dictionary-style key-value pairs. To demonstrate this, we can iterate over the `GroupBy` object with a `for` loop:

```python
# let's get the shape of each group when we group by 'pclass'
for (name, group) in titanic.groupby('pclass'):
    print(f"{name}: shape={group.shape}")
```
```
1: shape=(216, 10)
2: shape=(184, 10)
3: shape=(491, 10)
```

## 3. AGGREGATE, FILTER, TRANSFORM, APPLY

GroupBy objects have `aggregate()`, `filter()`, `transform()`, and `apply()` methods that efficiently implement a variety of useful operations before combining the grouped data

The first step is to construct a DataFrame for this discussion:

```python
rand = np.random.RandomState(42)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': rand.randint(1, 10, size=6),
                   'data2': rand.randint(1, 20, size=6)})

print(df)
```
```
  key  data1  data2
0   A      7     11
1   B      4     11
2   C      8      4
3   A      5      8
4   B      7      3
5   C      3      2
```

### 3.1. Aggregation

In the examples above we computed a single aggregate (the mean); using the `aggregate()` method, we can compute multiple aggregates in a single command. The `aggregate()` method takes a string, a function, or a list of all the required aggregates to compute.

```python
# passing the list of aggregators we need for data1 and data2
df.groupby('key').aggregate(['min', 'max', 'median', 'mean', 'std'])
```
```
    data1                          data2
      min max median mean       std   min max median mean       std
key
A       5   7    6.0  6.0  1.414214     8  11    9.5  9.5  2.121320
B       4   7    5.5  5.5  2.121320     3  11    7.0  7.0  5.656854
C       3   8    5.5  5.5  3.535534     2   4    3.0  3.0  1.414214
```

Suppose we don't want to compute the same aggregator for all columns; rather, we want to compute the median for `data1` and the mean for `data2`:

```python
# we pass a dictionary mapping column name to aggregator
df.groupby('key').aggregate({'data1': 'median',
                             'data2': 'mean'})
```
```
     data1  data2
key
A      6.0    9.5
B      5.5    7.0
C      5.5    3.0
```
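When mixing aggregators per column like this, the output column names can get awkward. Pandas also supports *named aggregation*, where each output column is given an explicit name as `new_name=('input_column', 'aggregator')`. A sketch on the same data (values hard-coded here so the block is self-contained):

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': [7, 4, 8, 5, 7, 3],
                   'data2': [11, 11, 4, 8, 3, 2]})

# named aggregation: output_column=('input_column', 'aggregator')
out = df.groupby('key').agg(d1_median=('data1', 'median'),
                            d2_mean=('data2', 'mean'))
print(out)
```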

➞ Now that we grasp the concept of the `aggregate()` method, let's apply it to our Titanic data:

```python
titanic.groupby('sex')[['age', 'fare']].aggregate(['min', 'max', 'mean'])
```
```
         age                    fare
          min   max       mean   min       max       mean
sex
female  0.75  63.0  27.915709  6.75  512.3292  44.479818
male    0.42  80.0  30.726645  0.00  512.3292  25.523893
```

### 3.2. Filtering

A filtering operation allows us to drop data based on some group property. The filter is applied in the form of a function that returns a boolean for each group.

Let's apply `filter` to keep only the groups where the standard deviation of `data1` is greater than 2; rows belonging to any other group are dropped:

```python
df.groupby('key').filter(lambda x: x['data1'].std() > 2)
```
```
  key  data1  data2
1   B      4     11
2   C      8      4
4   B      7      3
5   C      3      2
```

Let's find the standard deviation for each key, so we can confirm that key `A` was dropped because the standard deviation of its `data1` values is less than 2:

```python
print(df.groupby('key').std())
```
```
        data1     data2
key
A    1.414214  2.121320
B    2.121320  5.656854
C    3.535534  1.414214
```
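The same filtering can be reproduced with a boolean mask built via `transform`, which broadcasts each group's statistic back to its rows. A sketch under that approach (values hard-coded from the frame above so the block stands alone):

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': [7, 4, 8, 5, 7, 3]})

# transform broadcasts each group's std back to its rows,
# so the comparison yields a per-row boolean mask
mask = df.groupby('key')['data1'].transform('std') > 2
filtered = df[mask]
print(filtered)
```

The mask form is handy when the keep/drop decision must be combined with other row-level conditions.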

### 3.3. Transformation

In the examples above, applying `aggregate` left us with a reduced version of the data (for instance, 3 rows instead of 6). A `transform`, by contrast, returns a transformed version of the full data: the output keeps the same shape as the input.

Suppose we would like to replace each value with the sum of its group's values for `data1` and `data2`:

```python
df_new_col = df.groupby('key').transform('sum')
print(df_new_col)
```
```
   data1  data2
0     12     19
1     11     14
2     11      6
3     12     19
4     11     14
5     11      6
```
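`transform` also accepts an arbitrary function. A common use is centering: subtracting each group's mean from its members. A sketch on a small made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'A', 'B'],
                   'val': [1.0, 2.0, 3.0, 6.0]})

# subtract each group's mean from its members;
# the output has the same length and order as the input
centered = df.groupby('key')['val'].transform(lambda x: x - x.mean())
print(centered)
```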

### 3.4. Apply

The `apply()` method applies an arbitrary function to each group. In the example below, we add another column, `data3`, which is the sum of the corresponding row values of columns `data1` and `data2`:

```python
def apply_func(x):
    x['data3'] = x['data1'] + x['data2']
    return x

print(df.groupby('key').apply(apply_func))
```
```
  key  data1  data2  data3
0   A      7     11     18
1   B      4     11     15
2   C      8      4     12
3   A      5      8     13
4   B      7      3     10
5   C      3      2      5
```
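`apply` is flexible: when the function returns a scalar instead of a DataFrame, the result is reduced to one value per group, much like an aggregation. A sketch with a made-up per-group statistic (the column selection before `apply` keeps the grouping column out of the function):

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'A', 'B'],
                   'data1': [7, 4, 5, 7],
                   'data2': [11, 11, 8, 3]})

# when the function returns a scalar, apply behaves like an aggregation:
# one value per group
spans = df.groupby('key')[['data1', 'data2']].apply(
    lambda g: g['data1'].max() - g['data2'].min())
print(spans)
```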