Understand the Pandas Groupby function in the easiest way

Sairam Penjarla
Oct 9, 2022


For those who are familiar with OOP concepts or are looking for the overall workflow: pandas splits the DataFrame into groups, one per unique value in the specified column, and holds these smaller DataFrames as key-value pairs in a single object. The function you apply runs on each group, and the results are combined into one DataFrame.

Read the above line once again after you go through this whole article.

Groupby works by doing three operations: split, apply, and combine. Let’s dive into what happens in each operation, and how.

# Split

Let us consider a dataframe.

import pandas as pd
df = pd.read_csv("/Users/sairampinjarla/Desktop/sample.csv")
print(df)
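The CSV file itself is not included with this article. If you want to follow along, here is a minimal stand-in, assuming the file has the columns S.no, City, Humidity, Temp and Moon (the column names used later in this article). The column order and every value below are made up for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for sample.csv: the column names follow the article,
# but the column order and all values are invented for illustration.
df = pd.DataFrame({
    "S.no": [1, 2, 3, 4, 5, 6],
    "City": ["delhi", "hyderabad", "chennai", "delhi", "hyderabad", "chennai"],
    "Humidity": [40, 55, 70, 45, 60, 75],
    "Temp": [30, 33, 35, 28, 31, 36],
    "Moon": [1, 2, 3, 4, 5, 6],
})
print(df)
```

With this stand-in, each city has two rows, which is enough to see max() pick one value per column.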

Let’s take the unique values in the City column:

uniq_vals = df['City'].unique()
print(uniq_vals)
# ['delhi' 'hyderabad' 'chennai']

Now, let’s filter the dataframe by each of these values and store the results as key-value pairs:

groups = {}
for i in uniq_vals:
    # Keep only the rows for this city. We drop the City column because it
    # is the grouping key, so it adds nothing to the temporary dataframe.
    # drop() returns a new dataframe, which avoids modifying a slice of df.
    temp_df = df[df["City"] == i].drop(columns=["City"])
    groups[i] = temp_df

# Apply

Let’s take any one dataframe from the groups:

city = uniq_vals[1]
print(city)
# hyderabad

and apply the max() function

hyd_df = groups[city]
print(hyd_df)
max_values = hyd_df.max()
print(max_values)

Let’s create a dictionary out of this

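# .iloc selects by position; plain max_values[0] is deprecated on a
# Series whose index holds column names.
temp_dict = {
    "S.no": max_values.iloc[0],
    "Humidity": max_values.iloc[1],
    "Temp": max_values.iloc[2],
    "Moon": max_values.iloc[3],
}
df_after_apply = pd.DataFrame(temp_dict, index=[city])
print(df_after_apply)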
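The dictionary step can also be skipped. As a sketch, the Series returned by max() can be turned into a one-row dataframe with to_frame() followed by a transpose. The small dataframe below is a made-up stand-in for the hyderabad group, not the article’s data.

```python
import pandas as pd

# Hypothetical stand-in for groups["hyderabad"]; all values are invented.
hyd_df = pd.DataFrame({
    "S.no": [2, 5],
    "Humidity": [55, 60],
    "Temp": [33, 31],
    "Moon": [2, 5],
})
city = "hyderabad"

# max() returns a Series indexed by column name.
max_values = hyd_df.max()

# to_frame() makes it a single column named after the city;
# .T flips that column into a single row labelled by the city.
df_after_apply = max_values.to_frame(name=city).T
print(df_after_apply)
```

This keeps the column names automatically, so nothing breaks if a column is added or reordered.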

Let me write the whole process in a single cell

city = uniq_vals[1]
hyd_df = groups[city]
max_values = hyd_df.max()
# Pick each maximum by position (.iloc) and label it with its column name.
temp_dict = {
    "S.no": max_values.iloc[0],
    "Humidity": max_values.iloc[1],
    "Temp": max_values.iloc[2],
    "Moon": max_values.iloc[3],
}
df_after_apply = pd.DataFrame(temp_dict, index=[city])
print(df_after_apply)

Let’s repeat the same for all cities

for city in uniq_vals:
    city_df = groups[city]
    max_values = city_df.max()
    # Pick each maximum by position (.iloc) and label it with its column name.
    temp_dict = {
        "S.no": max_values.iloc[0],
        "Humidity": max_values.iloc[1],
        "Temp": max_values.iloc[2],
        "Moon": max_values.iloc[3],
    }
    df_after_apply = pd.DataFrame(temp_dict, index=[city])
    print(df_after_apply)

# Combine

Let us combine these three dataframes into a single dataframe:

rows = []  # collects one single-row dataframe per city
for city in uniq_vals:
    city_df = groups[city]
    max_values = city_df.max()
    temp_dict = {
        "S.no": max_values.iloc[0],
        "Humidity": max_values.iloc[1],
        "Temp": max_values.iloc[2],
        "Moon": max_values.iloc[3],
    }
    rows.append(pd.DataFrame(temp_dict, index=[city]))
# Stack the rows, then name the index. rename_axis() returns a new
# dataframe, so its result has to be assigned.
final_df = pd.concat(rows).rename_axis("City")
print(final_df)

# Let’s use groupby to do all of this in one line

df.groupby('City').max()
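To check that the manual steps and groupby agree, here is a self-contained sketch. The dataframe is a made-up stand-in for sample.csv, and manual_groupby_max is a hypothetical helper name, not a pandas function. One difference to be aware of: groupby sorts the group keys by default, while unique() keeps the order of first appearance, so the sketch sorts the manual result before comparing.

```python
import pandas as pd

# Hypothetical stand-in for sample.csv; all values are invented.
df = pd.DataFrame({
    "S.no": [1, 2, 3, 4, 5, 6],
    "City": ["delhi", "hyderabad", "chennai", "delhi", "hyderabad", "chennai"],
    "Humidity": [40, 55, 70, 45, 60, 75],
    "Temp": [30, 33, 35, 28, 31, 36],
    "Moon": [1, 2, 3, 4, 5, 6],
})


def manual_groupby_max(frame, key):
    """Split `frame` by `key`, apply max() to each piece, combine the rows."""
    rows = []
    for value in frame[key].unique():
        # Split: rows for this key value, without the key column itself.
        piece = frame[frame[key] == value].drop(columns=[key])
        # Apply: column-wise max, reshaped into a one-row dataframe.
        rows.append(piece.max().to_frame(name=value).T)
    # Combine: stack the one-row dataframes and name the index.
    return pd.concat(rows).rename_axis(key)


manual = manual_groupby_max(df, "City").sort_index()
built_in = df.groupby("City").max()

print(manual)
print(built_in)
# Element-wise comparison of the two results.
print((manual == built_in).all().all())
```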

# Additional

As I wrote at the very beginning, groupby holds the filtered dataframes in an object as key-value pairs: each key is a unique value from the grouping column, and each value is the matching sub-dataframe.
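You can look at those key-value pairs directly. As a sketch on a made-up stand-in dataframe: iterating over the object returned by groupby yields (key, sub-dataframe) pairs, get_group() returns the sub-dataframe for one key, and the groups attribute maps each key to its row labels.

```python
import pandas as pd

# Hypothetical stand-in for sample.csv; all values are invented.
df = pd.DataFrame({
    "S.no": [1, 2, 3, 4, 5, 6],
    "City": ["delhi", "hyderabad", "chennai", "delhi", "hyderabad", "chennai"],
    "Humidity": [40, 55, 70, 45, 60, 75],
    "Temp": [30, 33, 35, 28, 31, 36],
    "Moon": [1, 2, 3, 4, 5, 6],
})

grouped = df.groupby("City")

# Iterating yields (key, sub-dataframe) pairs, much like dict.items().
pieces = {key: sub_df for key, sub_df in grouped}
print(list(pieces))

# get_group() looks up a single sub-dataframe by its key.
hyd = grouped.get_group("hyderabad")
print(hyd)

# .groups maps each key to the row labels that belong to it.
print(grouped.groups)
```

This mirrors the groups dictionary built by hand in the Split section, except that pandas does the bookkeeping for you.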
