Understand the Pandas groupby function the easiest way
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
For those who are familiar with OOP concepts or are looking for the actual workflow: conceptually, Pandas splits the DataFrame into one sub-DataFrame per unique value in the specified column, stores these sub-DataFrames as key-value pairs in a single object, applies your function to each of them, and combines the results into a single object.
Read the above line once again after you go through this whole article.
Groupby works by doing three operations: Split, Apply, Combine. Let’s dive into what happens in each operation and how.
# Split
Let us consider a dataframe.
import pandas as pd
df = pd.read_csv("/Users/sairampinjarla/Desktop/sample.csv")
print(df)
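The CSV file itself is not included here, so to follow along you can build a small stand-in dataframe in memory. This is only a sketch: the column names (S.no, City, Humidity, Temp, Moon) and the three city names are taken from the code in this article, but every value below is made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for sample.csv.
# Column names follow the article; the values are invented.
df = pd.DataFrame({
    "S.no": [1, 2, 3, 4, 5, 6],
    "City": ["delhi", "hyderabad", "chennai", "delhi", "hyderabad", "chennai"],
    "Humidity": [40, 55, 70, 45, 60, 75],
    "Temp": [30, 32, 34, 31, 33, 35],
    "Moon": [0.1, 0.5, 0.9, 0.2, 0.6, 1.0],
})
print(df)
```

Any frame with a City column and a few numeric columns would work equally well for the steps that follow.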
Let’s take the unique values in the city column
uniq_vals = df['City'].unique()
print(uniq_vals)
# ['delhi' 'hyderabad' 'chennai']
Now, let’s filter the dataframe once per unique value and store the resulting dataframes as key-value pairs
groups = {}
for i in uniq_vals:
    # We drop the City column because we are grouping by it, so it
    # makes no sense to keep it in the temporary dataframe.
    # drop() returns a new dataframe, so the original df is left untouched.
    temp_df = df[df["City"] == i].drop(columns=["City"])
    groups[i] = temp_df
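The same dictionary can also be built by iterating over a GroupBy object, which yields (key, sub-dataframe) pairs. The sketch below uses made-up data with the article's column names; `sort=False` is passed so the keys appear in first-seen order, like `unique()` does.

```python
import pandas as pd

# Hypothetical data; only the column names come from the article.
df = pd.DataFrame({
    "S.no": [1, 2, 3, 4, 5, 6],
    "City": ["delhi", "hyderabad", "chennai", "delhi", "hyderabad", "chennai"],
    "Humidity": [40, 55, 70, 45, 60, 75],
    "Temp": [30, 32, 34, 31, 33, 35],
    "Moon": [0.1, 0.5, 0.9, 0.2, 0.6, 1.0],
})

# Iterating over df.groupby(...) gives one (key, sub-dataframe) pair per city.
groups = {
    city: sub.drop(columns=["City"])
    for city, sub in df.groupby("City", sort=False)
}

print(list(groups))          # the dictionary keys are the city names
print(groups["hyderabad"])   # rows for one city, without the City column
```

Doing the split by hand, as above, is still the better way to see what is going on; this is just the shortcut.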
# Apply
Let’s take any one dataframe from the groups
city = uniq_vals[1]
print(city)
# hyderabad
and apply the max() function
hyd_df = groups[city]
print(hyd_df)
max_values = hyd_df.max()
print(max_values)
Let’s create a dictionary out of this
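# Look the values up by column label; positional access such as
# max_values[0] is deprecated on a labelled Series in recent pandas.
temp_dict = {
    "S.no": max_values["S.no"],
    "Humidity": max_values["Humidity"],
    "Temp": max_values["Temp"],
    "Moon": max_values["Moon"],
}
df_after_apply = pd.DataFrame(temp_dict, index=[city])
print(df_after_apply)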
Let me write the whole process in a single cell
city = uniq_vals[1]
hyd_df = groups[city]
my_columns = ['S.no', 'Humidity', 'Temp', 'Moon']
max_values = hyd_df.max()
# Build the one-row record by looking up each column's maximum by label.
temp_dict = {col: max_values[col] for col in my_columns}
df_after_apply = pd.DataFrame(temp_dict, index=[city])
print(df_after_apply)
Let’s repeat the same for all cities
for city in uniq_vals:
    city_df = groups[city]
    my_columns = ['S.no', 'Humidity', 'Temp', 'Moon']
    max_values = city_df.max()
    # Look up each column's maximum by label.
    temp_dict = {col: max_values[col] for col in my_columns}
    df_after_apply = pd.DataFrame(temp_dict, index=[city])
    print(df_after_apply)
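Nothing in the Apply step is specific to max(). As a rough sketch, the loop above can be wrapped in a small helper that takes any function and runs it on every group. `apply_to_groups` is a hypothetical name made up for this example (it is not a pandas function), and the data is invented.

```python
import pandas as pd

# Hypothetical data; only the column names come from the article.
df = pd.DataFrame({
    "S.no": [1, 2, 3, 4, 5, 6],
    "City": ["delhi", "hyderabad", "chennai", "delhi", "hyderabad", "chennai"],
    "Humidity": [40, 55, 70, 45, 60, 75],
    "Temp": [30, 32, 34, 31, 33, 35],
    "Moon": [0.1, 0.5, 0.9, 0.2, 0.6, 1.0],
})

# Split: one sub-dataframe per city, without the City column.
groups = {
    city: df[df["City"] == city].drop(columns=["City"])
    for city in df["City"].unique()
}


def apply_to_groups(groups, func):
    """Hypothetical helper: run func on every sub-dataframe, keep the keys."""
    return {key: func(sub) for key, sub in groups.items()}


# Apply: the same split works for any aggregation.
max_per_city = apply_to_groups(groups, lambda sub: sub.max())
mean_per_city = apply_to_groups(groups, lambda sub: sub.mean())

print(max_per_city["hyderabad"])
print(mean_per_city["hyderabad"])
```

This is why groupby lets you chain max(), mean(), sum() and so on: the split stays the same and only the applied function changes.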
# Combine
Let us combine these three dataframes into a single dataframe
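frames = []  # collects one single-row dataframe per city
for city in uniq_vals:
    city_df = groups[city]
    my_columns = ['S.no', 'Humidity', 'Temp', 'Moon']
    max_values = city_df.max()
    # Look up each column's maximum by label.
    temp_dict = {col: max_values[col] for col in my_columns}
    temp_df = pd.DataFrame(temp_dict, index=[city])
    frames.append(temp_df)
# Stack the per-city rows into one dataframe.
final_df = pd.concat(frames)
print(final_df)
# rename_axis returns a new dataframe, so assign the result back.
final_df = final_df.rename_axis("City")
print(final_df)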
# Let's use groupby to do all of this in one line
df.groupby('City').max()
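To convince yourself that the manual steps and the one-liner really do the same thing, you can compare the two results. The sketch below runs on made-up data with the article's column names. Two small differences are handled explicitly: groupby sorts the group keys by default, so the manual result is sorted before comparing, and the manual version ends up with float columns, so the values are compared rather than the dtypes.

```python
import pandas as pd

# Hypothetical data; only the column names come from the article.
df = pd.DataFrame({
    "S.no": [1, 2, 3, 4, 5, 6],
    "City": ["delhi", "hyderabad", "chennai", "delhi", "hyderabad", "chennai"],
    "Humidity": [40, 55, 70, 45, 60, 75],
    "Temp": [30, 32, 34, 31, 33, 35],
    "Moon": [0.1, 0.5, 0.9, 0.2, 0.6, 1.0],
})

# Manual split, apply, combine.
rows = {}
for city in df["City"].unique():
    sub = df[df["City"] == city].drop(columns=["City"])  # split
    rows[city] = sub.max()                               # apply
manual = pd.DataFrame(rows).T.rename_axis("City")        # combine

# The one-liner.
built_in = df.groupby("City").max()

# groupby sorts the keys, so sort the manual result before comparing values.
same = (manual.sort_index() == built_in).all().all()
print(same)
```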
# Additional
As I wrote at the very beginning, groupby stores the filtered dataframes as key-value pairs in an object.
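You can peek inside that object yourself. In the sketch below (made-up data again), `ngroups`, `groups` and `get_group` are real attributes and methods of a pandas GroupBy object. Note that `groups` maps each key to the row labels that belong to it rather than to a full copy of the rows, so "key-value pairs of dataframes" is best read as a mental model.

```python
import pandas as pd

# Hypothetical data; only the column names come from the article.
df = pd.DataFrame({
    "S.no": [1, 2, 3, 4, 5, 6],
    "City": ["delhi", "hyderabad", "chennai", "delhi", "hyderabad", "chennai"],
    "Humidity": [40, 55, 70, 45, 60, 75],
    "Temp": [30, 32, 34, 31, 33, 35],
    "Moon": [0.1, 0.5, 0.9, 0.2, 0.6, 1.0],
})

grouped = df.groupby("City")

print(grouped.ngroups)                  # number of distinct keys
print(grouped.groups)                   # mapping: key -> row labels in that group
print(grouped.get_group("hyderabad"))   # the rows stored under one key
```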