Basics of Pandas for beginners

Sairam Penjarla
7 min read · May 22, 2022

Let us start with the most fundamental part of pandas, something that should become second nature as you keep practicing: its data types. Pandas is built around two core data types.

  1. Series
  2. DataFrame

A pandas series is a one-dimensional data type, meaning it has only one axis: its index, which labels each value. You can think of a series as a labeled list. Like a list, it supports len(), indexing, and iteration. We can create a series from any list-like object. The example below first defines a plain Python list and then overwrites it with a NumPy array; comment out the NumPy line and the series is built from the list instead. Either way you get a series, only the values differ. When creating the series you can also give it a name, and here I have named it marks. For now this name is of little use, because we refer to the series through its variable name and a standalone series has no columns to label. We shall see how the situation changes when it comes to the second pandas data type.

import numpy as np
import pandas as pd
data = [1, 2, 3, 4, 5]
data = np.array([85, 90, 70, 80])
series = pd.Series(data=data, name="marks")
print(series)
fig: Output
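
As a rough illustration of the list-like behaviour described above, the sketch below inspects a series built from the same marks. It only uses standard pandas attributes; the variable name total is introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Same marks as in the example above
series = pd.Series(data=np.array([85, 90, 70, 80]), name="marks")

print(series.ndim)      # number of axes of the series
print(series.index)     # the axis itself: default labels starting at 0
print(len(series))      # number of values, as with a list
print(series.iloc[0])   # first value, selected by position
print(series.name)      # the name given at creation

# Iterating yields the values one by one, as with a list
total = sum(value for value in series)
print(total)
```

The index is what separates a series from a plain list: every value carries a label, and pandas uses those labels when it lines up several series next to each other.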

Now that we have an understanding of series, let us take a deep dive into data frames. A data frame is a two-dimensional pandas data type that can be built by combining one or more series. Yes, a single series is enough to create a data frame. You can think of a data frame as a spreadsheet that we can use within Python, in which each series becomes a column. In the previous example we created a series named marks. When we create a data frame from that series, the name assigned to it becomes the name of the column. If no name is provided, pandas labels the columns 0, 1, 2, 3, 4 and so on.

data = np.array([85, 90, 70, 80])
series = pd.Series(data=data, name="marks")
dataframe = pd.DataFrame(data=series)
print(dataframe)
fig: Output
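
To make the "combining one or more series" idea concrete, here is a small sketch that places two named series side by side. Using pd.concat with axis=1 is one possible way to do this, not necessarily the one the article has in mind, and the age values are made up for illustration.

```python
import pandas as pd

# Two named series; the age values are made up for illustration
age = pd.Series([25, 25, 26, 24], name="age")
marks = pd.Series([85, 90, 70, 80], name="marks")

# Placing the series side by side (axis=1) turns each one into a column
combined = pd.concat([age, marks], axis=1)
print(combined)

# A series without a name falls back to a numeric column label
unnamed = pd.DataFrame(data=pd.Series([85, 90, 70, 80]))
print(unnamed.columns.tolist())
```

Each series name becomes a column name, which is where the name argument finally becomes useful.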

Another way of creating a data frame is from row values. In the example below we have a list of sub-lists, and each sub-list contains all the values of one row. All the sub-lists must have the same number of elements. When we create the data frame, the first row takes the values of the first sub-list, the second row those of the second sub-list, and so on. If you leave out the columns argument when creating the data frame, pandas automatically assigns 0, 1, 2, 3 and so on as the column names.

data = np.array([[25, 85], [25, 90], [26, 70], [24, 80]])
dataframe = pd.DataFrame(data=data, columns=["age", "marks"])
print(dataframe)
fig: Output
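
The sketch below shows what happens when the column names are left out, and, as a side note that goes slightly beyond the text, builds the same table column by column from a dictionary. The names no_names, by_row, and by_column are introduced here for illustration.

```python
import pandas as pd

rows = [[25, 85], [25, 90], [26, 70], [24, 80]]

# Without the columns argument, pandas numbers the columns itself
no_names = pd.DataFrame(data=rows)
print(no_names.columns.tolist())

# The same table built row by row and column by column
by_row = pd.DataFrame(data=rows, columns=["age", "marks"])
by_column = pd.DataFrame({"age": [25, 25, 26, 24], "marks": [85, 90, 70, 80]})
print(by_row.equals(by_column))
```

Which form to use depends mostly on how your data arrives: records tend to come as rows, while measurements often come as columns.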

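You can fetch a column of a data frame in two ways: with square brackets, as if indexing a list or dictionary, or as an attribute with dot notation. Both give the same result, but the first is used more often because it also works when the column name is stored in a variable, contains spaces, or clashes with an existing data frame method. Dot notation always looks for a column with the literal name written after the dot, so a variable cannot be used there.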

cname = 'age'
# 1. calling it in a list
dataframe[cname]
# or
dataframe['age']
# 2. calling it in a method
dataframe.age
# Uncomment the next line to get an error
# dataframe.cname
fig: Output
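
A minimal sketch of the cases where only square brackets work. The column names "final marks" and "count" are hypothetical and chosen to show a name with a space and a name that clashes with a data frame method.

```python
import pandas as pd

# Hypothetical columns: one plain name, one with a space, one that clashes
# with the built-in DataFrame.count method
dataframe = pd.DataFrame({"age": [25, 26], "final marks": [85, 70], "count": [1, 2]})

cname = "age"
print(dataframe[cname].tolist())           # a variable works inside brackets
print(dataframe["final marks"].tolist())   # so does a name containing a space
print(dataframe["count"].tolist())         # brackets always return the column
print(callable(dataframe.count))           # the dot finds the method instead

# The dot looks for a column literally called "cname", which does not exist
try:
    dataframe.cname
except AttributeError as error:
    print("AttributeError:", error)
```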

For the next few examples I am going to use a dataset that is popular among beginners: the wine dataset from scikit-learn. Let us load it into a data frame and store it in a variable named df, the name most examples on the internet use for data frames.

# Load some data
import pandas as pd
from sklearn.datasets import load_wine
wine = load_wine()
df = pd.DataFrame(data=wine.data, columns=wine.feature_names)
df
fig: Output

Let us now fetch a few columns from this data frame. We can do this by passing either a single column name as a string, which returns a series, or a list of column names, which returns a data frame.

# Select the 'alcohol' column
print(df['alcohol'])
print(type(df['alcohol']))
fig: Output
df[['alcohol', 'ash', 'hue']]
fig: Output
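
The difference between a string and a list of strings shows up in the type of the result. The sketch below uses a tiny made-up table (df_small is a hypothetical stand-in for the wine data) so it runs without scikit-learn.

```python
import pandas as pd

# Tiny made-up stand-in for the wine data
df_small = pd.DataFrame({
    "alcohol": [14.23, 13.20],
    "ash": [2.43, 2.14],
    "hue": [1.04, 1.05],
})

one = df_small["alcohol"]               # a string gives a series
one_as_frame = df_small[["alcohol"]]    # a one-item list gives a data frame
several = df_small[["alcohol", "hue"]]  # a longer list gives a data frame too

print(type(one).__name__)
print(type(one_as_frame).__name__)
print(several.shape)
```

The double brackets are simply a list inside the indexing brackets, which is why even a single name wrapped in a list returns a data frame.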

loc method

Let us learn about “loc”. This data frame indexer, used with square brackets rather than parentheses, selects rows and columns by their labels, so it is very useful for fetching columns by their column names. The general syntax is
df.loc[starting row : ending row, starting column : ending column], and unlike ordinary Python slices, both ends of a loc slice are included. You can also select a subset of rows for a single column with df.loc[1:100, "column_name"].

df.loc[:, 'alcohol']
fig: Output
df.loc[:, ['alcohol', 'ash', 'hue']]
fig: Output
df.loc[0, 'alcohol']
fig: Output
df.loc[[0, 5, 100], ['alcohol', 'ash', 'hue']]
fig: Output
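
Because loc works on labels, it behaves the same way when the row labels are not numbers. The sketch below assumes a small made-up table with hypothetical row labels w1 to w4 and shows that a label slice includes both ends.

```python
import pandas as pd

# Made-up values with hypothetical text labels for the rows
wines = pd.DataFrame(
    {
        "alcohol": [14.2, 13.2, 13.1, 14.4],
        "ash": [2.4, 2.1, 2.6, 2.5],
        "hue": [1.04, 1.05, 1.03, 0.86],
    },
    index=["w1", "w2", "w3", "w4"],
)

# Row slice and column slice by label; both ends are part of the result
subset = wines.loc["w2":"w3", "alcohol":"ash"]
print(subset)

# A single row label and a single column label return one value
print(wines.loc["w4", "hue"])
```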

iloc method

Let us now learn a similar method, iloc. The i in iloc stands for integer, as in integer location. It accepts the same kinds of selections as loc: single values, lists, and slices. The difference is that rows and columns are addressed by their integer positions rather than their labels, and that an iloc slice excludes its end position, just like a Python list slice.

df.iloc[:, 0]
fig: Output
df.iloc[:, [0, 2, 10]]
fig: Output
df.iloc[:, 0:5]
fig: Output
df.shape
# Output: (178, 13)
a = df.iloc[:, 12]
a
fig: Output
b = df.iloc[:, -1]
b
fig: Output
print(a == b)
print("_" * 100)
print(type(a == b))
print('_' * 100)
all(a == b)
fig: Output
print(type(df.iloc[0]))
df.iloc[0]
fig: Output
print(type(df.iloc[[0]]))
df.iloc[[0]]
fig: Output
df.iloc[[0, 25, 100]]
fig: Output
df.iloc[[-1, -2, -3, -4, -5]]
fig: Output
df.iloc[0, 0]
fig: Output
df.iloc[[0, 5, 100], [0, 3, 7]]
fig: Output
df.iloc[0:6, 0:5]
fig: Output
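
With the default numeric row labels, loc and iloc look interchangeable, but they are not. The sketch below, on a small made-up table (df_small and sorted_df are hypothetical names), shows the two places where they differ: the end of a slice, and what happens once the rows are reordered.

```python
import pandas as pd

# Made-up values with the default row labels 0 to 4
df_small = pd.DataFrame({
    "alcohol": [14.2, 13.2, 13.1, 14.4, 13.0],
    "ash": [2.4, 2.1, 2.6, 2.5, 2.8],
})

by_position = df_small.iloc[0:3]   # positions 0, 1, 2: the end is excluded
by_label = df_small.loc[0:3]       # labels 0, 1, 2, 3: the end is included
print(len(by_position), len(by_label))

# After sorting, labels travel with their rows but positions do not
sorted_df = df_small.sort_values("alcohol")
print(sorted_df.iloc[0]["alcohol"])   # whichever row is now first
print(sorted_df.loc[0, "alcohol"])    # the row that still carries label 0
```

A reasonable rule of thumb is to use iloc when you mean "the n-th row" and loc when you mean "the row called n".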

Conditional dataframes

We can fetch or create a subset of our data frame with the help of conditional statements. A condition applied to a column is evaluated for every row and returns a boolean value for each one. We can then use this boolean array to filter out the rows that have a ‘False’ value, meaning those that do not satisfy the condition.
To understand it better, let us write a simple conditional statement. In the example below we compare every row of the alcohol column against 14.3 and check whether its value is greater. The result is a pandas series of boolean values. We can then use these boolean values to fetch the rows that have a ‘True’ value.

df['alcohol'] > 14.3
fig: Output
df[df['alcohol'] > 14.3]
fig: Output
df.loc[df['alcohol'] > 14.3, ['alcohol', 'ash', 'hue']]
fig: Output
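
It can help to store the boolean series in a variable and look at it before filtering. The following sketch does that on a small made-up table; the names df_small, mask, strong, and not_strong are introduced here for illustration.

```python
import pandas as pd

# Made-up values standing in for the wine data
df_small = pd.DataFrame({
    "alcohol": [14.23, 13.20, 14.38, 14.75],
    "hue": [1.04, 1.05, 1.25, 1.25],
})

# Step 1: the comparison produces one boolean per row
mask = df_small["alcohol"] > 14.3
print(mask.tolist())

# Step 2: True counts as 1, so the sum is the number of matching rows
print(int(mask.sum()))

# Step 3: indexing with the mask keeps only the rows marked True
strong = df_small[mask]
print(strong)

# The ~ operator flips the mask and keeps the remaining rows
not_strong = df_small[~mask]
print(not_strong)
```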

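You can also combine conditions with logical and (&), logical or (|), and the between method. Wrap each condition in parentheses, because & and | bind more tightly than comparison operators. Note that between includes both endpoints by default.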

df[(df['alcohol'] > 14.3) & (df['alcohol'] < 14.6)]
fig: Output
df[df['alcohol'].between(14.3, 14.6)]
fig: Output
df[(df['alcohol'] > 14.3) & (df['hue'] > 1.0)]
fig: Output
df[ (df['alcohol'] > 14.5) | (df['hue'] > 1.4) ]
fig: Output
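
The strict comparison with > and < and the call to between look alike but treat the boundary values differently. The sketch below shows the difference on made-up values; the inclusive="neither" option assumes pandas 1.3 or newer.

```python
import pandas as pd

# Made-up values that sit exactly on and between the boundaries
df_small = pd.DataFrame({"alcohol": [14.3, 14.4, 14.6, 13.9]})

# Strict comparison: the boundary values 14.3 and 14.6 are left out
strict = df_small[(df_small["alcohol"] > 14.3) & (df_small["alcohol"] < 14.6)]

# between keeps both boundary values by default
inclusive = df_small[df_small["alcohol"].between(14.3, 14.6)]

# Assumes pandas 1.3+: drop both boundaries to match the strict comparison
exclusive = df_small[df_small["alcohol"].between(14.3, 14.6, inclusive="neither")]

print(len(strict), len(inclusive), len(exclusive))
```

So the two filters in the examples above can return different rows whenever a wine sits exactly on 14.3 or 14.6.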

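idxmin and idxmax are two frequently used methods that return the index labels of the rows holding the minimum and maximum values of a column, respectively, not the values themselves. Because they return labels, pass them to loc to fetch the matching rows.

df['alcohol'].idxmin()
df['alcohol'].idxmax()
df.loc[[df['alcohol'].idxmin(), df['alcohol'].idxmax()]]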
fig: Output
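
The distinction between a label and a value is easier to see when the row labels are not numbers. The sketch below assumes a made-up table with hypothetical labels w1 to w3.

```python
import pandas as pd

# Made-up values with hypothetical text labels for the rows
wines = pd.DataFrame({"alcohol": [13.2, 14.8, 11.0]}, index=["w1", "w2", "w3"])

low_label = wines["alcohol"].idxmin()    # label of the row with the smallest value
high_label = wines["alcohol"].idxmax()   # label of the row with the largest value
print(low_label, high_label)

# min() and max() return the values themselves
print(wines["alcohol"].min(), wines["alcohol"].max())

# Labels go to loc, which returns the full rows
extremes = wines.loc[[low_label, high_label]]
print(extremes)
```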
