Jupyter Notebook: Part 2

Divya Sikka
4 min read · Feb 23, 2022

Pandas library in Python to explore data

Explore data using Pandas in Jupyter Notebook

Prerequisite: Please review Jupyter Notebook: Part 1 to set up your Jupyter Notebook environment.

Get Pandas

Pandas is an open-source Python library for data analysis and data manipulation. It can be used to read, write, explore, and visualize data.

Pandas is not included in a standard Python installation. Install it as follows:

  • Run Command Prompt as administrator.
  • Enter "pip install pandas".
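
Once the install finishes, a quick way to confirm it worked is to import the library in a notebook cell and print its version. This is only a sanity-check sketch; the variable name installed_version is made up for illustration, and the version printed depends on your environment.

```python
import pandas as pd

# If this import succeeds, pandas is available to the Python
# interpreter that the notebook kernel is running.
installed_version = pd.__version__
print("pandas version:", installed_version)
```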

Download an example dataset

Download the Iris dataset locally. It is one of the most popular datasets for learning data analysis.

Download the CSV file (source: https://datahub.io/machine-learning/iris#resource-iris)

Data Exploration

Using Jupyter Notebook, explore the dataset to understand the data:

Import pandas

Code: import pandas as pd

Read the dataset

Code: iris_dataframe = pd.read_csv("<path to downloaded csv>")
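
If you want to try read_csv before downloading anything, here is a minimal sketch that reads a few rows from an in-memory string instead of a file. The three rows and the name sample_dataframe are placeholders for illustration only. The column names are the ones used in the rest of this article; it is worth checking that the headers in your downloaded file match them.

```python
import io
import pandas as pd

# Placeholder rows standing in for the downloaded CSV file.
csv_text = """Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,5.1,3.5,1.4,0.2,Iris-setosa
2,7.0,3.2,4.7,1.4,Iris-versicolor
3,6.3,3.3,6.0,2.5,Iris-virginica
"""

# read_csv accepts a file path or any file-like object.
sample_dataframe = pd.read_csv(io.StringIO(csv_text))

print(sample_dataframe)
print(sample_dataframe.dtypes)  # numeric columns are parsed automatically
```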

Some common Pandas functions that can be used to explore the data

head(): displays the first 5 rows of the dataset by default; pass a number, as in head(10), to see more

Code: iris_dataframe.head()

sample(n): displays n randomly selected rows from the dataset

Code: iris_dataframe.sample(10)
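
Because sample() picks rows at random, you get different rows each time you run the cell. If you want the same rows every time, you can pass random_state. The sketch below uses a small made-up DataFrame (df, with an Id column from 1 to 20) rather than the Iris file.

```python
import pandas as pd

# Made-up data: 20 rows with an Id column, for illustration only.
df = pd.DataFrame({"Id": list(range(1, 21))})

# The same random_state gives the same rows on every run.
first_draw = df.sample(5, random_state=42)
second_draw = df.sample(5, random_state=42)

print(first_draw)
print(first_draw.equals(second_draw))
```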

shape: an attribute (no parentheses) that returns the number of rows and columns in the dataset as a tuple

Code: iris_dataframe.shape

columns: an attribute (no parentheses) that holds the names of all the columns in the dataset

Code: iris_dataframe.columns
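
Both shape and columns are attributes, so they are written without parentheses. Here is a small sketch of how their values can be used in code. The DataFrame df and its four rows are made up for illustration.

```python
import pandas as pd

# Made-up rows that reuse some of the article's column names.
df = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "PetalWidthCm": [0.2, 0.2, 1.4, 2.5],
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

# shape is a (rows, columns) tuple, so it can be unpacked.
n_rows, n_cols = df.shape

# columns is an Index object; list() turns it into a plain list.
column_names = list(df.columns)

print(n_rows, n_cols)
print(column_names)
```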

Display specific rows

# this example prints rows 5 to 10

Code: iris_dataframe[5:11]
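
Slicing with square brackets works by position, and the end of the slice is not included. That is why [5:11] returns rows 5 to 10. A minimal sketch on a made-up 12-row DataFrame (df) shows this:

```python
import pandas as pd

# Made-up data: 12 rows, default index 0..11, Id runs from 1 to 12.
df = pd.DataFrame({"Id": list(range(1, 13))})

# Positions 5, 6, 7, 8, 9, 10 are returned; position 11 is excluded.
subset = df[5:11]

print(subset)
```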

Display specific columns

# this example prints first 10 rows for only columns Id and Species

Code: iris_dataframe[["Id", "Species"]].head(10)

Select data or filter data

loc is label-based. It is used with square brackets, as in loc[...], and you select or filter data by the label of the row or column, or by a True/False condition.

# In this example, filter on data where Species is Iris-setosa and PetalWidthCm >0.4

Code: iris_dataframe.loc[(iris_dataframe["Species"] == "Iris-setosa") & (iris_dataframe["PetalWidthCm"] > 0.4)]
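
It can help to build a filter like this one step at a time. Each comparison produces a column of True/False values, and & combines them row by row. The sketch below uses three made-up rows; the names df, is_setosa, is_wide, mask and filtered are only for illustration.

```python
import pandas as pd

# Made-up rows for illustration.
df = pd.DataFrame({
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-versicolor"],
    "PetalWidthCm": [0.2, 0.5, 1.4],
})

# Each comparison returns a Series of True/False values.
is_setosa = df["Species"] == "Iris-setosa"
is_wide = df["PetalWidthCm"] > 0.4

# & keeps only the rows where both conditions are True.
mask = is_setosa & is_wide
filtered = df.loc[mask]

print(mask.tolist())
print(filtered)
```

The parentheses around each condition in the one-line version matter, because & is evaluated before == and > in Python.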

# In this example, use loc to select the rows labeled 11 to 13 (both ends are included)

Code: iris_dataframe.loc[11:13]

iloc is position-based. It is also used with square brackets, as in iloc[...], and you select rows or columns by their integer position, starting from 0.

# In this example, select the row at position 5 (the sixth row)

Code: iris_dataframe.iloc[5]
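
The difference between selecting by label and selecting by position is easier to see when the row labels do not start at 0. Here is a minimal sketch on a made-up DataFrame (df) whose rows are labeled 10 to 13:

```python
import pandas as pd

# Made-up rows; the index labels are 10..13 instead of the default 0..3.
df = pd.DataFrame(
    {"Species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"]},
    index=[10, 11, 12, 13],
)

# loc slices by label and includes both ends.
by_label = df.loc[11:12]

# iloc slices by position and excludes the end.
by_position = df.iloc[1:3]

# A single position returns one row as a Series.
first_row = df.iloc[0]

print(by_label)
print(by_position)
print(first_row)
```

In this sketch both slices return the same two rows, but they get there in different ways: one by the labels 11 and 12, the other by the positions 1 and 2.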

Calculate sum, mean, median for a specific column

Code:

col_sum = iris_dataframe["PetalWidthCm"].sum()

col_mean = iris_dataframe["PetalWidthCm"].mean()

col_median = iris_dataframe["PetalWidthCm"].median()

print("Sum:", col_sum, "\nMean:", col_mean, "\nMedian:", col_median)

Get min, max for a specific column

Code:

col_min = iris_dataframe["PetalWidthCm"].min()

col_max = iris_dataframe["PetalWidthCm"].max()

print("Minimum:", col_min, "\nMaximum:", col_max)
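
If you want several of these numbers at once, one option is agg(), which takes a list of function names and returns all the results together. This is a sketch on four made-up values; the names df and summary are only for illustration.

```python
import pandas as pd

# Made-up petal widths for illustration.
df = pd.DataFrame({"PetalWidthCm": [0.2, 0.4, 1.3, 2.5]})

# agg() runs each named function on the column and returns a Series
# indexed by the function names.
summary = df["PetalWidthCm"].agg(["sum", "mean", "median", "min", "max"])

print(summary)
```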

value_counts(): counts how many times each distinct value occurs in a column

Code: iris_dataframe["Species"].value_counts()
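
The result is sorted so the most frequent value comes first. If you would rather see proportions than raw counts, you can pass normalize=True. A small sketch on five made-up rows (the names df, counts and shares are only for illustration):

```python
import pandas as pd

# Made-up rows: three setosa, two versicolor.
df = pd.DataFrame({
    "Species": [
        "Iris-setosa", "Iris-setosa", "Iris-setosa",
        "Iris-versicolor", "Iris-versicolor",
    ]
})

# Raw counts, most frequent value first.
counts = df["Species"].value_counts()

# Fractions of the total instead of counts.
shares = df["Species"].value_counts(normalize=True)

print(counts)
print(shares)
```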

Data Manipulation

Add columns

Code:

iris_dataframe["new_col"] = iris_dataframe["PetalWidthCm"] * 10

iris_dataframe.head()
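
Assigning to a new column name changes the DataFrame in place. If you would rather keep the original untouched, assign() returns a copy with the extra column. Here is a minimal sketch on made-up data; the names df and with_new_col are only for illustration.

```python
import pandas as pd

# Made-up petal widths for illustration.
df = pd.DataFrame({"PetalWidthCm": [0.2, 1.4, 2.5]})

# assign() returns a new DataFrame; df itself is not modified.
with_new_col = df.assign(new_col=df["PetalWidthCm"] * 10)

print(with_new_col)
print(list(df.columns))
```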

Rename columns

Code:

renamed_cols = {

"SepalLengthCm": "sepalLength",

"SepalWidthCm": "sepalWidth",

"PetalLengthCm": "petalLength",

"PetalWidthCm": "petalWidth"}

iris_dataframe.rename(columns=renamed_cols, inplace=True)

iris_dataframe.head()
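
With inplace=True the DataFrame itself is changed. Without it, rename() returns a new DataFrame and leaves the original as it was, which can be safer while you are experimenting. A small sketch on made-up data (the names df, mapping and renamed are only for illustration):

```python
import pandas as pd

# Made-up row for illustration.
df = pd.DataFrame({
    "SepalLengthCm": [5.1],
    "Species": ["Iris-setosa"],
})

# "PetalWidthCm" is not a column of df; by default rename() skips
# mapping keys that do not match any column.
mapping = {"SepalLengthCm": "sepalLength", "PetalWidthCm": "petalWidth"}

# Without inplace=True, a renamed copy is returned.
renamed = df.rename(columns=mapping)

print(list(renamed.columns))
print(list(df.columns))
```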

Conditional formatting

Code: iris_dataframe.head(10).style.highlight_max()

Find and remove missing values

isnull(): returns True for each cell where data is missing, and False otherwise

Code: iris_dataframe.isnull()

# This example counts the missing values in each column

Code: iris_dataframe.isnull().sum()
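
Once you know where the missing values are, one common way to remove them is dropna(), which drops every row that has at least one missing value. This is only a sketch on made-up data with gaps added on purpose; the names df, missing_per_column and cleaned are for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows with one missing number and one missing label.
df = pd.DataFrame({
    "PetalWidthCm": [0.2, np.nan, 1.4],
    "Species": ["Iris-setosa", "Iris-versicolor", None],
})

# Count the missing values in each column.
missing_per_column = df.isnull().sum()

# Drop rows with any missing value; df itself is left unchanged.
cleaned = df.dropna()

print(missing_per_column)
print(cleaned)
```

Dropping rows is the simplest choice, but it throws data away, so it is worth checking how many rows you lose before relying on it.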

These are some of the functions you can use to explore and manipulate your data to prepare for data analysis.
