Jupyter Notebook — Part 2
Explore data using Pandas in Jupyter Notebook
Prerequisite: Please review Jupyter Notebook — Part 1 to set up your Jupyter Notebook environment.
Get Pandas
Pandas is an open-source Python library for data analysis and data manipulation. It can be used to read, write, explore and visualize data.
Pandas is not included in a standard Python installation. Install it as follows:
- Run Command Prompt as administrator.
- Enter: pip install pandas
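To confirm that the install worked, one quick check (a small sketch, not one of the original steps) is to import the library in a notebook cell and print its version. The variable name installed_version is only for illustration.

```python
import pandas as pd

# If this import succeeds, Pandas is available to the Python that runs the notebook
installed_version = pd.__version__
print("Pandas version:", installed_version)
```

If the import fails with ModuleNotFoundError, the notebook is probably running a different Python installation than the one pip installed into.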
Download an example dataset
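Download the Iris dataset locally. It is one of the most popular datasets for learning data analysis.
Download the CSV file (source: https://datahub.io/machine-learning/iris#resource-iris)
Note: the examples in this article assume the CSV has the columns Id, SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm and Species. If your copy uses different header names, adjust the column names in the code accordingly.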
Data Exploration
Using Jupyter Notebook, explore the dataset to understand the data:
Import pandas
Code: import pandas as pd
Read the dataset
Code: iris_dataframe = pd.read_csv("<path to downloaded csv>")
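If you want to try read_csv before downloading anything, here is a minimal sketch that reads CSV text from memory instead of from a file. The three rows are made-up values in the column layout this article assumes, not the real dataset, and the names csv_text and toy_dataframe are only for illustration.

```python
import io

import pandas as pd

# Made-up CSV text standing in for the downloaded file (illustration only)
csv_text = """Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,5.1,3.5,1.4,0.2,Iris-setosa
2,7.0,3.2,4.7,1.4,Iris-versicolor
3,6.3,3.3,6.0,2.5,Iris-virginica
"""

# read_csv accepts a file path or a file-like object such as StringIO
toy_dataframe = pd.read_csv(io.StringIO(csv_text))
print(toy_dataframe)
```

The first line of the CSV becomes the column names, and each following line becomes one row of the DataFrame.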
Some common Pandas functions and attributes that can be used to explore the data:
head(): displays the first 5 rows of the dataset by default; pass a number, such as head(10), to display a different number of rows
Code: iris_dataframe.head()
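As a small sketch of how head() behaves with and without an argument, and of its counterpart tail(), the following uses a made-up eight-row table (toy_dataframe is a hypothetical name and the values are not from the real dataset).

```python
import pandas as pd

# Made-up values for illustration only
toy_dataframe = pd.DataFrame({
    "Id": range(1, 9),
    "PetalWidthCm": [0.2, 0.2, 0.4, 1.3, 1.5, 1.8, 2.1, 2.3],
})

default_rows = toy_dataframe.head()   # first 5 rows by default
first_three = toy_dataframe.head(3)   # first 3 rows
last_two = toy_dataframe.tail(2)      # last 2 rows

print(first_three)
print(last_two)
```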
sample(n): displays n randomly selected rows from the dataset
Code: iris_dataframe.sample(10)
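Because sample() picks rows at random, each run normally returns different rows. If you need the same rows every time, for example to share a notebook, one option is the random_state parameter. This is a sketch on a made-up table, and 42 is an arbitrary seed.

```python
import pandas as pd

# Made-up table with ten rows, for illustration only
toy_dataframe = pd.DataFrame({"Id": range(1, 11)})

# The same random_state gives the same selection on every call
sample_a = toy_dataframe.sample(3, random_state=42)
sample_b = toy_dataframe.sample(3, random_state=42)

print(sample_a)
print(sample_a.equals(sample_b))
```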
shape: an attribute, not a function (so no parentheses), that returns the number of rows and columns in the dataset as a tuple
Code: iris_dataframe.shape
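Since shape is a tuple of (rows, columns), it can be unpacked into two variables. A minimal sketch on a made-up table (the names here are hypothetical):

```python
import pandas as pd

# Made-up values for illustration only
toy_dataframe = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

# shape is an attribute, so there are no parentheses
row_count, column_count = toy_dataframe.shape
print("Rows:", row_count, "Columns:", column_count)

# len() on a DataFrame returns the number of rows
print(len(toy_dataframe))
```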
columns: an attribute (no parentheses) that lists all the column names of the dataset
Code: iris_dataframe.columns
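The columns attribute returns a Pandas Index object rather than a plain list. If a list is easier to work with, it can be converted, and the related dtypes attribute shows the data type of each column. A sketch on made-up data:

```python
import pandas as pd

# Made-up values for illustration only
toy_dataframe = pd.DataFrame({
    "Id": [1, 2, 3],
    "PetalWidthCm": [0.2, 1.4, 2.5],
    "Species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

# Convert the Index of column names to a regular Python list
column_names = list(toy_dataframe.columns)
print(column_names)

# dtypes shows the data type stored in each column
print(toy_dataframe.dtypes)
```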
Display specific rows
# this example prints rows 5 to 10, counting from 0 (the end of the slice, 11, is excluded)
Code: iris_dataframe[5:11]
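Row slicing follows the same rule as Python list slicing: the start position is included and the end position is not. The sketch below checks this on a made-up twelve-row table.

```python
import pandas as pd

# Made-up table: Id runs from 1 to 12, row positions run from 0 to 11
toy_dataframe = pd.DataFrame({"Id": range(1, 13)})

# Positions 5, 6, 7, 8, 9 and 10 are returned; position 11 is excluded
subset = toy_dataframe[5:11]
print(subset)
```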
Display specific columns
# this example prints the first 10 rows for only the columns Id and Species
Code: iris_dataframe[["Id", "Species"]].head(10)
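The double square brackets matter here. A single column name in single brackets returns a Series (one column), while a list of names in double brackets returns a DataFrame. A short sketch on made-up data:

```python
import pandas as pd

# Made-up values for illustration only
toy_dataframe = pd.DataFrame({
    "Id": [1, 2, 3],
    "PetalWidthCm": [0.2, 1.4, 2.5],
    "Species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

one_column = toy_dataframe["Species"]            # a Series
two_columns = toy_dataframe[["Id", "Species"]]   # a DataFrame

print(type(one_column))
print(type(two_columns))
```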
Select data or filter data
loc is label-based. You specify the label (name) of the rows or columns, or a True/False condition, to select or filter data. Note that loc uses square brackets, not parentheses.
# In this example, filter for rows where Species is Iris-setosa and PetalWidthCm > 0.4
Code: iris_dataframe.loc[(iris_dataframe["Species"] == "Iris-setosa") & (iris_dataframe["PetalWidthCm"] > 0.4)]
# In this example, use loc to select the rows labeled 11 to 13 (unlike a regular slice, both ends are included)
Code: iris_dataframe.loc[11:13]
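The two loc examples can be tried on a small table as follows. This is a sketch with made-up rows, and it also shows that loc can take a second argument to pick columns at the same time. The names mask, filtered and labeled are hypothetical.

```python
import pandas as pd

# Made-up values for illustration only
toy_dataframe = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5, 6],
    "PetalWidthCm": [0.2, 0.5, 1.4, 1.5, 2.5, 1.9],
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-versicolor",
                "Iris-versicolor", "Iris-virginica", "Iris-virginica"],
})

# Each condition needs its own parentheses when combined with &
mask = (toy_dataframe["Species"] == "Iris-setosa") & (toy_dataframe["PetalWidthCm"] > 0.4)

# Rows where the mask is True, and only the two named columns
filtered = toy_dataframe.loc[mask, ["Id", "PetalWidthCm"]]
print(filtered)

# Label slice: rows labeled 1, 2 and 3 (the end label is included)
labeled = toy_dataframe.loc[1:3]
print(labeled)
```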
iloc is position-based. You specify rows or columns by their integer position, counting from 0. Like loc, it uses square brackets.
# In this example, select the row at position 5
Code: iris_dataframe.iloc[5]
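On a freshly loaded dataset the row labels and row positions are the same numbers, so loc and iloc can look interchangeable. The difference shows up once the rows are reordered. Here is a sketch on made-up data (sorted_dataframe is a hypothetical name):

```python
import pandas as pd

# Made-up values for illustration only
toy_dataframe = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "PetalWidthCm": [0.2, 1.4, 2.5, 1.9],
})

# After sorting, rows keep their original labels but change position
sorted_dataframe = toy_dataframe.sort_values("PetalWidthCm", ascending=False)

by_position = sorted_dataframe.iloc[0]  # first row as displayed (widest petal)
by_label = sorted_dataframe.loc[0]      # row whose label is 0 (originally first)

print(by_position)
print(by_label)
```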
Calculate sum, mean, median for a specific column
Code:
col_sum = iris_dataframe["PetalWidthCm"].sum()
col_mean = iris_dataframe["PetalWidthCm"].mean()
col_median = iris_dataframe["PetalWidthCm"].median()
print("Sum:", col_sum, "\nMean:", col_mean, "\nMedian:", col_median)
Get min, max for a specific column
Code:
col_min = iris_dataframe["PetalWidthCm"].min()
col_max = iris_dataframe["PetalWidthCm"].max()
print("Minimum:", col_min, "\nMaximum:", col_max)
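As an alternative to calling each function separately, the statistics above can be collected in one step with agg(), or with describe(), which returns a standard set of summary statistics. This is a sketch on made-up values, not the real dataset.

```python
import pandas as pd

# Made-up values for illustration only
toy_dataframe = pd.DataFrame({
    "PetalWidthCm": [0.2, 0.5, 1.4, 1.5, 2.5, 1.9],
})

# agg() with a list of function names returns one value per name
summary = toy_dataframe["PetalWidthCm"].agg(["sum", "mean", "median", "min", "max"])
print(summary)

# describe() returns count, mean, std, min, quartiles and max
stats = toy_dataframe["PetalWidthCm"].describe()
print(stats)
```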
value_counts(): counts the number of times each distinct value occurs in a column
Code: iris_dataframe["Species"].value_counts()
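If proportions are more useful than raw counts, value_counts() accepts normalize=True. A short sketch on a made-up column:

```python
import pandas as pd

# Made-up values for illustration only
toy_dataframe = pd.DataFrame({
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-setosa",
                "Iris-versicolor", "Iris-versicolor", "Iris-virginica"],
})

# Counts per distinct value, most frequent first
counts = toy_dataframe["Species"].value_counts()
print(counts)

# Share of rows per distinct value (the shares add up to 1)
shares = toy_dataframe["Species"].value_counts(normalize=True)
print(shares)
```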
Data Manipulation
Add columns
Code:
iris_dataframe["new_col"] = iris_dataframe["PetalWidthCm"] * 10
iris_dataframe.head()
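Assigning to a new column name changes the DataFrame in place. If you would rather keep the original untouched, one option is assign(), which returns a new DataFrame with the extra column. A sketch on made-up data (with_new_col is a hypothetical name):

```python
import pandas as pd

# Made-up values for illustration only
toy_dataframe = pd.DataFrame({
    "Id": [1, 2, 3],
    "PetalWidthCm": [0.2, 1.4, 2.5],
})

# assign() returns a copy with the added column; the original is unchanged
with_new_col = toy_dataframe.assign(new_col=toy_dataframe["PetalWidthCm"] * 10)

print(with_new_col)
print(list(toy_dataframe.columns))
```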
Rename columns
Code:
renamed_cols = {
    "SepalLengthCm": "sepalLength",
    "SepalWidthCm": "sepalWidth",
    "PetalLengthCm": "petalLength",
    "PetalWidthCm": "petalWidth"}
iris_dataframe.rename(columns=renamed_cols, inplace=True)
iris_dataframe.head()
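With inplace=True the DataFrame itself is modified, so any later code that still uses the old column names will fail. Leaving inplace out returns a renamed copy instead. The sketch below shows that variant on made-up data; the variable names are hypothetical.

```python
import pandas as pd

# Made-up values for illustration only
toy_dataframe = pd.DataFrame({
    "Id": [1, 2, 3],
    "PetalWidthCm": [0.2, 1.4, 2.5],
})

# Without inplace=True, rename() returns a new DataFrame
renamed_dataframe = toy_dataframe.rename(columns={"PetalWidthCm": "petalWidth"})

print(list(renamed_dataframe.columns))
print(list(toy_dataframe.columns))
```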
Conditional formatting
Code: iris_dataframe.head(10).style.highlight_max()
Find and remove missing values
isnull(): returns a table of the same size that shows True for each missing value and False otherwise
Code: iris_dataframe.isnull()
# this example returns the number of missing values in each column
Code: iris_dataframe.isnull().sum()
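Once missing values are found, two common ways to handle them are dropna(), which removes rows that contain them, and fillna(), which replaces them. The sketch below uses a made-up table with gaps added on purpose. Filling with the column mean is just one possible choice, not a recommendation for every dataset.

```python
import pandas as pd

# Made-up values for illustration only; None marks a missing value
toy_dataframe = pd.DataFrame({
    "PetalWidthCm": [0.2, None, 1.4, 2.5],
    "Species": ["Iris-setosa", "Iris-setosa", None, "Iris-virginica"],
})

# Number of missing values in each column
missing_per_column = toy_dataframe.isnull().sum()
print(missing_per_column)

# Remove every row that has at least one missing value
without_missing = toy_dataframe.dropna()
print(without_missing)

# Replace missing PetalWidthCm values with the column mean
filled = toy_dataframe.fillna({"PetalWidthCm": toy_dataframe["PetalWidthCm"].mean()})
print(filled)
```

Both dropna() and fillna() return a new DataFrame here, so toy_dataframe itself still contains the missing values.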
These are some of the functions you can use to explore and manipulate your data to prepare for data analysis.