Pandas for Data Analysis: The Practical Guide
Master the essential Pandas operations for real data work — loading, cleaning, transforming, and summarizing data with practical examples.
Why Pandas
Before Pandas, data analysis in Python was painful. NumPy arrays don't handle mixed types or missing values well. Pandas solved both problems with the DataFrame — a two-dimensional table that handles the messy reality of real data.
Pandas is the first library any data scientist installs. Virtually every dataset you'll work with starts with pd.read_csv().
Core Concepts
Series
A one-dimensional labeled array:
import pandas as pd
scores = pd.Series([85, 92, 78, 95, 88],
index=['Alice', 'Bob', 'Carol', 'Dave', 'Eve'])
print(scores['Bob']) # 92
print(scores.mean()) # 87.6
DataFrame
A two-dimensional table — the workhorse of Pandas:
data = {
'name': ['Alice', 'Bob', 'Carol', 'Dave'],
'score': [85, 92, 78, 95],
'course': ['Python', 'ML', 'Python', 'DS'],
'passed': [True, True, False, True]
}
df = pd.DataFrame(data)
print(df.head())
name score course passed
0 Alice 85 Python True
1 Bob 92 ML True
2 Carol 78 Python False
3 Dave 95 DS True
Loading Data
# CSV — most common
df = pd.read_csv('students.csv')
# Excel
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
# JSON
df = pd.read_json('data.json')
# From a URL
df = pd.read_csv('https://example.com/data.csv')
Exploring Your Data
Always start by understanding what you have:
print(df.shape) # (rows, columns)
print(df.dtypes) # data types per column
print(df.info()) # full summary
print(df.describe()) # statistics for numeric columns
print(df.head(10)) # first 10 rows
print(df.isnull().sum()) # missing values per column
describe() is particularly useful — it shows count, mean, std, min, quartiles, and max for every numeric column in one call.
Selecting Data
Selecting Columns
# Single column → Series
names = df['name']
# Multiple columns → DataFrame
subset = df[['name', 'score']]
Selecting Rows
# By index
first_row = df.iloc[0] # first row (integer position)
first_five = df.iloc[:5] # first 5 rows
# By label
row = df.loc[0] # row with index label 0
Filtering Rows
This is where Pandas shines:
# Students who passed
passed = df[df['passed'] == True]
# Scores above 80
high_scores = df[df['score'] > 80]
# Python course students who passed
python_pass = df[(df['course'] == 'Python') & (df['passed'] == True)]
# Multiple values with isin
top_courses = df[df['course'].isin(['Python', 'ML'])]
Data Cleaning
Real data is messy. Pandas gives you the tools to fix it.
Handling Missing Values
# Check missing values
print(df.isnull().sum())
# Drop rows with any missing values
df_clean = df.dropna()
# Drop columns with too many missing values
df_clean = df.dropna(axis=1, thresh=len(df) * 0.8)
# Fill missing values
df['score'].fillna(df['score'].mean(), inplace=True) # fill with mean
df['course'].fillna('Unknown', inplace=True) # fill with value
Handling Duplicates
print(df.duplicated().sum()) # count duplicates
df = df.drop_duplicates() # remove duplicates
Type Conversion
df['score'] = df['score'].astype(float)
df['date'] = pd.to_datetime(df['date'])
df['category'] = df['category'].astype('category') # memory efficient
Renaming Columns
df.rename(columns={'name': 'student_name', 'score': 'test_score'}, inplace=True)
# Or standardize all at once
df.columns = df.columns.str.lower().str.replace(' ', '_')
Transforming Data
Adding Columns
# Derived column
df['grade'] = df['score'].apply(lambda x: 'A' if x >= 90 else 'B' if x >= 80 else 'C')
# Vectorized operations (faster than apply for numeric)
df['score_normalized'] = (df['score'] - df['score'].min()) / (df['score'].max() - df['score'].min())
Groupby: The Power Tool
GroupBy is Pandas' most powerful feature for summarization:
# Average score by course
avg_by_course = df.groupby('course')['score'].mean()
# Multiple aggregations
summary = df.groupby('course').agg(
avg_score=('score', 'mean'),
max_score=('score', 'max'),
student_count=('name', 'count'),
pass_rate=('passed', 'mean')
)
print(summary)
avg_score max_score student_count pass_rate
course
DS 90.0 95 2 1.0
ML 88.5 92 2 0.5
Python 82.0 85 3 0.67
Sorting
df_sorted = df.sort_values('score', ascending=False)
df_sorted = df.sort_values(['course', 'score'], ascending=[True, False])
Merging DataFrames
Like SQL joins:
students = pd.DataFrame({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Carol']})
scores = pd.DataFrame({'id': [1, 2, 4], 'score': [85, 92, 78]})
# Inner join — only matching rows
merged = pd.merge(students, scores, on='id', how='inner')
# Left join — all students, NaN if no score
merged = pd.merge(students, scores, on='id', how='left')
Exporting Results
df.to_csv('results.csv', index=False)
df.to_excel('results.xlsx', index=False)
df.to_json('results.json', orient='records')
A Complete Mini-Analysis
import pandas as pd
# Load
df = pd.read_csv('students.csv')
# Clean
df = df.dropna(subset=['score'])
df['score'] = df['score'].astype(float)
df.columns = df.columns.str.lower()
# Transform
df['grade'] = pd.cut(df['score'],
bins=[0, 60, 70, 80, 90, 100],
labels=['F', 'D', 'C', 'B', 'A'])
# Summarize
summary = df.groupby('course').agg(
students=('name', 'count'),
avg_score=('score', 'mean'),
top_score=('score', 'max')
).round(1)
print(summary)
summary.to_csv('course_summary.csv')
This pattern — load, clean, transform, summarize, export — is the backbone of 80% of data analysis work.
What to Learn Next
- Matplotlib & Seaborn — visualize your DataFrames
- NumPy — the array library underneath Pandas
- Handling time series data — Pandas has excellent datetime tools
- Pivot tables —
df.pivot_table()for cross-tabulation
Pandas fluency comes from practice. Take a real dataset (Kaggle has thousands of free ones), load it with Pandas, and work through these operations on data you actually find interesting.