Pandas for Data Analysis: The Practical Guide

Why Pandas

Before Pandas, data analysis in Python was painful. NumPy arrays don't handle mixed types or missing values well. Pandas solved both problems with the DataFrame — a two-dimensional table that handles the messy reality of real data.

Pandas is the first library any data scientist installs. Virtually every dataset you'll work with starts with pd.read_csv().

Core Concepts

Series

A one-dimensional labeled array:

import pandas as pd

scores = pd.Series([85, 92, 78, 95, 88], 
                   index=['Alice', 'Bob', 'Carol', 'Dave', 'Eve'])
print(scores['Bob'])   # 92
print(scores.mean())   # 87.6

DataFrame

A two-dimensional table — the workhorse of Pandas:

data = {
    'name': ['Alice', 'Bob', 'Carol', 'Dave'],
    'score': [85, 92, 78, 95],
    'course': ['Python', 'ML', 'Python', 'DS'],
    'passed': [True, True, False, True]
}

df = pd.DataFrame(data)
print(df.head())

    name  score  course  passed
0  Alice     85  Python    True
1    Bob     92      ML    True
2  Carol     78  Python   False
3   Dave     95      DS    True

Loading Data

# CSV — most common
df = pd.read_csv('students.csv')

# Excel
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')

# JSON
df = pd.read_json('data.json')

# From a URL
df = pd.read_csv('https://example.com/data.csv')

Exploring Your Data

Always start by understanding what you have:

print(df.shape)          # (rows, columns)
print(df.dtypes)         # data types per column
print(df.info())         # full summary
print(df.describe())     # statistics for numeric columns
print(df.head(10))       # first 10 rows
print(df.isnull().sum()) # missing values per column

describe() is particularly useful — it shows count, mean, std, min, quartiles, and max for every numeric column in one call.

Selecting Data

Selecting Columns

# Single column → Series
names = df['name']

# Multiple columns → DataFrame
subset = df[['name', 'score']]

Selecting Rows

# By index
first_row = df.iloc[0]         # first row (integer position)
first_five = df.iloc[:5]       # first 5 rows

# By label
row = df.loc[0]                # row with index label 0

Filtering Rows

This is where Pandas shines:

# Students who passed
passed = df[df['passed'] == True]

# Scores above 80
high_scores = df[df['score'] > 80]

# Python course students who passed
python_pass = df[(df['course'] == 'Python') & (df['passed'] == True)]

# Multiple values with isin
top_courses = df[df['course'].isin(['Python', 'ML'])]

Data Cleaning

Real data is messy. Pandas gives you the tools to fix it.

Handling Missing Values

# Check missing values
print(df.isnull().sum())

# Drop rows with any missing values
df_clean = df.dropna()

# Drop columns with too many missing values
df_clean = df.dropna(axis=1, thresh=len(df) * 0.8)

# Fill missing values
df['score'].fillna(df['score'].mean(), inplace=True)  # fill with mean
df['course'].fillna('Unknown', inplace=True)            # fill with value

Handling Duplicates

print(df.duplicated().sum())  # count duplicates
df = df.drop_duplicates()      # remove duplicates

Type Conversion

df['score'] = df['score'].astype(float)
df['date'] = pd.to_datetime(df['date'])
df['category'] = df['category'].astype('category')  # memory efficient

Renaming Columns

df.rename(columns={'name': 'student_name', 'score': 'test_score'}, inplace=True)

# Or standardize all at once
df.columns = df.columns.str.lower().str.replace(' ', '_')

Transforming Data

Adding Columns

# Derived column
df['grade'] = df['score'].apply(lambda x: 'A' if x >= 90 else 'B' if x >= 80 else 'C')

# Vectorized operations (faster than apply for numeric)
df['score_normalized'] = (df['score'] - df['score'].min()) / (df['score'].max() - df['score'].min())

Groupby: The Power Tool

GroupBy is Pandas' most powerful feature for summarization:

# Average score by course
avg_by_course = df.groupby('course')['score'].mean()

# Multiple aggregations
summary = df.groupby('course').agg(
    avg_score=('score', 'mean'),
    max_score=('score', 'max'),
    student_count=('name', 'count'),
    pass_rate=('passed', 'mean')
)

print(summary)

          avg_score  max_score  student_count  pass_rate
course                                                   
DS             90.0         95              2        1.0
ML             88.5         92              2        0.5
Python         82.0         85              3        0.67

Sorting

df_sorted = df.sort_values('score', ascending=False)
df_sorted = df.sort_values(['course', 'score'], ascending=[True, False])

Merging DataFrames

Like SQL joins:

students = pd.DataFrame({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Carol']})
scores = pd.DataFrame({'id': [1, 2, 4], 'score': [85, 92, 78]})

# Inner join — only matching rows
merged = pd.merge(students, scores, on='id', how='inner')

# Left join — all students, NaN if no score
merged = pd.merge(students, scores, on='id', how='left')

Exporting Results

df.to_csv('results.csv', index=False)
df.to_excel('results.xlsx', index=False)
df.to_json('results.json', orient='records')

A Complete Mini-Analysis

import pandas as pd

# Load
df = pd.read_csv('students.csv')

# Clean
df = df.dropna(subset=['score'])
df['score'] = df['score'].astype(float)
df.columns = df.columns.str.lower()

# Transform
df['grade'] = pd.cut(df['score'], 
                      bins=[0, 60, 70, 80, 90, 100], 
                      labels=['F', 'D', 'C', 'B', 'A'])

# Summarize
summary = df.groupby('course').agg(
    students=('name', 'count'),
    avg_score=('score', 'mean'),
    top_score=('score', 'max')
).round(1)

print(summary)
summary.to_csv('course_summary.csv')

This pattern — load, clean, transform, summarize, export — is the backbone of 80% of data analysis work.

What to Learn Next

Matplotlib & Seaborn — visualize your DataFrames
NumPy — the array library underneath Pandas
Handling time series data — Pandas has excellent datetime tools
Pivot tables — df.pivot_table() for cross-tabulation

Pandas fluency comes from practice. Take a real dataset (Kaggle has thousands of free ones), load it with Pandas, and work through these operations on data you actually find interesting.