Data Science

Pandas for Data Analysis: The Practical Guide

Master the essential Pandas operations for real data work — loading, cleaning, transforming, and summarizing data with practical examples.

JRCodex··5 min read

Why Pandas

Before Pandas, data analysis in Python was painful. NumPy arrays don't handle mixed types or missing values well. Pandas solved both problems with the DataFrame — a two-dimensional table that handles the messy reality of real data.

Pandas is the first library any data scientist installs. Virtually every dataset you'll work with starts with pd.read_csv().

Core Concepts

Series

A one-dimensional labeled array:

import pandas as pd

scores = pd.Series([85, 92, 78, 95, 88], 
                   index=['Alice', 'Bob', 'Carol', 'Dave', 'Eve'])
print(scores['Bob'])   # 92
print(scores.mean())   # 87.6

DataFrame

A two-dimensional table — the workhorse of Pandas:

data = {
    'name': ['Alice', 'Bob', 'Carol', 'Dave'],
    'score': [85, 92, 78, 95],
    'course': ['Python', 'ML', 'Python', 'DS'],
    'passed': [True, True, False, True]
}

df = pd.DataFrame(data)
print(df.head())
    name  score  course  passed
0  Alice     85  Python    True
1    Bob     92      ML    True
2  Carol     78  Python   False
3   Dave     95      DS    True

Loading Data

# CSV — most common
df = pd.read_csv('students.csv')

# Excel
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')

# JSON
df = pd.read_json('data.json')

# From a URL
df = pd.read_csv('https://example.com/data.csv')

Exploring Your Data

Always start by understanding what you have:

print(df.shape)          # (rows, columns)
print(df.dtypes)         # data types per column
print(df.info())         # full summary
print(df.describe())     # statistics for numeric columns
print(df.head(10))       # first 10 rows
print(df.isnull().sum()) # missing values per column

describe() is particularly useful — it shows count, mean, std, min, quartiles, and max for every numeric column in one call.

Selecting Data

Selecting Columns

# Single column → Series
names = df['name']

# Multiple columns → DataFrame
subset = df[['name', 'score']]

Selecting Rows

# By index
first_row = df.iloc[0]         # first row (integer position)
first_five = df.iloc[:5]       # first 5 rows

# By label
row = df.loc[0]                # row with index label 0

Filtering Rows

This is where Pandas shines:

# Students who passed
passed = df[df['passed'] == True]

# Scores above 80
high_scores = df[df['score'] > 80]

# Python course students who passed
python_pass = df[(df['course'] == 'Python') & (df['passed'] == True)]

# Multiple values with isin
top_courses = df[df['course'].isin(['Python', 'ML'])]

Data Cleaning

Real data is messy. Pandas gives you the tools to fix it.

Handling Missing Values

# Check missing values
print(df.isnull().sum())

# Drop rows with any missing values
df_clean = df.dropna()

# Drop columns with too many missing values
df_clean = df.dropna(axis=1, thresh=len(df) * 0.8)

# Fill missing values
df['score'].fillna(df['score'].mean(), inplace=True)  # fill with mean
df['course'].fillna('Unknown', inplace=True)            # fill with value

Handling Duplicates

print(df.duplicated().sum())  # count duplicates
df = df.drop_duplicates()      # remove duplicates

Type Conversion

df['score'] = df['score'].astype(float)
df['date'] = pd.to_datetime(df['date'])
df['category'] = df['category'].astype('category')  # memory efficient

Renaming Columns

df.rename(columns={'name': 'student_name', 'score': 'test_score'}, inplace=True)

# Or standardize all at once
df.columns = df.columns.str.lower().str.replace(' ', '_')

Transforming Data

Adding Columns

# Derived column
df['grade'] = df['score'].apply(lambda x: 'A' if x >= 90 else 'B' if x >= 80 else 'C')

# Vectorized operations (faster than apply for numeric)
df['score_normalized'] = (df['score'] - df['score'].min()) / (df['score'].max() - df['score'].min())

Groupby: The Power Tool

GroupBy is Pandas' most powerful feature for summarization:

# Average score by course
avg_by_course = df.groupby('course')['score'].mean()

# Multiple aggregations
summary = df.groupby('course').agg(
    avg_score=('score', 'mean'),
    max_score=('score', 'max'),
    student_count=('name', 'count'),
    pass_rate=('passed', 'mean')
)

print(summary)
          avg_score  max_score  student_count  pass_rate
course                                                   
DS             90.0         95              2        1.0
ML             88.5         92              2        0.5
Python         82.0         85              3        0.67

Sorting

df_sorted = df.sort_values('score', ascending=False)
df_sorted = df.sort_values(['course', 'score'], ascending=[True, False])

Merging DataFrames

Like SQL joins:

students = pd.DataFrame({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Carol']})
scores = pd.DataFrame({'id': [1, 2, 4], 'score': [85, 92, 78]})

# Inner join — only matching rows
merged = pd.merge(students, scores, on='id', how='inner')

# Left join — all students, NaN if no score
merged = pd.merge(students, scores, on='id', how='left')

Exporting Results

df.to_csv('results.csv', index=False)
df.to_excel('results.xlsx', index=False)
df.to_json('results.json', orient='records')

A Complete Mini-Analysis

import pandas as pd

# Load
df = pd.read_csv('students.csv')

# Clean
df = df.dropna(subset=['score'])
df['score'] = df['score'].astype(float)
df.columns = df.columns.str.lower()

# Transform
df['grade'] = pd.cut(df['score'], 
                      bins=[0, 60, 70, 80, 90, 100], 
                      labels=['F', 'D', 'C', 'B', 'A'])

# Summarize
summary = df.groupby('course').agg(
    students=('name', 'count'),
    avg_score=('score', 'mean'),
    top_score=('score', 'max')
).round(1)

print(summary)
summary.to_csv('course_summary.csv')

This pattern — load, clean, transform, summarize, export — is the backbone of 80% of data analysis work.

What to Learn Next

  • Matplotlib & Seaborn — visualize your DataFrames
  • NumPy — the array library underneath Pandas
  • Handling time series data — Pandas has excellent datetime tools
  • Pivot tablesdf.pivot_table() for cross-tabulation

Pandas fluency comes from practice. Take a real dataset (Kaggle has thousands of free ones), load it with Pandas, and work through these operations on data you actually find interesting.

pandasdata-sciencepythonintermediate