Indexing

Database indexes are specialized data structures that function as roadmaps to your data, dramatically improving the speed of data retrieval operations. Think of an index like the index at the back of a book - instead of flipping through every page to find a topic, you can quickly locate the exact page number you need.

However, indexes are not without cost. They consume additional storage space and can impact the performance of write operations since the database must maintain these structures whenever data changes. Understanding when and how to use indexes effectively is crucial for database optimization.

1. Types of Database Indexes

1.1 B-tree Indexes: The Workhorses of Database Systems

B-tree (Balanced Tree) indexes represent the most widely used indexing method in relational databases. Their popularity stems from their versatility and efficiency across a broad range of query types.

Structure and Design Philosophy

B-tree indexes maintain a hierarchical, balanced structure where each node can contain multiple keys and pointers to child nodes. The “balanced” aspect is crucial - it ensures that the path from root to any leaf node is the same length, guaranteeing consistent performance regardless of which data you’re accessing.

Here’s how a typical B-tree structure looks:

                    [50]
                  /      \
              [25]        [75]
             /    \      /    \
         [10]     [30] [60]   [80]
        /  \     /  \ /  \   /  \
      [5] [15] [28][35][55][65][78][85]

Each internal node acts as a decision point, directing searches toward the appropriate subtree. Leaf nodes contain the actual data or pointers to the data rows.

Operational Characteristics

  • Range Queries Excellence: B-trees excel at range queries because of their sorted nature. When you execute a query like WHERE age BETWEEN 25 AND 35, the database can quickly navigate to the starting point (25) and then sequentially read through the sorted structure until it reaches the endpoint (35).
  • Equality Lookups: For exact matches, B-trees provide logarithmic time complexity O(log N), which means even with millions of records, you typically need only a handful of comparisons to find your target.
  • Sequential Access: Since leaf nodes are often linked together, B-trees support efficient sequential scanning for operations like ORDER BY clauses.

SQL Implementation Example

-- Creating a B-tree index (default type in most databases)
CREATE INDEX idx_customer_name ON Customers (name);

-- This index will accelerate queries like:
SELECT * FROM Customers WHERE name = 'John Smith';
SELECT * FROM Customers WHERE name BETWEEN 'A' AND 'M';
SELECT * FROM Customers ORDER BY name;

1.2 Hash Indexes: Speed Champions for Exact Matches

Hash indexes take a fundamentally different approach, using mathematical hash functions to directly map keys to storage locations. This direct mapping enables constant-time complexity O(1) for equality searches - theoretically the fastest possible lookup time.

Structure and Hash Function Mechanics

The hash index structure is elegantly simple:

Key Input → Hash Function → Bucket Location
─────────────────────────────────────────
"john@email.com" → hash() → Bucket 847
"mary@email.com" → hash() → Bucket 231  
"bob@email.com"  → hash() → Bucket 592

The hash function takes your search key and mathematically transforms it into a bucket number. The quality of the hash function is critical - it should distribute keys evenly across buckets to avoid clustering and maintain performance.

Operational Characteristics

  • Lightning-Fast Equality: Hash indexes shine when you need exact matches. Looking up a user by email address or finding a product by SKU are perfect use cases.
  • No Range Support: The hash function destroys any natural ordering of the data. You cannot use hash indexes for range queries, sorting, or pattern matching.
  • Collision Handling: When two different keys hash to the same bucket (a collision), the database must implement collision resolution strategies, which can slightly impact performance.

SQL Implementation Example

-- Creating a hash index in PostgreSQL
CREATE INDEX idx_user_email_hash ON Users USING hash (email);

-- This index will accelerate:
SELECT * FROM Users WHERE email = 'user@example.com';

-- But NOT these queries:
SELECT * FROM Users WHERE email LIKE 'user%';
SELECT * FROM Users ORDER BY email;

Comparative Analysis: B-tree vs Hash Indexes

┌──────────────────┬─────────────────────┬─────────────────────┐
│     Feature      │    B-tree Indexes   │    Hash Indexes     │
├──────────────────┼─────────────────────┼─────────────────────┤
│ Suitable Queries │ Range and equality  │ Equality only       │
│ Time Complexity  │ Logarithmic O(logN) │ Constant O(1)       │
│ Data Ordering    │ Yes (sorted)        │ No (unordered)      │
│ Storage Overhead │ Higher (tree nodes) │ Lower (hash table)  │
│ Range Queries    │ Excellent           │ Not supported       │
│ Exact Lookups    │ Very good           │ Excellent           │
│ Memory Usage     │ Moderate to high    │ Generally lower     │
│ Maintenance Cost │ Moderate            │ Lower               │
└──────────────────┴─────────────────────┴─────────────────────┘

2. Performance Trade-offs and Strategic Considerations

2.1 The Read vs Write Performance Balancing Act

Read Performance Benefits

Indexes transform database queries from linear scans to targeted lookups. Consider a table with 1 million customer records:

  • Without Index: Finding a customer by name requires scanning all 1 million records (worst case)
  • With B-tree Index: Finding the same customer requires approximately 20 comparisons (log₂ 1,000,000 ≈ 20)

This represents a 50,000x performance improvement in the worst-case scenario.

Write Performance Costs

Every index comes with a maintenance burden. When you insert, update, or delete data:

  1. INSERT Operations: The database must update every relevant index, potentially splitting B-tree nodes or rehashing in hash indexes
  2. UPDATE Operations: If indexed columns change, the database must remove old index entries and create new ones
  3. DELETE Operations: Index entries must be removed, and tree structures may need rebalancing

Storage Overhead Reality

Indexes are not free in terms of storage:

  • A B-tree index typically consumes 15-25% additional space relative to the indexed data
  • Hash indexes generally require less space, but still represent overhead
  • Composite indexes (covering multiple columns) can be substantially larger

2.2 Strategic Index Selection Guidelines

Choose B-tree Indexes When:

Range Queries Dominate: If your application frequently uses BETWEEN, <, >, or ORDER BY clauses

-- These queries benefit from B-tree indexes
SELECT * FROM Orders WHERE order_date BETWEEN '2024-01-01' AND '2024-12-31';
SELECT * FROM Products WHERE price > 100 ORDER BY price;

Sorting Requirements: When query results need ordered output Partial String Matching: For LIKE queries with leading characters

SELECT * FROM Customers WHERE name LIKE 'Smith%';

Choose Hash Indexes When:

Equality-Heavy Workloads: When most queries are exact lookups

-- Perfect for hash indexes
SELECT * FROM Users WHERE user_id = 12345;
SELECT * FROM Sessions WHERE session_token = 'abc123def456';

Memory Constraints: When storage space is at a premium High-Volume Lookups: In scenarios requiring maximum lookup speed for exact matches

2.3 Advanced Index Optimization Strategies

Composite Index Design

Composite indexes cover multiple columns and can dramatically improve multi-condition queries:

-- Single composite index can serve multiple query patterns
CREATE INDEX idx_orders_customer_date ON Orders (customer_id, order_date);

-- Efficiently serves these queries:
SELECT * FROM Orders WHERE customer_id = 123;
SELECT * FROM Orders WHERE customer_id = 123 AND order_date = '2024-01-15';
SELECT * FROM Orders WHERE customer_id = 123 AND order_date > '2024-01-01';

Column Order Matters: Place the most selective columns first in composite indexes.

Index Maintenance Best Practices

Regular Analysis: Use database-specific tools to analyze index usage:

-- PostgreSQL example
SELECT schemaname, tablename, indexname, idx_scan, idx_tup_read, idx_tup_fetch 
FROM pg_stat_user_indexes;

Rebuilding Indexes: Periodically rebuild indexes to eliminate fragmentation:

-- PostgreSQL
REINDEX INDEX idx_customer_name;

-- SQL Server
ALTER INDEX idx_customer_name ON Customers REBUILD;

Selective Indexing Strategy: Focus on columns that appear frequently in WHERE clauses, JOIN conditions, and ORDER BY clauses.

3. Common Pitfalls and How to Avoid Them

Over-Indexing Syndrome

Creating indexes on every column seems safe, but leads to:

  • Excessive memory consumption
  • Slower write operations
  • Increased maintenance overhead
  • Storage bloat

Solution: Analyze actual query patterns and create indexes only for frequently used access patterns.

Under-Indexing Problems

Failing to index properly results in:

  • Full table scans on large datasets
  • Poor application response times
  • Inefficient resource utilization

Solution: Monitor slow query logs and identify missing index opportunities.

Ignoring Query Pattern Analysis

Creating indexes without understanding how your application queries data leads to:

  • Unused indexes consume resources
  • Missing indexes for common query patterns
  • Suboptimal performance

Solution: Use query analysis tools and examine actual application query patterns before designing your indexing strategy.

Track your progress

Mark this subtopic as completed when you finish reading.