Indexes

An index in a database is a data structure that improves the speed of data retrieval operations on a database table. Think of it like an index in a book that helps you quickly find specific topics without reading every page. Similarly, a database index creates shortcuts to locate specific rows without scanning the entire table.

When you create an index on a column, the database engine builds a separate structure that maintains a sorted reference to the actual data rows. This structure allows the database to quickly locate rows that match your query conditions, dramatically reducing the time needed for data retrieval.
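The idea of a separate sorted structure that points back at the rows can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the `rows` list, the column names, and the helper functions are hypothetical, and a real engine uses on-disk pages rather than in-memory lists.

```python
import bisect

# Hypothetical table: rows are stored in arrival order, not sorted by key.
rows = [
    {"employee_id": 42, "last_name": "Ng"},
    {"employee_id": 7, "last_name": "Silva"},
    {"employee_id": 19, "last_name": "Okafor"},
    {"employee_id": 3, "last_name": "Haddad"},
]

def full_scan(rows, employee_id):
    """Check rows one by one; return (row, number of rows examined)."""
    for examined, row in enumerate(rows, start=1):
        if row["employee_id"] == employee_id:
            return row, examined
    return None, len(rows)

# The "index": sorted (key, row position) pairs kept apart from the table.
index = sorted((row["employee_id"], pos) for pos, row in enumerate(rows))
index_keys = [key for key, _ in index]

def index_lookup(rows, employee_id):
    """Binary-search the sorted keys, then follow the stored position to the row."""
    i = bisect.bisect_left(index_keys, employee_id)
    if i < len(index_keys) and index_keys[i] == employee_id:
        return rows[index[i][1]]
    return None

print(full_scan(rows, 3))
print(index_lookup(rows, 3))
```

The scan has to examine every row to find employee 3 because it happens to be stored last, while the index lookup needs only a binary search plus one jump to the stored position.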

Types of Database Indexes

Clustered Indexes

A clustered index determines the physical storage order of data rows in a table. The table data is stored in the order of the clustered index key. Each table can have only one clustered index because the data rows can be stored in only one order.

When you create a clustered index on a column like employee_id, the database physically arranges all rows in the table based on the values in that column. This means that employees with similar IDs are stored close to each other on disk, making range queries extremely efficient.

-- Creating a clustered index on employee_id (SQL Server syntax)
CREATE CLUSTERED INDEX IX_Employee_ID ON Employees(employee_id);

The primary advantage of clustered indexes is that they eliminate the need for additional lookups when retrieving data. Once the index finds the correct location, the actual data is right there because the index and the table data are essentially the same structure.
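A rough way to see this behavior without a SQL Server instance is Python's built-in sqlite3 module. SQLite has no CREATE CLUSTERED INDEX statement, so this sketch assumes its closest equivalent: a table whose INTEGER PRIMARY KEY is the key of the table's own B-tree. The table and column names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: in SQLite, an INTEGER PRIMARY KEY column is the key of the
# table's own B-tree, so the rows themselves are kept in employee_id order.
conn.execute(
    "CREATE TABLE employees (employee_id INTEGER PRIMARY KEY, last_name TEXT)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [(i, f"name_{i}") for i in range(100, 0, -1)],  # inserted in reverse order
)

range_query = (
    "SELECT employee_id, last_name FROM employees "
    "WHERE employee_id BETWEEN 10 AND 20"
)
# Column 3 of EXPLAIN QUERY PLAN output holds the readable plan description.
plan = [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + range_query)]
matches = conn.execute(range_query).fetchall()

print(plan)
print(len(matches))
```

The plan should report a search on the primary key rather than a lookup through a secondary index, because the rows live in the same structure that is being searched.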

Non-Clustered Indexes

Non-clustered indexes create a separate structure that contains the indexed column values and pointers to the actual data rows. Unlike clustered indexes, you can create multiple non-clustered indexes on a single table.

Think of a non-clustered index like a library catalog system. The catalog tells you where to find a book, but you still need to go to that location to get the actual book. Similarly, a non-clustered index tells the database engine where to find the row, but it requires an additional step to retrieve the actual data.

-- Creating a non-clustered index on last_name
CREATE NONCLUSTERED INDEX IX_Employee_LastName ON Employees(last_name);

Unique Indexes

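Unique indexes ensure that no duplicate values exist in the indexed columns while also providing the performance benefits of regular indexes. When you create a primary key constraint, the database automatically creates a unique index to enforce it. In SQL Server, that index is clustered by default unless the table already has a clustered index or you specify NONCLUSTERED.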

-- Creating a unique index on email addresses
CREATE UNIQUE INDEX IX_Employee_Email ON Employees(email);

This index serves dual purposes: it prevents duplicate email addresses from being inserted and provides fast lookups when searching by email.
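The duplicate-rejection half of that behavior can be tried out with Python's sqlite3 module. This is a minimal sketch, assuming an in-memory SQLite database in place of the SQL Server table above; the `try_insert` helper and the sample addresses are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (employee_id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("CREATE UNIQUE INDEX ix_employee_email ON employees(email)")
conn.execute("INSERT INTO employees (email) VALUES ('ana@example.com')")

def try_insert(email):
    """Return True if the row was inserted, False if the unique index rejected it."""
    try:
        conn.execute("INSERT INTO employees (email) VALUES (?)", (email,))
        return True
    except sqlite3.IntegrityError:
        return False

duplicate_accepted = try_insert("ana@example.com")  # same address again
new_accepted = try_insert("bo@example.com")         # a new address
row_count = conn.execute("SELECT COUNT(*) FROM employees").fetchone()[0]

print(duplicate_accepted, new_accepted, row_count)
```

The second insert of the same address fails with an integrity error, so the table ends up with two rows rather than three.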

Composite Indexes

Composite indexes span multiple columns and are particularly useful when your queries frequently filter or sort by multiple columns together. The order of columns in a composite index matters significantly for query performance.

-- Creating a composite index on department and hire_date
CREATE INDEX IX_Employee_Dept_HireDate ON Employees(department_id, hire_date);

This composite index is most effective for queries that filter by department_id, or by both department_id and hire_date. It’s less effective for queries that only filter by hire_date because that column isn’t the leading column in the index.
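The leading-column rule can be observed by asking a database for its query plan. The sketch below assumes SQLite (through Python's sqlite3 module) as a stand-in engine; other optimizers may make different choices, and the exact plan wording varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees ("
    "employee_id INTEGER PRIMARY KEY, last_name TEXT, "
    "department_id INTEGER, hire_date TEXT)"
)
conn.execute(
    "CREATE INDEX ix_employee_dept_hiredate ON employees(department_id, hire_date)"
)

def plan_for(sql):
    """Return the readable detail column of SQLite's EXPLAIN QUERY PLAN output."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

leading_only = plan_for("SELECT * FROM employees WHERE department_id = 5")
both_columns = plan_for(
    "SELECT * FROM employees "
    "WHERE department_id = 5 AND hire_date >= '2020-01-01'"
)
trailing_only = plan_for("SELECT * FROM employees WHERE hire_date >= '2020-01-01'")

for label, plan in [
    ("department_id only", leading_only),
    ("department_id and hire_date", both_columns),
    ("hire_date only", trailing_only),
]:
    print(label, "->", plan)
```

The first two queries should be answered with a search on the composite index, while the query on hire_date alone falls back to scanning the table.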

How Indexes Work Internally

B-Tree Structure

Most database indexes use a B-tree (balanced tree) structure. A B-tree is a self-balancing tree data structure that maintains sorted data and allows searches, insertions, and deletions in logarithmic time.

In a B-tree index, data is organized in a hierarchy of pages. The root page contains key values that divide the data into ranges, and each range points to an intermediate page. This continues until you reach the leaf pages, which contain the actual index entries.

When you search for a specific value, the database starts at the root page and follows the appropriate path down through the intermediate pages until it reaches the leaf page containing your target value. This process typically requires only a few page reads, even for tables with millions of rows.
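The root-to-leaf walk can be sketched with a tiny hand-built tree. This is an illustration rather than a real engine's page format: the `Page` class, the sample keys, and the fanout of 100 used in `levels_needed` are all assumptions made for the example.

```python
from bisect import bisect_right
from dataclasses import dataclass, field

@dataclass
class Page:
    keys: list
    children: list = field(default_factory=list)  # empty for leaf pages

    @property
    def is_leaf(self):
        return not self.children

def search(root, key):
    """Walk from the root to a leaf; return (found, pages_read)."""
    page, pages_read = root, 1
    while not page.is_leaf:
        # Separator keys split the key space: child 0 holds keys below keys[0],
        # child i holds keys from keys[i-1] up to (not including) keys[i].
        page = page.children[bisect_right(page.keys, key)]
        pages_read += 1
    return key in page.keys, pages_read

def levels_needed(row_count, fanout):
    """Smallest number of levels for which fanout ** levels covers row_count."""
    levels, capacity = 1, fanout
    while capacity < row_count:
        capacity *= fanout
        levels += 1
    return levels

# A three-level tree: one root, two intermediate pages, four leaf pages.
leaves = [Page([1, 4, 7]), Page([10, 13, 16]), Page([20, 25, 30]), Page([40, 50, 60])]
root = Page([20], [Page([10], leaves[:2]), Page([40], leaves[2:])])

print(search(root, 25))
print(search(root, 11))
print(levels_needed(1_000_000, 100))
```

Every lookup in this tree reads exactly three pages, whether or not the key exists. With the assumed fanout of 100 entries per page, `levels_needed` shows that three levels are enough to cover a million entries, which is why the number of page reads stays small as tables grow.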

Index Selectivity

Index selectivity refers to how well an index can narrow down the search results. High selectivity means the index can eliminate most rows from consideration, while low selectivity means the index doesn’t help much in filtering data.

For example, an index on gender in an employee table has low selectivity because there are typically only two or three distinct values. An index on employee_id has high selectivity because each value is unique. Indexes with higher selectivity generally provide better performance improvements.
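One common way to put a number on selectivity is the ratio of distinct values to total rows. The sketch below uses that definition with made-up sample columns; the `selectivity` helper and the data are hypothetical.

```python
def selectivity(values):
    """Distinct values divided by total rows; values near 1.0 are highly selective."""
    return len(set(values)) / len(values)

# Hypothetical columns from a 1,000-row employee table.
employee_ids = list(range(1, 1001))        # every value is unique
genders = ["F", "M", "X"] * 333 + ["F"]    # only three distinct values

print(selectivity(employee_ids))
print(selectivity(genders))
```

Under this definition the employee_id column scores 1.0, while the gender column scores 0.003, meaning an equality filter on it still leaves roughly a third of the table to read.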

Performance Optimization Strategies

Query Execution Plans

Understanding query execution plans is crucial for performance optimization. An execution plan shows you exactly how the database engine processes your query, including which indexes it uses, how tables are joined, and where potential bottlenecks exist.

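-- Viewing the execution plan for a query
-- (EXPLAIN is MySQL/PostgreSQL syntax; SQL Server uses SET SHOWPLAN_ALL ON or SET SHOWPLAN_XML ON instead)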
EXPLAIN SELECT * FROM Employees WHERE department_id = 5 AND salary > 50000;

The execution plan reveals whether your query is performing table scans (reading every row) or index seeks (navigating the index directly to the matching entries). Index scans indicate that the database is reading through an entire index, which is usually cheaper than a table scan because the index is narrower than the table, but not as efficient as an index seek.
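To see a plan change from a scan to a seek, the query can be explained before and after an index exists. The sketch below assumes SQLite via Python's sqlite3 module, where the command is EXPLAIN QUERY PLAN and a seek is reported as SEARCH; the output format differs from MySQL, PostgreSQL, and SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees ("
    "employee_id INTEGER PRIMARY KEY, department_id INTEGER, salary INTEGER)"
)
query = "SELECT * FROM employees WHERE department_id = 5 AND salary > 50000"

def plan_details(sql):
    """Return the readable detail column of SQLite's EXPLAIN QUERY PLAN output."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

plan_before = plan_details(query)  # no usable index yet
conn.execute("CREATE INDEX ix_employee_dept ON employees(department_id)")
plan_after = plan_details(query)   # the optimizer can now seek on department_id

print("before:", plan_before)
print("after: ", plan_after)
```

Comparing the two plans is the same workflow you would follow on a production query: read the plan, add or adjust an index, and confirm that the scan has become a search.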

Index Maintenance Considerations

Indexes require ongoing maintenance that impacts database performance. Every time you insert, update, or delete data, the database must also update all relevant indexes. This creates additional overhead for write operations.

Index fragmentation occurs over time as data is modified. When you insert new rows or update existing ones, the index pages may not have enough space to accommodate the changes in their ideal locations. This leads to page splits and fragmentation, which can degrade query performance.
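Page splits can be simulated with a toy model of leaf pages. This is a deliberately simplified sketch: the four-entry page capacity, the helper names, and the rebuild that packs pages completely full are assumptions for illustration, and real engines typically leave some free space per page according to a configurable fill factor.

```python
import bisect
import random

PAGE_CAPACITY = 4  # hypothetical: index entries that fit on one leaf page

def insert_key(pages, key):
    """Insert key into a list of sorted leaf pages, splitting a page that overflows."""
    if not pages:
        pages.append([key])
        return
    # Pick the page whose key range should hold this key.
    first_keys = [page[0] for page in pages]
    i = max(bisect.bisect_right(first_keys, key) - 1, 0)
    bisect.insort(pages[i], key)
    if len(pages[i]) > PAGE_CAPACITY:
        # Page split: move the upper half to a new page, leaving both partly empty.
        mid = len(pages[i]) // 2
        pages[i:i + 1] = [pages[i][:mid], pages[i][mid:]]

def rebuild(pages):
    """Rewrite all entries in key order into completely full pages."""
    keys = [key for page in pages for key in page]
    return [keys[i:i + PAGE_CAPACITY] for i in range(0, len(keys), PAGE_CAPACITY)]

def average_fill(pages):
    """Fraction of the available page space that actually holds entries."""
    return sum(len(page) for page in pages) / (len(pages) * PAGE_CAPACITY)

keys = list(range(200))
random.Random(0).shuffle(keys)  # fixed seed so runs are repeatable
fragmented = []
for key in keys:
    insert_key(fragmented, key)
rebuilt = rebuild(fragmented)

print(len(fragmented), round(average_fill(fragmented), 2))
print(len(rebuilt), average_fill(rebuilt))
```

After random-order inserts the entries are usually spread over more, partly empty pages than the 50 a rebuild needs, so a range scan has to read more pages for the same data.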

Regular index maintenance involves rebuilding or reorganizing fragmented indexes. Rebuilding completely recreates the index structure, while reorganizing defragments the existing structure. The choice between these options depends on the level of fragmentation and available maintenance windows.

Statistics and Query Optimization

Database engines maintain statistics about the distribution of data in your tables and indexes. These statistics help the query optimizer choose the most efficient execution plan for your queries.

Outdated statistics can lead to poor query plans. For example, if the optimizer thinks a table has only 1000 rows when it actually has 1 million, it might choose a nested loop join instead of a more appropriate hash join. Keeping statistics current through regular updates is essential for optimal performance.
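How statistics go stale can be observed in SQLite, used here as an assumed stand-in engine through Python's sqlite3 module. SQLite stores its statistics in a table named sqlite_stat1 and refreshes them only when ANALYZE runs; other engines use different commands and may refresh statistics automatically.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (employee_id INTEGER PRIMARY KEY, department_id INTEGER)"
)
conn.execute("CREATE INDEX ix_employee_dept ON employees(department_id)")
conn.executemany(
    "INSERT INTO employees (department_id) VALUES (?)",
    [(i % 4,) for i in range(100)],
)

def index_stats():
    """Return the statistics SQLite has stored for ix_employee_dept."""
    return conn.execute(
        "SELECT stat FROM sqlite_stat1 WHERE idx = 'ix_employee_dept'"
    ).fetchall()

conn.execute("ANALYZE")  # creates sqlite_stat1 and records the row counts
stats_after_first_analyze = index_stats()

# Grow the table tenfold: the stored statistics stay as they were.
conn.executemany(
    "INSERT INTO employees (department_id) VALUES (?)",
    [(i % 4,) for i in range(900)],
)
stale_stats = index_stats()

conn.execute("ANALYZE")  # refresh the statistics
fresh_stats = index_stats()

print(stats_after_first_analyze, stale_stats, fresh_stats)
```

The first number in each stat string is the row count the optimizer believes the index has. Until the second ANALYZE runs, the optimizer is still planning for a 100-row table even though it now holds 1,000 rows.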

Practical Index Design Guidelines

Column Selection Strategy

Choose index columns based on your query patterns. Columns used in WHERE clauses, JOIN conditions, and ORDER BY clauses are good candidates for indexing. However, avoid indexing every column because each additional index increases storage requirements and slows down write operations.

Consider the data distribution when selecting columns. Columns with many distinct values benefit more from indexing than columns with few distinct values. A column containing mostly null values may not be a good indexing candidate unless you frequently search for non-null values.

Covering Indexes

A covering index includes all columns needed to satisfy a query, eliminating the need to access the actual table data. This technique can dramatically improve query performance by reducing the number of page reads required.

-- Creating a covering index for a specific query pattern
-- (INCLUDE is supported by SQL Server and by PostgreSQL 11 and later)
CREATE INDEX IX_Employee_Covering ON Employees(department_id, salary)
INCLUDE (first_name, last_name, email);

This covering index can satisfy queries that filter by department_id and salary while returning first_name, last_name, and email without accessing the main table.
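Whether a query is actually covered shows up in the query plan. The sketch below assumes SQLite via Python's sqlite3 module; because SQLite has no INCLUDE clause, the extra column is appended as an ordinary key column, which is a substitute technique rather than the statement shown above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees ("
    "employee_id INTEGER PRIMARY KEY, department_id INTEGER, "
    "salary INTEGER, last_name TEXT, email TEXT)"
)
# Assumption: SQLite lacks INCLUDE, so last_name is added as a trailing key column.
conn.execute(
    "CREATE INDEX ix_employee_covering "
    "ON employees(department_id, salary, last_name)"
)

def plan_details(sql):
    """Return the readable detail column of SQLite's EXPLAIN QUERY PLAN output."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

# Every referenced column is in the index, so the table is never touched.
covered = plan_details(
    "SELECT last_name, salary FROM employees WHERE department_id = 5"
)
# email is not in the index, so each match needs an extra lookup in the table.
not_covered = plan_details(
    "SELECT last_name, email FROM employees WHERE department_id = 5"
)

print(covered)
print(not_covered)
```

SQLite labels the first plan as using a covering index, while the second uses the same index but must still visit the table for the email column.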

Index Intersection

Some database engines can use multiple single-column indexes together to satisfy a query through index intersection. While this provides flexibility, purpose-built composite indexes usually perform better than relying on index intersection.
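Conceptually, index intersection looks up each predicate in its own index and keeps only the row identifiers that appear in both results. The following is a minimal in-memory sketch of that idea; the sample rows, the `city` column, and the dictionary-based indexes are hypothetical and much simpler than what an engine does.

```python
from collections import defaultdict

# Hypothetical rows; "id" plays the role of the row identifier.
rows = [
    {"id": 1, "department_id": 5, "city": "Lagos"},
    {"id": 2, "department_id": 5, "city": "Lima"},
    {"id": 3, "department_id": 7, "city": "Lagos"},
    {"id": 4, "department_id": 5, "city": "Lagos"},
]

def build_index(rows, columns):
    """Map each combination of column values to the set of row ids holding it."""
    index = defaultdict(set)
    for row in rows:
        index[tuple(row[c] for c in columns)].add(row["id"])
    return index

# Two single-column indexes, combined at query time.
dept_index = build_index(rows, ["department_id"])
city_index = build_index(rows, ["city"])
intersected = dept_index.get((5,), set()) & city_index.get(("Lagos",), set())

# One composite index answers the same question with a single lookup.
composite_index = build_index(rows, ["department_id", "city"])
direct = composite_index.get((5, "Lagos"), set())

print(sorted(intersected), sorted(direct))
```

Both approaches return the same rows, but the intersection has to read every entry for department 5 and every entry for Lagos before discarding the non-overlapping ones, which is the extra work a composite index avoids.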

Understanding these indexing concepts and performance optimization techniques will help you design efficient database schemas and write queries that perform well even as your data grows. The key is balancing the performance benefits of indexes against their maintenance overhead and storage costs.
