Database Indexing Explained: B-Trees, Composite Indexes, and Query Optimization
A query that takes 4 seconds without an index takes 0.2ms with one. That's a 20,000x improvement from a single line of SQL. Indexes are the most impactful performance tool available to application developers — and the most misunderstood. Many developers add indexes reactively (when a query is already slow in production), never remove unused ones, and don't understand why index column order matters.
This article explains how B-tree indexes work internally, how to choose which columns to index, the rules for composite index column ordering, how to read an EXPLAIN plan to diagnose slow queries, and the four indexing mistakes that silently kill write performance.
How B-Tree Indexes Work
A B-tree (Balanced Tree) index is a self-balancing tree where every leaf node holds the indexed value and a pointer to the corresponding table row (the heap tuple in PostgreSQL). The tree keeps all leaves at the same depth, so lookups always cost O(log n) time regardless of table size. A table with 1 billion rows and a B-tree index needs at most 30 comparisons to find a row — compared to a full sequential scan of up to 1 billion rows.
When PostgreSQL evaluates a query, the query planner compares the cost of a sequential scan (read every page) against an index scan (follow the B-tree, then fetch specific pages). For highly selective queries (returning <5% of rows), index scans win. For queries returning most of the table, sequential scans win — reading 80% of the table via index causes more random I/O than a single sequential pass.
Quick reference
- B-tree: default index type, supports =, <, >, BETWEEN, LIKE 'prefix%', ORDER BY, and range queries.
- Hash index: only supports exact = equality. Faster for equality-only lookups, but no range queries.
- GIN (Generalized Inverted Index): for array, JSONB, and full-text search (any value → which rows contain it).
- GiST: for geometric types, full-text search (tsvector), and PostGIS spatial data.
- BRIN (Block Range Index): for naturally ordered data (timestamps). Tiny size, poor selectivity.
- Index overhead: every index adds a write cost on INSERT, UPDATE, DELETE. Each write must update the index tree.
Remember this
B-tree is the right default. Hash for pure equality, GIN for arrays/JSONB, GiST for spatial/text. Don't add index types you don't understand.
Which Columns to Index
Not every column needs an index. The question is: does this column appear in WHERE, JOIN ON, or ORDER BY clauses in queries that are hot (run frequently or on large tables)? High-cardinality columns — those with many distinct values — benefit most from indexing. A user_id column with millions of distinct values is an excellent index candidate. A status column with three values (active, inactive, deleted) is a poor candidate — querying by status still returns most of the table.
Foreign keys are the most commonly missed index target. PostgreSQL does not automatically create indexes on foreign key columns, but every cascade delete, JOIN, and referential integrity check on a foreign key causes a sequential scan without one.
Quick reference
- High selectivity: index a column where most values are distinct (user_id, email, uuid).
- Low selectivity: don't index boolean, status (3 values), or gender columns — full scan is faster.
- Always index foreign keys — PostgreSQL doesn't do this automatically.
- Partial index: index only a subset of rows WHERE status = 'active'. Smaller, faster for filtered queries.
- Expression index: CREATE INDEX ON users(lower(email)) for case-insensitive lookups.
- pg_stat_user_indexes: query this to find indexes that are never used (idx_scan = 0).
Remember this
Index WHERE clause columns on large tables, always index foreign keys, and remove unused indexes. High cardinality = good candidate.
Composite Indexes and Column Order
A composite index covers multiple columns. The rule that trips up most developers: a composite index on (a, b, c) can satisfy queries that filter on a, on (a, b), or on (a, b, c) — but NOT on (b), (c), or (b, c) alone. The leftmost prefix of the index must be present in the WHERE clause. This is called the leftmost prefix rule.
For a composite index, put the most selective column first when queries filter on individual columns. When queries always filter on a fixed column AND range on another, put the equality column first and the range column second — that way the equality filter narrows the B-tree, and the range scan runs within that narrow set.
Quick reference
- Leftmost prefix rule: (a,b,c) index covers WHERE a=1, WHERE a=1 AND b=2, but NOT WHERE b=2 alone.
- Equality before range: put = filter columns before BETWEEN / > / < columns in a composite index.
- Covering index: include all columns the query reads — avoids heap fetch, enables index-only scan.
- INCLUDE clause (Postgres 11+): CREATE INDEX ON orders(user_id) INCLUDE (status, total) for covering index.
- Index for ORDER BY: include sort column as last index column to enable index scan instead of filesort.
- Use EXPLAIN to verify: look for 'Index Only Scan' (best), 'Index Scan' (good), 'Seq Scan' (investigate).
Remember this
Put equality columns before range columns. Know the leftmost prefix rule. Add INCLUDE columns for covering indexes on hot queries.
Reading EXPLAIN ANALYZE
EXPLAIN ANALYZE is your primary diagnostic tool for slow queries. It shows the query plan the planner chose, the estimated vs actual row counts, and the time spent at each node. The most important numbers: actual rows (too high = bad estimate, bad index), loops (how many times a node ran), and the slowest node by total time.
A huge discrepancy between estimated rows and actual rows indicates stale statistics — run ANALYZE to refresh them. A Seq Scan node on a large table with a small actual row count usually means a missing index or a query that defeats index usage (function on the indexed column, implicit type cast, OR condition).
Quick reference
- EXPLAIN shows estimated cost. EXPLAIN ANALYZE actually runs the query and shows real timings.
- cost=X..Y: X is startup cost, Y is total cost. Lower is better. Compare across nodes to find bottleneck.
- actual time=X..Y: X is time to first row, Y is total time for this node. Highest Y = slowest node.
- loops=N: node ran N times. actual time × loops = total time for that node.
- Rows Removed by Filter: high numbers here mean missing index or non-sargable predicate.
- Buffers: hit=N (from cache), read=N (from disk). Many disk reads → poor cache hit rate or missing index.
- Run ANALYZE tablename after bulk inserts to update statistics so the planner makes good decisions.
Remember this
EXPLAIN ANALYZE before and after adding indexes. Look for Seq Scan on large tables, high 'Rows Removed by Filter', and estimated vs actual row count mismatches.
Four Indexing Mistakes That Kill Performance
Over-indexing is a real problem. Every index you add increases INSERT, UPDATE, and DELETE time because each write must update every index on the table. A table with 15 indexes can write 15 times slower than a table with 3. And unused indexes consume disk space, inflate backups, and confuse the query planner.
The four most common mistakes: indexing a low-cardinality column (status with 3 values), never removing unused indexes, using a function on an indexed column in a WHERE clause (defeating the index), and ignoring OR conditions (which split queries across two index scans).
Quick reference
- Never add an index 'just in case' — profile first, index second.
- Monitor pg_stat_user_indexes.idx_scan — zero scans after a week of traffic = unused index, drop it.
- LIKE 'prefix%' uses a B-tree index. LIKE '%suffix' and LIKE '%middle%' do not — use pg_trgm GIN.
- OR between different columns often can't use a single index. UNION ALL is frequently faster.
- Partial indexes: CREATE INDEX ON orders(user_id) WHERE status = 'active' — much smaller, much faster for active-record queries.
- Concurrent index creation: CREATE INDEX CONCURRENTLY — doesn't lock the table, safe for production.
Remember this
Remove unused indexes, never use functions on indexed columns in WHERE without a matching expression index, and profile before adding any new index.
Indexes are a trade between read speed and write overhead. The discipline is adding them intentionally — profile slow queries with EXPLAIN ANALYZE, add the minimum indexes needed to make them fast, and audit for unused ones monthly.
The four rules that cover 90% of indexing decisions: always index foreign keys, put equality columns before range columns in composite indexes, create expression indexes when your WHERE clause uses a function, and drop indexes with zero scans. Get those right and most query performance problems disappear before they reach production.
Related Articles
Explore this topic