DatabasesSQLPerformancePostgreSQLQuery Optimization

Database Indexing Explained: B-Trees, Composite Indexes, and Query Optimization

July 4, 202614 min read

A query that takes 4 seconds without an index takes 0.2ms with one. That's a 20,000x improvement from a single line of SQL. Indexes are the most impactful performance tool available to application developers — and the most misunderstood. Many developers add indexes reactively (when a query is already slow in production), never remove unused ones, and don't understand why index column order matters.

This article explains how B-tree indexes work internally, how to choose which columns to index, the rules for composite index column ordering, how to read an EXPLAIN plan to diagnose slow queries, and the four indexing mistakes that silently kill write performance.

How B-Tree Indexes Work

A B-tree (Balanced Tree) index is a self-balancing tree where every leaf node holds the indexed value and a pointer to the corresponding table row (the heap tuple in PostgreSQL). The tree keeps all leaves at the same depth, so lookups always cost O(log n) time regardless of table size. A table with 1 billion rows and a B-tree index needs at most 30 comparisons to find a row — compared to a full sequential scan of up to 1 billion rows.

When PostgreSQL evaluates a query, the query planner compares the cost of a sequential scan (read every page) against an index scan (follow the B-tree, then fetch specific pages). For highly selective queries (returning <5% of rows), index scans win. For queries returning most of the table, sequential scans win — reading 80% of the table via index causes more random I/O than a single sequential pass.

Quick reference

B-tree: default index type, supports =, <, >, BETWEEN, LIKE 'prefix%', ORDER BY, and range queries.
Hash index: only supports exact = equality. Faster for equality-only lookups, but no range queries.
GIN (Generalized Inverted Index): for array, JSONB, and full-text search (any value → which rows contain it).
GiST: for geometric types, full-text search (tsvector), and PostGIS spatial data.
BRIN (Block Range Index): for naturally ordered data (timestamps). Tiny size, poor selectivity.
Index overhead: every index adds a write cost on INSERT, UPDATE, DELETE. Each write must update the index tree.

Remember this

B-tree is the right default. Hash for pure equality, GIN for arrays/JSONB, GiST for spatial/text. Don't add index types you don't understand.

Which Columns to Index

Not every column needs an index. The question is: does this column appear in WHERE, JOIN ON, or ORDER BY clauses in queries that are hot (run frequently or on large tables)? High-cardinality columns — those with many distinct values — benefit most from indexing. A user_id column with millions of distinct values is an excellent index candidate. A status column with three values (active, inactive, deleted) is a poor candidate — querying by status still returns most of the table.

Foreign keys are the most commonly missed index target. PostgreSQL does not automatically create indexes on foreign key columns, but every cascade delete, JOIN, and referential integrity check on a foreign key causes a sequential scan without one.

Quick reference

High selectivity: index a column where most values are distinct (user_id, email, uuid).
Low selectivity: don't index boolean, status (3 values), or gender columns — full scan is faster.
Always index foreign keys — PostgreSQL doesn't do this automatically.
Partial index: index only a subset of rows WHERE status = 'active'. Smaller, faster for filtered queries.
Expression index: CREATE INDEX ON users(lower(email)) for case-insensitive lookups.
pg_stat_user_indexes: query this to find indexes that are never used (idx_scan = 0).

Before

Common slow queries that need indexes

1-- Sequential scan: no index on user_id2SELECT * FROM orders WHERE user_id = 123;3 4-- Sequential scan: no index on foreign key5SELECT o.*, u.email6FROM orders o7JOIN users u ON u.id = o.user_id;  -- full scan of orders to find user_id8 9-- Sequential scan: filtering by created_at range10SELECT * FROM events11WHERE created_at BETWEEN '2026-01-01' AND '2026-02-01';12 13-- Sequential scan: sorting without index14SELECT * FROM products ORDER BY price ASC LIMIT 10;

After

Indexes that make the above queries fast

1-- Index on the lookup column2CREATE INDEX idx_orders_user_id ON orders(user_id);3 4-- Always index foreign keys PostgreSQL doesn't auto-create these5CREATE INDEX idx_orders_user_id ON orders(user_id);6 7-- Index on timestamp for range queries8CREATE INDEX idx_events_created_at ON events(created_at);9 10-- Index for ORDER BY + LIMIT (index-only sort)11CREATE INDEX idx_products_price ON products(price);12 13-- Check index usage14EXPLAIN (ANALYZE, BUFFERS)15SELECT * FROM orders WHERE user_id = 123;16-- Look for: "Index Scan using idx_orders_user_id" — good17-- Avoid: "Seq Scan" on large tables

Remember this

Index WHERE clause columns on large tables, always index foreign keys, and remove unused indexes. High cardinality = good candidate.

Composite Indexes and Column Order

A composite index covers multiple columns. The rule that trips up most developers: a composite index on (a, b, c) can satisfy queries that filter on a, on (a, b), or on (a, b, c) — but NOT on (b), (c), or (b, c) alone. The leftmost prefix of the index must be present in the WHERE clause. This is called the leftmost prefix rule.

For a composite index, put the most selective column first when queries filter on individual columns. When queries always filter on a fixed column AND range on another, put the equality column first and the range column second — that way the equality filter narrows the B-tree, and the range scan runs within that narrow set.

Quick reference

Leftmost prefix rule: (a,b,c) index covers WHERE a=1, WHERE a=1 AND b=2, but NOT WHERE b=2 alone.
Equality before range: put = filter columns before BETWEEN / > / < columns in a composite index.
Covering index: include all columns the query reads — avoids heap fetch, enables index-only scan.
INCLUDE clause (Postgres 11+): CREATE INDEX ON orders(user_id) INCLUDE (status, total) for covering index.
Index for ORDER BY: include sort column as last index column to enable index scan instead of filesort.
Use EXPLAIN to verify: look for 'Index Only Scan' (best), 'Index Scan' (good), 'Seq Scan' (investigate).

Before

Wrong column order — index skipped

1-- Index: (status, created_at) — wrong order for this query2CREATE INDEX idx_orders_status_date ON orders(status, created_at);3 4-- This query CAN use the index (starts with status)5SELECT * FROM orders WHERE status = 'pending' AND created_at > '2026-01-01';6 7-- This query CANNOT use the index (skips leading column status)8SELECT * FROM orders WHERE created_at > '2026-01-01';9-- Result: sequential scan — created_at is not the leftmost column

After

Correct column order — index used for both queries

1-- Equality column first, range column second2CREATE INDEX idx_orders_status_date ON orders(status, created_at);3 4-- Query 1: filters on both — uses index efficiently5SELECT * FROM orders6WHERE status = 'pending'7  AND created_at BETWEEN '2026-01-01' AND '2026-02-01';8 9-- For range-only queries on created_at, add a separate index10CREATE INDEX idx_orders_created_at ON orders(created_at);11 12-- Index on (tenant_id, user_id, created_at DESC) — common SaaS pattern13-- Supports: WHERE tenant_id=X AND user_id=Y ORDER BY created_at DESC LIMIT 2014CREATE INDEX idx_orders_tenant_user_date15  ON orders(tenant_id, user_id, created_at DESC);

Remember this

Put equality columns before range columns. Know the leftmost prefix rule. Add INCLUDE columns for covering indexes on hot queries.

Reading EXPLAIN ANALYZE

EXPLAIN ANALYZE is your primary diagnostic tool for slow queries. It shows the query plan the planner chose, the estimated vs actual row counts, and the time spent at each node. The most important numbers: actual rows (too high = bad estimate, bad index), loops (how many times a node ran), and the slowest node by total time.

A huge discrepancy between estimated rows and actual rows indicates stale statistics — run ANALYZE to refresh them. A Seq Scan node on a large table with a small actual row count usually means a missing index or a query that defeats index usage (function on the indexed column, implicit type cast, OR condition).

Quick reference

EXPLAIN shows estimated cost. EXPLAIN ANALYZE actually runs the query and shows real timings.
cost=X..Y: X is startup cost, Y is total cost. Lower is better. Compare across nodes to find bottleneck.
actual time=X..Y: X is time to first row, Y is total time for this node. Highest Y = slowest node.
loops=N: node ran N times. actual time × loops = total time for that node.
Rows Removed by Filter: high numbers here mean missing index or non-sargable predicate.
Buffers: hit=N (from cache), read=N (from disk). Many disk reads → poor cache hit rate or missing index.
Run ANALYZE tablename after bulk inserts to update statistics so the planner makes good decisions.

Before

Slow query — Seq Scan on 2M rows

1EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)2SELECT id, email FROM users3WHERE lower(email) = 'alice@example.com';4 5-- Output:6-- Seq Scan on users  (cost=0.00..58234.00 rows=1 width=36)7--                    (actual time=1234.5..1234.5 rows=1 loops=1)8--   Filter: (lower(email) = 'alice@example.com')9--   Rows Removed by Filter: 199999910-- Planning Time: 0.1 ms11-- Execution Time: 1234.6 ms

After

After expression index — Index Scan on 1 row

1-- Create expression index matching the query predicate2CREATE INDEX idx_users_email_lower ON users(lower(email));3 4-- Re-run explain5EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)6SELECT id, email FROM users7WHERE lower(email) = 'alice@example.com';8 9-- Output:10-- Index Scan using idx_users_email_lower on users11--                    (cost=0.43..8.45 rows=1 width=36)12--                    (actual time=0.1..0.1 rows=1 loops=1)13--   Index Cond: (lower(email) = 'alice@example.com')14-- Planning Time: 0.2 ms15-- Execution Time: 0.2 ms  ← 6000x faster

Remember this

EXPLAIN ANALYZE before and after adding indexes. Look for Seq Scan on large tables, high 'Rows Removed by Filter', and estimated vs actual row count mismatches.

Four Indexing Mistakes That Kill Performance

Over-indexing is a real problem. Every index you add increases INSERT, UPDATE, and DELETE time because each write must update every index on the table. A table with 15 indexes can write 15 times slower than a table with 3. And unused indexes consume disk space, inflate backups, and confuse the query planner.

The four most common mistakes: indexing a low-cardinality column (status with 3 values), never removing unused indexes, using a function on an indexed column in a WHERE clause (defeating the index), and ignoring OR conditions (which split queries across two index scans).

Quick reference

Never add an index 'just in case' — profile first, index second.
Monitor pg_stat_user_indexes.idx_scan — zero scans after a week of traffic = unused index, drop it.
LIKE 'prefix%' uses a B-tree index. LIKE '%suffix' and LIKE '%middle%' do not — use pg_trgm GIN.
OR between different columns often can't use a single index. UNION ALL is frequently faster.
Partial indexes: CREATE INDEX ON orders(user_id) WHERE status = 'active' — much smaller, much faster for active-record queries.
Concurrent index creation: CREATE INDEX CONCURRENTLY — doesn't lock the table, safe for production.

Before

Query patterns that defeat indexes

1-- 1. Function on indexed column — index on email is NOT used2SELECT * FROM users WHERE UPPER(email) = 'ALICE@EXAMPLE.COM';3-- Fix: CREATE INDEX ON users(upper(email)) or use lower() consistently4 5-- 2. Implicit type cast — index on user_id (integer) not used6SELECT * FROM orders WHERE user_id = '123';  -- string '123' != integer 1237-- Fix: match types, or cast explicitly8 9-- 3. LIKE with leading wildcard — B-tree index cannot help10SELECT * FROM products WHERE name LIKE '%widget%';11-- Fix: use GIN + pg_trgm for substring search, or full-text search12 13-- 4. OR condition — may not use index efficiently14SELECT * FROM orders WHERE status = 'pending' OR user_id = 456;15-- Fix: UNION ALL of two indexed queries instead of OR

After

Fixed versions that use indexes

1-- 1. Expression index matches the query predicate exactly2CREATE INDEX ON users(upper(email));3SELECT * FROM users WHERE UPPER(email) = 'ALICE@EXAMPLE.COM'; -- uses index4 5-- 2. Explicit type match6SELECT * FROM orders WHERE user_id = 123;  -- integer literal7 8-- 3. Trigram index for substring search9CREATE EXTENSION IF NOT EXISTS pg_trgm;10CREATE INDEX ON products USING GIN(name gin_trgm_ops);11SELECT * FROM products WHERE name LIKE '%widget%';  -- uses GIN index12 13-- 4. UNION ALL instead of OR14SELECT * FROM orders WHERE status = 'pending'15UNION ALL16SELECT * FROM orders WHERE user_id = 456 AND status != 'pending';17 18-- Find unused indexes (run after steady-state traffic)19SELECT indexrelname, idx_scan20FROM pg_stat_user_indexes21WHERE idx_scan = 0 AND schemaname = 'public'22ORDER BY pg_relation_size(indexrelid) DESC;

Remember this

Remove unused indexes, never use functions on indexed columns in WHERE without a matching expression index, and profile before adding any new index.

Key takeaway

Indexes are a trade between read speed and write overhead. The discipline is adding them intentionally — profile slow queries with EXPLAIN ANALYZE, add the minimum indexes needed to make them fast, and audit for unused ones monthly.

The four rules that cover 90% of indexing decisions: always index foreign keys, put equality columns before range columns in composite indexes, create expression indexes when your WHERE clause uses a function, and drop indexes with zero scans. Get those right and most query performance problems disappear before they reach production.

SQLNoSQL

SQL vs NoSQL: A Developer's Comparison

SQL and NoSQL are not opposites — they solve different data problems. SQL databases store data in normalized tables with…

Read

ScalingSystem Design

Horizontal vs Vertical Scaling

Scaling is how a system handles more users, data, or traffic. Vertical scaling (scale up) means giving your existing ser…

Read

CachingRedis

Caching Strategies: Redis, CDN, and Multi-Level Caches

Caching is the fastest way to scale a system without changing your database or adding servers. Done right, it cuts datab…

Read