cancel
Showing results for 
Search instead for 
Did you mean: 

Index Order Does Matter

PostgreSQL Core Team - EDB

4/12/2017

 

Postgres has supported multi-column indexes since 1997, e.g. create index i_test on test (a, b, c). It can easily use an index if the supplied columns are all at the front of the index, e.g. a and b in the previous index, but it can also use the index if some of the indexed column values are not supplied, e.g. columns a and c in the previous index. It does this by looking up a in the index, then looking through the index for matches of c, ignoring values of b,

 

e.g.

CREATE TABLE test (a INT, b INT, c INT);
INSERT INTO test
        SELECT x, x, x FROM generate_series(1, 100000) AS f(x);
CREATE INDEX i_test ON test(a, b, c);
ANALYZE test;

EXPLAIN (COSTS false)
SELECT * FROM test WHERE a = 1 AND c = 1;
              QUERY PLAN
--------------------------------------
 Index Only Scan using i_test on test
   Index Cond: ((a = 1) AND (c = 1))

Of course, if possible, you should always put the most commonly supplied columns first in the index because skipping columns during index scans (called "index skip scans") is expensive.

However, if you are supplying all the column values referenced in the index, I assumed it didn't matter what order the columns were specified in the index, but Robert Haas recently mentioned this is not always true. For example, if a has many duplicate values, and b has mostly unique values, having a at the start of the index is suboptimal — better to use b first, which will more effectively narrow the search space when looking for matches on a.

A more specific example of this is range queries. In that case, if the range test of one column is less restrictive than the equality test of another, it would be better for the equality test column to be first in an index. For example, in the queries below, the first query uses an index because it is very restrictive on the first column, the second query uses an index because it is moderately restrictive on the first column, while the third does not use an index because it is effectively unrestrictive on the first column:

EXPLAIN SELECT * FROM test WHERE a = 1 AND b >= 1 AND b <= 100000;
                               QUERY PLAN
-------------------------------------------------------------------------
 Index Only Scan using i_test on test  (cost=0.42..8.44 rows=1 width=12)
   Index Cond: ((a = 1) AND (b >= 1) AND (b <= 100000))

EXPLAIN SELECT * FROM test WHERE a >= 1 AND a <= 50000 AND b = 1;
                                 QUERY PLAN
----------------------------------------------------------------------------
 Index Only Scan using i_test on test  (cost=0.42..1404.10 rows=1 width=12)
   Index Cond: ((a >= 1) AND (a <= 50000) AND (b = 1))

EXPLAIN SELECT * FROM test WHERE a >= 1 AND a <= 100000 AND b = 1;
                       QUERY PLAN
--------------------------------------------------------
 Seq Scan on test  (cost=0.00..2291.00 rows=1 width=12)
   Filter: ((a >= 1) AND (a <= 100000) AND (b = 1))

Notice the increasing costs, even though all queries match one indexed row.

Obviously, in cases where you are not specifying all indexed columns in every query, you should put the most frequently referenced columns first in the index to avoid the overhead of index skip scans. However, for cases where most indexed columns are going to be supplied in queries, placing the most restrictive columns first in indexes is a win.

 

Bruce Momjian is Senior Data Architect at EnterpriseDB. 

This post originally appeared on Bruce's personal blog.