Parallel index-only scans in PostgreSQL

EDB Team Member

In the previous blog, we saw that parallel index scans significantly improve the performance of quite a few TPC-H queries. It is natural to ask whether its sister operator, the index-only scan, benefits similarly from parallelism.

 

Before getting into that, let us briefly discuss the utility of an index-only scan. In PostgreSQL, indexes are stored separately from table data. In a regular index scan, the index is scanned first to find the matching entries, and then the table (heap) is read at those locations only. If a query needs only the values of indexed columns, the required data can be returned from the index alone; there is nothing extra to retrieve from the table. This is what an index-only scan does: it is an access method that answers the query from the index alone, without fetching data from the heap.

 

For example, an index-only scan is likely to outperform a regular index scan for a query such as SELECT count(*) FROM countries WHERE country_area <= <some value>, assuming there is an index on the column country_area. Here, the tuples with a country area below the required value can be counted from the index alone, saving the I/O needed to read the table.
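As a concrete sketch (the countries table and index names here are hypothetical, for illustration only), the setup and plan check might look like:

```sql
-- Hypothetical table and index; names are for illustration only.
CREATE TABLE countries (name text, country_area numeric);
CREATE INDEX idx_countries_area ON countries (country_area);

-- VACUUM keeps the visibility map current, which index-only scans
-- rely on to skip heap fetches.
VACUUM ANALYZE countries;

EXPLAIN SELECT count(*) FROM countries WHERE country_area <= 500000;
-- A plan node reading "Index Only Scan using idx_countries_area on countries"
-- indicates the query is answered from the index, not the heap.
```

Note that the planner can only skip heap fetches for pages marked all-visible in the visibility map, which is why a freshly vacuumed table shows the benefit most clearly.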

 

The design and implementation of parallel index-only scan relies heavily on the machinery developed for scanning B-trees in parallel. No new GUC parameters or configuration settings are required for this scan.
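Since no new settings are introduced, parallel index-only scan is governed by the same planner settings as other parallel scans. A sketch of the relevant existing GUCs (the values shown are illustrative, not recommendations):

```sql
-- Existing parallel-query settings that also govern parallel
-- index-only scans; values here are illustrative.
SET max_parallel_workers_per_gather = 4;    -- workers allowed per Gather node
SET min_parallel_index_scan_size = '512kB'; -- smallest index considered for a parallel scan
```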

 

Performance of parallel index-only scan

 

For the industry-standard TPC-H benchmark at scale factor 300, the performance of Q13 improved by almost 5x with parallel index-only scan. For this experiment, we used a TPC-H-inspired benchmark kit for PostgreSQL at scale factor 300, which yields a database of 300+ GB, depending on the indexes created. The additional indexes we created were on the columns l_shipmode, l_shipdate, o_orderdate, and o_comment. We tuned the following parameters:

 

  • random_page_cost = 0.1
  • seq_page_cost = 0.1
  • effective_cache_size = 10GB
  • shared_buffers = 10GB
  • work_mem = 1GB
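In session form, the tuning above corresponds roughly to the following (a sketch; shared_buffers cannot be changed per session):

```sql
-- Session-level equivalents of the tuned settings above.
SET random_page_cost = 0.1;
SET seq_page_cost = 0.1;
SET effective_cache_size = '10GB';
SET work_mem = '1GB';
-- shared_buffers = 10GB must be set in postgresql.conf and
-- requires a server restart to take effect.
```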

Q13 Query plan on v9.6

————————————

Limit
  ->  Sort
        Sort Key: (count(*)) DESC, (count(orders.o_orderkey)) DESC
        ->  HashAggregate
              Group Key: count(orders.o_orderkey)
              ->  GroupAggregate
                    Group Key: customer.c_custkey
                    ->  Merge Left Join
                          Merge Cond: (customer.c_custkey = orders.o_custkey)
                          ->  Index Only Scan using customer_pkey on customer
                          ->  Index Scan using idx_orders_custkey on orders

Execution time: 4146177.735 ms

Q13 Query plan on v10

————————————

Limit
  ->  Sort
        ->  HashAggregate
              Group Key: count(orders.o_orderkey)
              ->  Finalize GroupAggregate
                    Group Key: customer.c_custkey
                    ->  Gather Merge
                          ->  Partial GroupAggregate
                                Group Key: customer.c_custkey
                                ->  Sort
                                      ->  Parallel Hash Left Join
                                            Hash Cond: (customer.c_custkey = orders.o_custkey)
                                            ->  Parallel Index Only Scan using customer_pkey on customer
                                            ->  Parallel Hash
                                                  ->  Parallel Seq Scan on orders

Execution Time: 739088.785 ms

 

We can see that the query involves aggregation and a join on the primary key of the customer table. When a parallel index-only scan is used, both the aggregation and the join can be performed in parallel, which is what improves the query's performance.
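For reference, TPC-H Q13 as published in the specification is shown below (with the default substitution parameters; the Limit node in the plans above suggests the benchmark kit we used appends a LIMIT clause):

```sql
-- TPC-H Q13: distribution of customers by number of orders,
-- excluding orders whose comment matches the given pattern.
SELECT c_count, count(*) AS custdist
FROM (
    SELECT c_custkey, count(o_orderkey) AS c_count
    FROM customer
    LEFT OUTER JOIN orders
      ON c_custkey = o_custkey
     AND o_comment NOT LIKE '%special%requests%'
    GROUP BY c_custkey
) AS c_orders
GROUP BY c_count
ORDER BY custdist DESC, c_count DESC;
```

The per-customer aggregate over c_custkey and the left join on the customer primary key are exactly the steps that the parallel plan distributes across workers.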

 

I would like to close this discussion with a thank-you note to my colleague Amit Kapila, who helped and supported me throughout this project. I would also like to thank my employer, EnterpriseDB, for giving me this opportunity.