Amazon's DocumentDB, MongoDB, and TechCrunch
Over on TechCrunch, you can find an article posted just yesterday with the inflammatory title AWS gives open source the middle finger. The premise of the article is that, by creating a product which aims to provide compatibility with MongoDB, Amazon is attacking open source. This is a false narrative. MongoDB recently relicensed its code in such a way that it can no longer be used by cloud providers without paying license fees to MongoDB. In essence, then, TechCrunch is claiming that because Amazon is choosing to run code for which they don't have to pay license fees instead of code for which they do have to pay license fees, they're attacking open source.
Well, no. There is plenty of room to debate definitional questions around open source. For example, is open source even the right goal, or should we be aiming for something different, like free software? Is the GPLv3 a step forward from GPLv2, or not a great idea? Wherever you come down on those questions, I think virtually everyone involved with open source or free software would agree that part of the appeal of such software is that you don't have to pay license fees to use it.
Relevantly, it is also not clear whether or not MongoDB's license will meet with the approval of the Open Source Initiative (see relevant email threads about SSPL v1 and SSPL v2). MongoDB is trying to claim that their new license is open source, but if OSI rejects their new license text as incompatible with the Open Source Definition, a lot of people are going to feel that it isn't open source regardless of what MongoDB cares to claim. It's particularly hard to view Amazon's move as an attack on open source when it is no longer even clear that MongoDB is open source. If this move is an attack on anyone, it's an attack on MongoDB, not on open source.
To be clear, I'm not saying here that Amazon isn't attacking open source in general, or that they are, just that this particular claim - which I'm inclined to attribute as much to MongoDB as to TechCrunch - is misleading. If you're looking for a database which is unambiguously open source and, for that matter, not controlled by any single company large or small, you might want to ignore both DocumentDB and MongoDB and check out PostgreSQL, whose license is indeed OSI-approved (more info), and which has just won the DB-Engines Database of the Year award for the second year in a row.
(Article originally published in Robert's personal blog - Thursday, January 10, 2019, @ http://rhaas.blogspot.com/2019/01/amazons-documentdb-mongodb-and.html )
... View more
I admit it: I invented force_parallel_mode. I believed then, and still believe now, that it is valuable for testing purposes. Certainly, testing using force_parallel_mode=on or force_parallel_mode=regress has uncovered many bugs in PostgreSQL's parallel query support that would otherwise have been very difficult to find. At the same time, it's pretty clear that this setting has caused enormous confusion, even among PostgreSQL experts. In fact, in my experience, almost everyone who sets force_parallel_mode is doing so for the wrong reasons.
At least in my experience, wh at typically happens is that some query which a user believes should have used parallel query does not do so. The user then tries to fix the problem by setting force_parallel_mode=on. In reality, this does not solve the problem at all. It only makes things worse. However, it sometimes gives users the impression that they have solved the problem, which may be why people keep doing it (or maybe we need better documentation -- suggestions welcome). Let's taken an example.
rhaas=# explain select * from pgbench_accounts;
Seq Scan on pgbench_accounts (cost=0.00..105574.00 rows=4000000 width=97)
rhaas=# set force_parallel_mode = on;
rhaas=# explain select * from pgbench_accounts;
Gather (cost=1000.00..506574.00 rows=4000000 width=97)
Workers Planned: 1
Single Copy: true
-> Seq Scan on pgbench_accounts (cost=0.00..105574.00 rows=4000000 width=97)
There's a very good reason why the query planner doesn't try to use parallel query here: it won't make the query run faster. Parallel query speeds up queries where the amount of work that must be performed is large compared to the number of output tuples. Here, because there are no joins, no filter conditions, and no aggregates, there is very little work to do per output tuple: the system just needs to fetch each one and send it to the user. Adding parallel workers won't make that any faster. A single process is already very capable of sending tuples to the user as fast as the user can read them. And, even if it isn't, adding more workers won't help. When you do, instead of the main process needing to fetch each tuple from the table, it will need to fetch each tuple from a worker. It's not really doing any less work. And the workers are now using up memory and CPU time, possibly competing with the leader. So adding parallel workers to this plan will actually make it slower.
This is a pretty clear example of a case where you just shouldn't use (or want to use) parallel query, but there are other examples that are less clear-cut. If you encounter one of those cases, you might very reasonably want to convince the planner to try out a parallel plan. After all, the planner can be wrong. Maybe the plan which the planner thinks is cheapest does not use parallel query, but the plan that actually runs the fastest does use parallel query. It can be very valuable to adjust planner settings to try out alternative plans in the situation, but se tting force_parallel_mode is n ot the right way to do it, for a couple of reasons.
First, setting force_parallel_mode will only ever give you 1 parallel worker. Second, setting force_parallel_mode creates a plan where the leader just sits around reading tuples from the worker while the worker does all of the real work. In a regular parallel query plan, all of the processes cooperate, so the leader may help to run the parallel portion of the plan if it's not too busy reading tuples, and the workers also run the parallel portion of the plan. In a plan created by setting force_parallel_mode, this doesn't happen. There's actually a clue right in the plan itself. It says "Single Copy: true". What that means is that only a single copy of the plan beneath the Gather is permitted to be executed. Therefore, the single worker is going to run the plan, and the leader is going to sit around, wait for tuples to show up, and then relay them to the client. This is slower than just having the leader do all the work itself, which is what would have happened if you had not specified force_parallel_mode.
In short, setting force_parallel_mode is useless for the purpose for which most people seem to be trying to use it. It will not make your queries run faster; it will slow them down. The correct way to test out plans that use parallelism more aggressively is to reduce the planner's estimate of how much parallelism costs. By default, parallel_setup_cost=1000 and parallel_tuple_cost=0.1. If you want more of your query plans to run in parallel, reduce these values. Let's see what happens if we try that:
rhaas=# set parallel_setup_cost = 10;
rhaas=# set parallel_tuple_cost = 0.001;
rhaas=# explain select * from pgbench_accounts;
Gather (cost=10.00..86250.67 rows=4000000 width=97)
Workers Planned: 2
-> Parallel Seq Scan on pgbench_accounts (cost=0.00..82240.67 rows=1666667 width=97)
As you can see, the results here are different than when I used force_parallel_mode. First, the plan underneath the Gather is now a Parallel Seq Scan, not just a regular Seq Scan. That means that the plan is safe for multiple workers to execute in parallel. As a result, the Single Copy flag on the Gather has disappeared. This means that all of the workers, as well as the leader, will be able to cooperate in executing the plan, so there is a chance that this plan will be faster, whereas when we set force_parallel_mode, there was no chance. Finally, we now have 2 workers rather than just 1.
In fact, because this query is unsuitable for parallel query, neither approach speeds it up. On my laptop, the first plan (non-parallel) takes 3.26 seconds, the second plan (force_parallel_mode) takes 3.49 seconds, and the third plan (real parallelism) takes 3.55 seconds. Therefore the best option here happens to be what the planner chose originally: no parallelism. Of course, results may vary in other cases, but the thing to remember is that real parallelism is what you want, and force_parallel_mode will not give it to you. If you cannot get real parallelism even after reducing parallel_setup_cost and parallel_tuple_cost, you can also try adjusting min_parallel_table_scan_size and min_parallel_index_scan_size (on 9.6, there is only one setting, and it is called min_parallel_relation_size). If you still can't get parallelism, it might be disabled, or it might not be supported for your query type. See When Can Parallel Query Be Used? in the PostgreSQL documentation for more details.
So, if force_parallel_mode doesn't make queries run faster, what does it do, and when would anyone ever want to use it? I designed force_parallel_mode not as a tool for speeding up queries, but as a tool for finding cases when parallel query doesn't give the right answers. For example, suppose you do this:
rhaas=# create or replace function give10(int) returns void as $$update pgbench_accounts set abalance = abalance + 10 where aid = $1$$ language sql parallel safe;
rhaas=# select give10(aid) from pgbench_accounts where aid = 10;
If you are familiar with the limitations of parallel query, you may see that there's a problem here. I have marked the function I created as PARALLEL SAFE, but it does an UPDATE, and parallel query does not support UPDATE. If the planner ever tries to use the give10(int) function inside a parallel query plan, it's going to fail. However, in this simple test case, everything appears to work, because the SELECT query I've used as a test only fetches one row, and therefore the planner chooses an Index Scan and does not attempt to use parallel query.
This situation can be a sort of ticking time bomb. It may seem to work for a while, but the planner might later switch to a different plan, and that new plan may use parallel query where the old one did not, and now things will start to fail. In the worst case, this could result in a production outage. If I test with force_parallel_mode, I will find the problem right away:
rhaas=# set force_parallel_mode = on;
rhaas=# select give10(aid) from pgbench_accounts where aid = 10;
ERROR: cannot execute UPDATE during a parallel operation
CONTEXT: SQL function "give10" during startup
Setting force_parallel_mode caused the query planner to place a Single-Copy Gather node on top of the plan it would otherwise have used, which caused give10(int) to execute in the worker, rather than the leader. Then it failed. Now the problem is easy to fix:
rhaas=# alter function give10(int) parallel unsafe;
rhaas=# set force_parallel_mode = on;
rhaas=# select give10(aid) from pgbench_accounts where aid = 10;
Because the function is now (correctly) marked as parallel-unsafe, the planner won't use parallelism even with force_parallel_mode=on. In other words, the value of testing with force_parallel_mode=on is that it may allow you to identify functions that are incorrectly marked as parallel-safe. You can then adjust the markings on those functions and avoid unexpected failures later.
(Article originally published in Robert's personal blog - June 21, 2018 @http://rhaas.blogspot.com/2018/06/using-forceparallelmode-correctly.html)
... View more
Built-in sharding is something that many people have wanted to see in PostgreSQL for a long time. It would be a gross exaggeration to say that PostgreSQL 11 (due to be released this fall) is capable of real sharding, but it seems pretty clear that the momentum is building. The capabilities already added are independently useful, but I believe that some time in the next few years we're going to reach a tipping point. Indeed, I think in a certain sense we already have. Just a few years ago, there was serious debate about whether PostgreSQL would ever have built-in sharding. Today, the question is about exactly which features are still needed. If you haven't followed progress in this area closely, I highly recommend that you read the Built-in Sharding page which my colleague Bruce Momjian wrote up for the PostgreSQL wiki in December of 2016 as well as the very interesting slides which Etsuro Fujita, Kyotaro Horiguchi, Masahiko Sawada, and Amit Langote presented at PGCONF.ASIA 2016. (Note that the atomic commit feature mentioned in that presentation did not make it into PostgreSQL 11.) One way to look at sharding is as a form of partitioning where the partitions might happen to be foreign tables rather than local tables. From that point of view, the fact that PostgreSQL 11 made huge improvements in the area of partitioning is very significant. Most of those improvements are just as relevant for sharding as they are for partitioning, and some of them are considerably more beneficial for sharding. In particular, hash partitioning is very useful for dividing data evenly across a set of remote servers, and partition-wise aggregate allows the aggregate pushdown capability added in PostgreSQL 10 to work with partitioned tables. One of the most intimidating remaining problems is that the PostgreSQL executor can't run queries against multiple foreign partitions at the same time. Because of all the work that has been done over the last several years on join pushdown, aggregate pushdown, partition-wise join, and partition-wise aggregate, it's possible in PostgreSQL 11devel to get a query plan that looks (approximately) like this: Append
-> Foreign Scan
Relations: Aggregate on orders1 INNER JOIN order_lines1
-> Foreign Scan
Relations: Aggregate on orders2 INNER JOIN order_lines2
-> Foreign Scan
Relations: Aggregate on orders3 INNER JOIN order_lines3 This is a big improvement over what was possible just a few years ago. The join in the original query has been decomposed into joins between the matching pairs of partitions, and likewise the aggregate has been made to happen on a per-partition basis. Then, the join and aggregate has been pushed to the remote side. That is great. What is not so great is that the foreign scans will be executed consecutively, not concurrently. Several attempts have been made to solve this problem, but those designs have either handled only a subset of the important cases or carried a significant risk of slowing down the executor in general. My colleague Andres Freund is now working on this problem and believes that he can solve this problem in a general way while speeding up the executor. Apart from that problem, I think there are basically three major areas that need work: 1. More pushdown. The example above shows a case where we manage to push nearly all of the work to the remote side, but there are plenty of important cases where that won't happen, such as a join of a sharded table to a small local table. Today, we'll pull the remote data to the local node, but we really ought to consider pushing the local table to the remote node (or using a replicated copy that already exists there). Similarly, partition-wise aggregate won't succeed in pushing the aggregate to the remote side unless the partition key matches the grouping key; we should consider performing a partial aggregate on the remote side. Work done over the last several years has made quite a few of the simple cases work, but there is a lot more to do. There is also other performance work which needs to be done, such as speeding up bulk loading of postgres_fdw foreign tables, as well as infrastructure improvements to improve plan quality and planning speed in cases involving foreign tables. 2. ACID. Right now, if a server fails while committing a write transaction that involves multiple nodes (via postgres_fdw or some other foreign data wrapper), you might end up with the transaction committed on some nodes but not others. There's a patch for that, but it didn't make it into PostgreSQL 11. Similarly, we'd really like to have MVCC snapshots that span the entire cluster, so that if you start and commit a transaction T1, a subsequently-started transaction T2 will either see T1 as committed on every node or as not-committed on every node. Several attempts have been made to solve this problem - most recently by Stas Kelvich. I think these are quite difficult problems, and many of the solutions proposed to date have notable disadvantages, but I don't think the problems are intractable. Another, related area that also needs attention is global deadlock detection. 3. Provisioning and Management. It will be nice to be able to set up partitioned tables with foreign partitions, create matching tables on the remote side, configure some sort of global transaction manager to provide MVCC semantics and run queries that are well-optimized and return results quickly. However, getting it all working may be more manual than many users would prefer. Some thought needs to be given to making it easier to set up and configure, and also to issues such as backup and failover in a distributed environment. I don't think it's possible to identify a strict priority order between these items. A few years ago, I took the position that pushdown capabilities were more important than anything else. A system that lacks good provisioning tools and ACID semantics may still get used if it is fast enough; a system that is too slow will not be useful regardless of anything else. In light of the development already completed, though, the situation seems less clear. If you happen to be using queries that are well-handled by the pushdown capabilities in v11, further development in that area may not seem like a high priority, but if not, it may seem like the top priority. Similarly, if you are mostly running queries against static data, or against a data set where there are new bulk loads but no modification of existing data, or if your alternative is a NoSQL solution that doesn't even deliver ACID semantics on a single node, then you might not care very much about ACID semantics across nodes; but if you're running a highly transactional workload that relies on those semantics for correctness, you probably care a great deal. Finally, if you're an expert PostgreSQL DBA, or have some really good tools to help you out, provisioning and management may not seem unreasonably difficult; if you're new to PostgreSQL, or at least to this area, you may have no idea where to start. I don't believe that anyone has a complete vision of what a built-in sharding feature for PostgreSQL should look like just yet. Certainly, I don't. At some point, perhaps a point not too far from now, the project may reach a point where such a vision is absolutely indispensable to further progress. It's not too soon to start thinking about what that vision ought to be. At the same time, even without such a vision, a great deal of progress has been made. I think we are close to achieving critical mass, and I'm looking forward to seeing what the future holds. (Article originally published in Robert's personal blog - May 17, 2018 @http://rhaas.blogspot.com/2018/05/built-in-sharding-for-postgresql.html#more)
... View more
What if PostgreSQL didn’t need VACUUM at all? This seems hard to imagine. After all, PostgreSQL uses multi-version concurrency control (MVCC), and if you create multiple versions of rows, you have to eventually get rid of the row versions somehow. In PostgreSQL, VACUUM is in charge of making sure that happens, and the autovacuum process is in charge of making sure that happens soon enough. Yet, other schemes are possible, as shown by the fact that not all relational databases handle MVCC in the same way, and there are reasons to believe that PostgreSQL could benefit significantly from adopting a new approach. In fact, many of my colleagues at EnterpriseDB are busy implementing a new approach, and today I’d like to tell you a little bit about what we’re doing and why we’re doing it. While it’s certainly true that VACUUM has significantly improved over the years, there are some problems that are very difficult to solve in the current system structure. Because old row versions and new row versions are stored in the same place - the table, also known as the heap - updating a large number of rows must, at least temporarily, make the heap bigger. Depending on the pattern of updates, it may be impossible to easily shrink the heap again afterwards. For example, imagine loading a large number of rows into a table and then updating half of the rows in each block. The table size must grow by 50% to accommodate the new row versions. When VACUUM removes the old versions of those rows, the original table blocks are now all 50% full. That space is available for new row versions, but there is no easy way to move the rows from the new newly-added blocks back to the old half-full blocks: you can use VACUUM FULL or you can use third-party tools like pg_repack, but either way you end up rewriting the whole table. Proposals have been made to try to relocate rows on the fly, but it’s hard to do correctly and risks bloating the indexes, since each row moved requires a new entry in each index to point to the new location of the row. If heap bloat is caused by a single gigantic update, it can often be avoided by breaking down the large update into a series of smaller updates, running VACUUM in between or giving autovacuum a chance to do so. In this way, old row versions from the first update are reclaimed and become free space that can be reused for the second update, and so bloat is reduced. However, there are some access patterns where the table becomes bloated not because of one big update but from many small updates occurring over a long period of time. A simple example is to open a transaction which does a single-row UPDATE and then remains idle for a long time; meanwhile, other transactions continue to do writes to the database, large or small. Whichever tables are being frequently modified will bloat, and once again, there’s no easy way to shrink them again afterwards. To be clear, I’m not saying that it’s a particularly good idea to open a write transaction and then leave the session idle for a long time; typically, when this happens, it’s the result of a poorly written client application that forgot it had a transaction open. Moreover, any relational database is going to suffer under such a workload. In my view, the problem isn’t so much that PostgreSQL can’t cope with such a situation gracefully - that would be too much to expect - but that it’s potentially quite painful to recover afterward. Long-running reporting queries can create similar problems. To put this another way, it is in general true that PostgreSQL’s VACUUM implementation has gotten progressively better at reclaiming space occupied by dead tuples more quickly and with less expenditure of effort. And that’s really good, because the faster you reclaim space, the less new space you end up allocating, which keeps tables small and performance high. However, the examples above show that VACUUM isn’t the whole problem. In these examples, even if VACUUM ran at the earliest instant when it could reclaim the space occupied by dead tuples and ran infinitely fast, the table would still become bloated. In the case where the bloat is caused by many short queries run while one long-running transaction remains open, we could, with smarter snapshot management, limit the worst-case bloat to approximately a factor of two -- that is, we’d keep the version of the tuple visible to the old snapshot and the current version, and discard the intermediate versions, a trick PostgreSQL currently can’t manage. However, even a factor of two is a lot, and what if there are multiple distinct open snapshots? Further, in the case where the bloat is created by a SQL statement that induces scattered updates throughout the table, no improvement to VACUUM can possibly help. By the time that SQL statement finishes, the damage is already done. In some sense, therefore, blaming bloat on deficiencies of VACUUM is like blaming your local garbage collector for the fact that your house is constantly surrounded by dozens of completely full trash barrels. In such a situation, it might be true that the garbage collector should come around a bit more frequently or work a little faster or harder, but maybe part of the problem is YOU. Unless your trash service is exceptionally bad, to have such a large amount of garbage around all the time, you must be generating it at an enormous rate. If you stop throwing so much stuff away, it won’t matter as much how often the garbage collector comes around. It might also help if you put all of the trash in one giant dumpster instead of many separate barrels strewn hither and yon. The problems in this area stem largely from the fact that PostgreSQL is unable to do a true, transactional UPDATE in place. Under the hood, UPDATE looks a lot like a DELETE plus an INSERT. If your table contains no free space and you execute a transaction that updates one tuple, then there are two things that are definitely true, regardless of whether the UPDATE transaction commits or aborts. The first is that the table must grow in order to accommodate the new row version, and the second is that we end up with a dead row version - either the old version ends up dead, if the transaction commits, or the new version becomes dead immediately, if the transaction aborts. Either way, the table is now a tiny bit bloated, and either way, there’s now work for VACUUM to do. This system is very much symmetric. A transaction that commits generates pretty much the same amount of work as a transaction that aborts. This is very elegant, but it’s not best in practice, because very few people run workloads where anywhere close to 50% of transactions abort. (Anyone who has such a workload will, I suspect, find that PostgreSQL handily outperforms the competition.) It would be better to have a system where we try to make commits cheaper, and aborts more expensive. That brings me to the design which EnterpriseDB is proposing. We are working to build a new table storage format for PostgreSQL, which we’re calling zheap. In a zheap, whenever possible, we handle an UPDATE by moving the old row version to an undo log, and putting the new row version in the place previously occupied by the old one. If the transaction aborts, we retrieve the old row version from undo and put it back in the original location; if a concurrent transaction needs to see the old row version, it can find it in undo. Of course, this doesn’t work when the block is full and the row is getting wider, and there are some other problem cases as well, but it covers many useful cases. In the typical case, therefore, even bulk updates do not force a zheap to grow. Instead, the undo grows. When a transaction commits, all row versions that will become dead are in the undo, not the zheap. This means that there is no need for VACUUM, or any similar process, to scan the table looking for dead rows. The only time the table contains dead rows is when a transaction aborts, and in that case we immediately use the undo to go put the old, still-living row versions back where they were. That process is targeted and does not involve needing to scan the entire table. When a transaction commits, we can eventually discard the undo it generated; this is a bulk operation and can be done very quickly. Handling indexes is more complex, but we believe that we can also eliminate the need for index vacuuming using much the undo infrastructure. That topic, however, would make this blog post far longer than it already is, so I will leave it for another time. There is also a great deal more detail about the design of zheap which I would like to write about, but that, too, will need to wait for another time. This post is intended to explain why we have undertaken this work, not exactly what we are doing. I realize that there is likely to be a good deal of skepticism among experienced PostgreSQL hackers about the feasibility of making this approach work. I do not claim that we have solved all of the problems, nor that success is assured. A huge amount of work remains to be done. Even if all of that work is successfully completed and even if all of the remaining problems are solved, there will probably still be cases where the existing heap outperforms zheap. That having been said, we have built enough of this new system that parts of it can be tested against the current heap implementation, and those test results look promising. In the near future, we will release the code we have so far under the PostgreSQL license so that the whole PostgreSQL community can look at it, form their own opinions, and run their own tests. Although I did much of the basic design work, the development lead for this project is Amit Kapila, who has been assisted by Dilip Kumar, Kuntal Ghosh, Mithun CY, Ashutosh Sharma, Rafia Sabih, and Beena Emerson. Thomas Munro wrote the undo storage system. Marc Linster has provided unfailing management support, and Andres Freund has provided some design input (and criticism). Thanks to all of them. (Article originally published in Robert's personal blog - January 30, 2018 @http://rhaas.blogspot.com/2018/01/do-or-undo-there-is-no-vacuum.html)
... View more
In a recent blog post , I talked about why every system that implements MVCC needs some scheme for removing old row versions, and how VACUUM meets that need for PostgreSQL. In this post, I’d like to examine the history of VACUUM improvements in recent years, the state of VACUUM as it exists in PostgreSQL today, and how it might be improved in the future. When I first began using PostgreSQL, autovacuum did not exist, and I was not aware of the need for manual VACUUM. After a few months (!), I wondered why my database was so slow. Putting a vacuumdb command in cron, scheduled to run every 6 hours, was sufficient for my needs at the time, but it only worked because my database was small and handled a limited amount of traffic. In many environments, UPDATE and DELETE operations will target some tables much more often than others, and therefore some tables will accumulate dead row versions much more quickly than others, and therefore the interval between one VACUUM and the next should also vary. If a user with this kind of environment were to run a full-database VACUUM frequently enough to meet the needs of their most heavily-updated tables, they would VACUUM lightly-updated tables far more often than necessary, wasting effort. If they were to reduce the frequency of the full-database VACUUM to avoid wasting effort, heavily updated tables wouldn’t get vacuumed often enough and their on-disk size would grow as they filled up with dead row versions, otherwise known as “bloat”. PostgreSQL 8.3 was the first release to feature a reasonably modern autovacuum. For the first time, autovacuum was enabled by default. It also featured a multi-process architecture, which meant that more than one table could be automatically vacuumed at the same time. And, of course, autovacuum takes care of figuring out which tables are due to be vacuumed, largely (though not completely) eliminating the need to schedule vacuums manually. This was a huge step forward; for the first time, not every user of PostgreSQL needed to worry about configuring VACUUM when they configured PostgreSQL. On a good day, it worked perfectly out of the box. PostgreSQL 8.4 featured two massive further improvements. In older releases, there was a fixed-size free space map which could be resized only by changing a configuration parameter and restarting the server. If the number of pages with free space in the database exceeded the configured size of the free space map, the server would lose track of some pages containing free space, typically causing runaway database bloat. PostgreSQL 8.4 introduced a new, dynamically sized free space map which never lost track of free space. It also added a visibility map, which allowed VACUUM to avoid scanning unchanged portions of the table all over again during each new VACUUM. It was, however, still necessary to scan the entirety of every index every time, and it was periodically necessary to scan the whole table in order to prevent transaction ID wraparound. VACUUM for the purpose of preventing transaction ID wraparound is typically rare compared to ordinary VACUUM; the former is triggered, by default, every 200 million write transactions, while the latter is triggered, by default, when the number of dead tuples in the table exceeds 500 + 20% of the estimated table number of tuples in the table. In each case, these values can be changed by adjusting configuration parameters. Several of the improvements since PostgreSQL 8.4 have focused on trying to prevent VACUUM from getting stuck . The basic problem, which exists in multiple variants, is that if an autovacuum process waits for some lock which doesn’t become available for a long time, then (1) the table that it is trying to VACUUM doesn’t get vacuumed by that process or any other, since only one process can vacuum a table at a time and (2) there is one fewer autovacuum worker available to run VACUUM on any other table in the system, which may mean that other tables don’t get promptly vacuumed either. Both problems can lead to bloat. PostgreSQL 9.1 improved things a bit by teaching autovacuum workers to skip tables that were due for vacuuming but on which they couldn’t immediately get a relation lock; since autovacuum retries every minute, this doesn’t hurt anything. PostgreSQL 9.2 improved things so that the system skipped individual table blocks on which it could not immediately obtain a cleanup lock, except for cases where the block contained tuples that needed to be removed or frozen and the vacuum was for the purpose of preventing transaction ID wraparound. PostgreSQL 9.5 reduced the number of cases in which btree index scans retain a pin on the last-accessed index page, which eliminates most cases of VACUUM getting stuck waiting for an index scan. As this list of improvements makes clear, the problem of index scans getting stuck has not been entirely solved, but we have steadily whittled away at the list of cases where it happens. There have also been a few improvements that aim to reduce the number of times that VACUUM scans each heap page. Prior to the PostgreSQL 9.3 , a table page that was not modified between one VACUUM and the next would be marked all-visible by the second VACUUM, and skipped by future VACUUMs, except those triggered to prevent transaction ID wraparound. In PostgreSQL 9.6 , it became possible to skip pages even during a VACUUM triggered to prevent transaction ID wraparound. To do this, the visibility map was further improved to keep track not only of whether pages were all-visible -- that is, known not to contain any dead row versions -- but also whether they were all-frozen -- that is, known not to contain any tuples at risk of eventual transaction ID wraparound. Pages in the latter category can be skipped unconditionally; they cannot be interesting to VACUUM for any purpose. In addition to what is mentioned above, there has been a steady stream of logging improvements and a few other efficiency improvements over the years. What is left to be done? The PostgreSQL development community has made great progress in reducing the degree to which VACUUM performs unnecessary scans of table pages, but basically no progress at all in avoiding unnecessary scanning of index pages. For instance, even a VACUUM which finds no dead row versions will still scan btree indexes to recycle empty pages. Efforts are underway to improve this , but they have stumbled over some details of how btree indexes interact with transaction ID wraparound. It would also be desirable to improve the behavior in the case where there are a few, but not very many, dead row versions. If a 1TB table contains 10 dead row versions, especially if they are all on the same page, scanning the entirety of every index in order to remove index pointers for those row versions likely doesn’t make sense. This problem is, however, mitigated by the fact that autovacuum generally won’t trigger in the first place in cases where there are only a few dead row versions. VACUUM remains a bulk operation. In some ways, this is good, because batching large numbers of related operations together into a single bulk operation tends to be more efficient than performing them one by one. However, it also means that when VACUUM engages, especially after a long period of time during which it has not run, the resource utilization can come as a surprise, and on large tables, it can last for a long time. It seems worth thinking about whether the VACUUM of a large table could be split into a series of smaller steps. The default settings for cost-based VACUUM delay are too small for large databases. By default, autovacuum runs with a 20ms cost-delay, and the default settings of vacuum_cost_limit, vacuum_cost_page_dirty, and vacuum_cost_page_miss limit VACUUM to about 28.8 GB/hour when scanning table or index pages that do not need cleaning and about 9.6 GB/hour when scanning table or index pages all of which need cleaning. For tables that are hundreds of gigabytes or terabytes in size, this can make VACUUM take too long; if the table is due to be vacuumed again before VACUUM completes, it will bloat. Also, VACUUM has no ability to tailor its activity based on system busy and idle periods, because it has no knowledge of when those periods occur. Ideally, a system that is otherwise idle would perform any vacuuming that is in progress, or even due soon, at maximum speed, and then proceed more slowly when the system is under a foreground load. However, there’s no scheduling system to make such a thing happen; if you want this type of behavior, you must configure it manually. If I were to summarize all of the above in one sentence, it might be this: VACUUM has come a long way, but it still needs monitoring and management. Small installations - like the first ones I used - running a current version of PostgreSQL may never need to do anything about VACUUM at all, and even larger installations will typically experience fewer problems today than they would have with earlier releases. At the same time, for large installations requiring consistently good performance, VACUUM typically requires at least some monitoring, management, and tuning. Time will tell what improvements future releases will bring.
... View more
Experienced PostgreSQL users and developers rattle off the terms “MVCC” and “VACUUM” as if everyone should know what they are and how they work, but in fact many people don’t. This blog post is my attempt to explain what MVCC is and why PostgreSQL uses it, what VACUUM is and how it works, and why we need VACUUM to implement MVCC. In addition, I’ll include a few useful links for further reading for those who may want to know more. PostgreSQL is a relational database and therefore attempts to provide transactions which have the “ACID” properties: atomicity, consistency, isolation, and durability. Let’s ignore consistency and durability for the moment. Atomicity means that transactions are “all or nothing”; either the transaction commits and all of its changes take effect, or it aborts and none of the changes take effect. Isolation means that each transaction should be unaware that there are other transactions executing concurrently. For example, if transaction A updates two rows with a = 1 to have a = 2 and concurrent transaction B deletes all rows with a = 2, it should never happen that one of the two rows updated by A gets deleted and the other does not. Either A “happens first” and thus both rows are deleted, or B “happens first” and thus neither row is deleted. Taken together, this means that the database would like to give the user the impression that no concurrent changes are taking place while their transaction is in progress, and that when their transaction commits, all of the changes it made become instantaneously and simultaneously visible to everyone. Relational databases in general - not PostgreSQL specifically - have settled on two ways of doing this. The first is two-phase locking: a transaction takes a shared lock on every bit of data it reads and an exclusive lock on every bit of data that it writes. When it commits, it releases those locks; if it aborts, it reverses the changes that it made and then releases locks. Because any modified data is protected by an exclusive lock, no concurrent transaction will see that change until it is committed. Meanwhile, readers can coexist, since their locks are all shared. Unfortunately, when data is both read and written, this approach has poor concurrency: readers and writers block each other, because the shared locks taken by readers conflict with the exclusive locks taken by writers. Locks are bad for concurrency, and two-phase locking requires exclusive locks (and shared locks, for full correctness) to be held for the entire duration of the transaction. The second approach to providing transactions with atomicity and isolation is multi-version concurrency control (MVCC). The basic idea is simple: instead of locking a row that we want to update, let’s just create a new version of it which, initially, is visible only to the transaction which created it. Once the updating transaction commits, we’ll make the new row visible to all new transactions that start after that point, while existing transactions continue to see the old row. That way, every transaction sees what is in essence a consistent snapshot of the database. Finally, when there are no transactions remaining that can see the old version of the row, we can discard it. This approach has much better concurrency than two-phase locking, which is why it is used by many modern database systems. However, we’ve exchanged our concurrency problem for a space management problem. With two-phase locking, a row that gets updated can be updated in place; because the updating transaction holds an exclusive lock, nobody else can see the modifications until the lock is released. With MVCC, this won’t work, since we need to keep both versions at least for a short while (or perhaps for a long while, if there are long-running concurrent transactions). Either the old row version must be saved elsewhere, and the new version put in its place, or the old row version must be left untouched, and the new version stored elsewhere. This is an area where different database systems have adopted different approaches. PostgreSQL’s approach to this problem is to store new row versions created by UPDATE in basically the same way that it would store a completely new row generated by an INSERT. Once an UPDATE commits, future SQL statements (or future transactions, depending on your choice of isolation level) will see the new row versions, and the old ones will be visible only to statements (or transactions) which are already running. Eventually, there will no longer be anything running which can see the old row versions, at which point they are “dead”. Similarly, when a DELETE commits, there will eventually be nothing running which can see the old row versions, at which point they are likewise “dead”. Similarly, if an INSERT aborts, the new row versions which it inserted go from being visible only to the inserting transaction to being visible to nobody at all, and are thus “dead”, and likewise when an UPDATE aborts, any new row versions which it created are “dead”. The Heroku Dev Center has a well-written article explaining some details of how this works under the hood. And that brings us to the need for VACUUM. Any system which uses MVCC to implement transaction isolation must have some way of getting rid of dead row versions. If it did not, the number of dead row versions would grow without bound, and therefore the database size would grow without bound, which would be bad. At some point, we must get rid of the dead row versions so that the space can be reused. In PostgreSQL, this is VACUUM’s job. Some other systems use different cleanup mechanisms for different kinds of dead row versions; for example, dead row versions created by aborted transactions may be handled differently from dead row versions created by committed transactions. In PostgreSQL, we handle of these in the same way. It bears mentioning that in certain cases, dead row versions can be cleaned up using a page-at-a-time mechanism called HOT pruning (for technical details, see README.HOT). These cases are actually quite common; however, HOT pruning is a “best effort” strategy, and VACUUM is the backstop that makes sure the work gets done and completes whatever cleanup can’t be done by HOT pruning. (This brief explanation omits many interesting details, but this post is already getting quite long.) So, here’s how VACUUM works. First, it scans every page in the table (also known as the “heap”) that might possibly contain dead row versions. A data structure called the visibility map keeps track of which pages have been modified since the last VACUUM; those that have not been may be skipped. As it does, it removes dead row versions from those pages and makes the space occupied by those tuples available for reuse. During this process, dead row versions are replaced with a sort of tombstone - in technical terms, the line pointer that pointed to the row version which was removed is marked LP_DEAD. If no dead tuples are found during this stage, VACUUM stops. Otherwise, it scans every index on the table and removes all index entries which point to the dead line pointers identified during the first phase. Once this phase is complete, it goes back to the table and removes the tombstones - in technical terms, the LP_DEAD line pointers are marked LP_UNUSED. Once this is done, those line pointers can be reused for new tuples. If those line pointers were reused before the index entries were removed, we would potentially end up with index entries that don’t match the row versions to which they point, which could cause queries to return wrong answers. The previous paragraph used the term “line pointer”, but did not define it. Each page begins with an array of line pointers which identify the starting position of each tuple on the page, and the length of that tuple. There are a few special kinds of line pointers, including LP_UNUSED (which means that the array slot is available to be used for a new tuple), LP_DEAD (which, as noted above, means that the tuple has been removed from the page but the line pointer slot isn’t available to be reused yet), and LP_REDIRECT (which is related to HOT, but beyond the scope of this blog post). In short, VACUUM scans the parts of the table that have changed since the last VACUUM, then scans the indexes in their entirety, then scans the changed parts of the table again. By doing so, it ensures that dead row versions and the index entries which referenced them are removed, and that the line pointers which referenced those dead row versions are made available for reuse. Some final notes: 1. An ordinary VACUUM is very different from VACUUM FULL, which actually rewrites the whole table.
2. In addition to its function in removing dead tuples, VACUUM also prevents transaction ID wraparound, updates statistics used by the query planner, and updates the visibility map. Updating the visibility map is important not only to make future VACUUM operations more efficient, but also to make index-only scans work better. This is all covered in the documentation. 3. Generally, you don’t need to run VACUUM by hand, because the autovacuum daemon will arrange to run it automatically. However, there are definitely times when it’s useful to run VACUUM by hand. This can be a good idea, for example, after a bulk load, or during a low-usage time of day, to reduce the need for later vacuuming during high-usage periods.
(Article originally published on Roberts's blog on Monday, December 18, 2017 @ http://rhaas.blogspot.com/2017/12/mvcc-and-vacuum.html#more)
... View more
Tuesday, September 26, 2017
Since I just committed the last pending patch to improve hash indexes to PostgreSQL 11, and since most of the improvements to hash indexes were committed to PostgreSQL 10 which is expected to be released next week, it seems like a good time for a brief review of all the work that has been done over the last 18 months or so. Prior to version 10, hash indexes didn't perform well under concurrency, lacked write-ahead logging and thus were not safe in the face either of crashes or of replication, and were in various other ways second-class citizens. In PostgreSQL 10, this is largely fixed.
While I was involved in some of the design, credit for the hash index improvements goes first and foremost to my colleague Amit Kapila, whose blog entry on this topic is worth reading . The problem with hash indexes wasn't simply that nobody had bothered to write the code for write-ahead logging, but that the code was not structured in a way that made it possible to add write-ahead logging that would actually work correctly. To split a bucket, the system would lock the existing bucket (using a fairly inefficient locking mechanism), move half the tuples to the new bucket, compact the existing bucket, and release the lock. Even if the individual changes had been logged, a crash at the wrong time could have left the index in a corrupted state. So, Amit's first step was to redesign the locking mechanism. The new mechanism allows scans and splits to proceed in parallel to some degree, and allows a split interrupted by an error or crash to be completed at a later time. Once that was done a bunch of resulting bugs fixed, and some refactoring work completed, another patch from Amit added write-ahead logging support for hash indexes. In the meantime, it was discovered that hash indexes had missed out on many fairly obvious performance improvements which had been applied to btree over the years. Because hash indexes had no support for write-ahead logging, and because the old locking mechanism was so ponderous, there wasn't much motivation to make other performance improvements either. However, this meant that if hash indexes were to become a really useful technology, there was more to do than just add write-ahead logging. PostgreSQL's index access method abstraction layer allows indexes to retain a backend-private cache of information about the index so that the index itself need not be repeatedly consulted to obtain relevant metadata. btree and spgist indexes were using this mechanism, but hash indexes were not, so my colleague Mithun Cy wrote a patch to cache the hash index's metapage using this mechanism . Similarly, btree indexes have an optimization called "single page vacuum" which opportunistically removes dead index pointers from index pages, preventing a huge amount of index bloat which would otherwise occur. My colleague Ashutosh Sharma wrote a patch to port this logic over to hash indexes , dramatically reducing index bloat there as well. Finally, btree indexes have since 2006 had a feature to avoid locking and unlocking the same index page repeatedly -- instead, all tuples are slurped from the page in one shot and then returned one at a time. Ashutosh Sharma also ported this logic to hash indexes , but that optimization didn't make it into v10 for lack of time. Of everything mentioned in this blog entry, this is the only improvement that won't show up until v11. One of the more interesting aspects of the hash index work was the difficulty of determining whether the behavior was in fact correct. Changes to locking behavior may fail only under heavy concurrency, while a bug in write-ahead logging will probably only manifest in the case of crash recovery. Furthermore, in each case, the problems may be subtle. It's not enough that things run without crashing; they must also produce the right answer in all cases, and this can be difficult to verify. To assist in that task, my colleague Kuntal Ghosh followed up on work initially begun by Heikki Linnakangas and Michael Paquier and produced a WAL consistency checker that could be used not only as a private patch for developer testing but actually committed to PostgreSQL . The write-ahead logging code for hash indexes was extensively tested using this tool prior to commit, and it was very successful at finding bugs. The tool is not limited to hash indexes, though: it can be used to validate the write-ahead logging code for other modules as well, including the heap, all index AMs we have today, and other things that are developed in the future. In fact, it already succeeded in finding a bug in BRIN . While wal_consistency_checking is primarily a developer tool -- though it is suitable for use by users as well if a bug is suspected -- upgrades were also made to several tools intended for DBAs. Jesper Pedersen wrote a patch to upgrade the pageinspect contrib module with support for hash indexes , on which Ashutosh Sharma did further work and to which Peter Eisentraut contributed test cases (which was a really good idea, since those test cases promptly failed, provoking several rounds of bug-fixing). The pgstattuple contrib module also got support for hash indexes , due to work by Ashutosh Sharma. Along the way, there were a few other performance improvements as well. One thing that I had not realized at the outset is that when a hash index begins a new round of bucket splits, the size on disk tended to abruptly double, which is not really a problem for a 1MB index but is a bit unfortunate if you happen to have a 64GB index. Mithun Cy addressed this problem to a degree by writing a patch to allow the doubling to be divided into four stages , meaning that we'll go from 64GB to 80GB to 96GB to 112GB to 128GB instead of going from 64GB to 128GB in one shot. This could be improved further, but it would require deeper restructuring of the on-disk format and some careful thinking about the effects on lookup performance. A report from a tester who goes by "AP" in July tipped us off to the need for a few further tweaks. AP found that trying to insert 2 billion rows into a newly-created hash index was causing an error. To address that problem, Amit modified the bucket split code to attempt a cleanup of the old bucket just after each split , which greatly reduced the accumulation of overflow pages. Just to be sure, Amit and I also increased the maximum number of bitmap pages , which are used to track overflow page allocations, by a factor of four. While there's always more that can be done , I feel that my colleagues and I -- with help from others in the PostgreSQL community -- have accomplished our goal of making hash indexes into a first-class feature rather than a badly neglected half-feature. One may well ask, however, what the use case for that feature may be. The blog entry from Amit to which I referred (and linked) at the beginning of this post shows that even for a pgbench workload, it's possible for a hash index to outperform btree at both low and high levels of concurrency. However, in some sense, that's really a worst case. One of the selling points of hash indexes is that the index stores the hash value, not the actual indexed value - so I expect that the improvements will be larger for wide keys, such as UUIDs or long strings. They will likely to do better on read-heavy workloads, as we have not optimized writes to the same degree as reads, but I would encourage anyone who is interested in this technology to try it out and post results to the mailing list (or send private email), because the real key for a feature like this is not what some developer thinks will happen in the lab but what actually does happen in the field. In closing, I'd like to thank Jeff Janes and Jesper Pedersen for their invaluable testing work, both related to this project and in general. Getting a project of this magnitude correct is not simple, and having persistent testers who are determined to break whatever can be broken is a great help. Others not already mentioned who deserve credit for testing, review, and general help of various sorts include Andreas Seltenreich, Dilip Kumar, Tushar Ahuja, Álvaro Herrera, Michael Paquier, Mark Kirkwood, Tom Lane, and Kyotaro Horiguchi. Thank you, and thanks as well to anyone whose work should have been mentioned here but which I have inadvertently omitted.
(Article originally published on Roberts's blog - Tuesday, September 26, 2017 @ http://rhaas.blogspot.com/2017/09/postgresqls-hash-indexes-are-now-cool.html#more)
... View more
For questions 1-3, Vibhor Kumar has posted a blog some time back on encryption in PostgreSQL, found here: https://www.enterprisedb.com/blog/postgres-and-transparent-data-encryption-tde. Regarding questions 4-6, both PostgreSQL and EDB Postgres will automatically negotiate the highest TLS version that is supported both by the OpenSSL library against which the server is linked and the client. Therefore, TLS 1.2 should be used automatically. There is no configuration to explicitly choose the TLS version, although you can restrict particular ciphers by changing the ssl_ciphers setting in postgresql.conf.
... View more
Barring unexpected problems, PostgreSQL 10 is expected to be released this Thursday (October 5th). Range and list partitioning are supported, but not hash partitioning, interval partitioning, or reference partitioning. Hash partitioning is however supported in Advanced Server and has been for several years.
... View more
You mention that you are "unable to vaccum this table alone". What happens when you try? Plain old VACUUM shouldn't require any extra disk space, so the fact that the amount of free disk space remaining is smaller than the table shouldn't be a problem. VACUUM FULL does require extra disk space, but see http://rhaas.blogspot.com/2014/03/vacuum-full-doesnt-mean-vacuum-but.html If you need to use VACUUM FULL but don't have enough disk space for the operation to complete, one thing you could try doing is removing all of the table's indexes before running VACUUM FULL, and then recreate them after the operation is done. If that still doesn't work, then you may need to consider adding more disk space. There is also a utility called pg_repack available, but EnterpriseDB does not support it.
... View more
Monday, August 14, 2017 The reaction to the new table partitioning feature in PostgreSQL 10 has been overwhelmingly positive, but a number of people have already astutely observed that there is plenty of room for improvement. PostgreSQL 10, already at beta3, will go GA some time in the next month or two, and presumably once it does, more people will try out the new feature and find things they would like to see improved. The good news is that there is already a substantial list of people, many of them my colleagues at EnterpriseDB, working on various improvements. The following list of projects is complete to my knowledge, but of course there may be projects of which I'm unaware especially at companies where I don't work. If you're working on something else, please leave a comment about it! Amit Langote (NTT), the original author of the table partitioning feature, has just recently posted an important series of patches to expand inheritance hierarchies in bound order (hereinafter, EIBO). There are only minimal directly user-visible effects of this change: EXPLAIN output may be ordered differently, and results may show up in a different order (unless constrained using an ORDER BY clause). However, this change will make it much easier to implement various other desirable partitioning features, as will become clear from the discussion that follows. One such change is to allow bound-based partitioning pruning, a feature for which I understand that Amit is planning to post a patch in the near future. Currently, partitions are eliminated one-by-one by comparing the partition constraint (and any other constraints) against the query to determine which partitions can't possibly contain any matching rows. This means that the time to prune partitions is proportional to the number of partitions. With this feature, we'll be able to pick the exact partitions we want to keep and discard the rest in bulk, making this (hopefully) a lot faster. Amit has also submitted a number of other patches around partitioning. One of these will re-enable INSERT .. ON CONFLICT DO NOTHING for partitioned tables; that was briefly enabled during v10 development but didn't end up enabled in the final version. Another, to which Ashutosh Bapat also contributed, will improve \d+ output for partitioned tables. Beena Emerson (EnterpriseDB) is currently developing a prototype patch for execution-time partition pruning. Currently, partitions are pruned only at plan time, which means that when the value of the partitioning key isn't known until execution time, no pruning is possible. For example, consider a nested-loop join between two partitioned tables where the inner table is index-scanned using a value taken from the outer table; suppose the indexed column is also the partition key. Right now, we'll index-scan every partition for matches, even though there's only one partition that could conceivably contain a match. Index scans returning no rows are pretty fast, but skipping them altogether should be a lot faster, especially when there are many partitions. Prepared queries using generic plans will also benefit; in such cases, the planner doesn't know what value will be supplied and therefore can't prune, but at execution time the value will be known and pruning will be possible. Since the idea is to make run-time partition pruning bound-based, it will require EIBO. Amit Khandekar (EnterpriseDB) has written a patch to allow updates that move rows between partitions, which I think will be an extremely popular feature. Essentially, such updates will be broken into a DELETE operation followed by an INSERT operation. Unfortunately, at least at the READ COMMITTED isolation level, this will create new kinds of serialization anomalies that don't exist at present. This can't be completely prevented because the EvalPlanQual mechanism which PostgreSQL uses in such cases depends on each tuple in an update chain pointing to the next-newer version of that tuple using a block number and offset within the block, and there's no space to store the ID of another relation entirely in the on-disk format. However, what we can do is detect the problem and throw an error; one of my colleagues will tackle that problem if nobody else beats us to it. This patch also depends on EIBO getting committed. Amul Sul (EnterpriseDB) has written a patch to allow hash partitioning. Initial discussion of this patch revealed that the PostgreSQL community would like to see a number of other things done first. For example, bound-based partition pruning should be committed first, because constraint exclusion won't work for hash partition constraints, and we want hash partitioning to have working partition pruning. Another problem is that the hash functions which PostgreSQL uses might not always be completely portable. Integers, for example, hash differently on big-endian and little-endian platforms, and strings might hash differently under different encodings. To help work around this problem, Rushabh Lathia (EnterpriseDB) has written a patch for pg_dump that makes it possible to dump all the partitions in such a way that the data will be reloaded into the root of the partitioning hierarchy. This will allow tuple-routing to restore that row into the partition to which it now belongs rather than the one to which it belonged before. This isn't only useful once we have hash partitioning: it could also happen with range partitioning on a text column if the collation definitions are different on the source and target systems. The problem will probably be more likely to arise with hash partitioning, though. There are a few other problems, too. Right now, the proposed patch for hash partitioning uses the same hash functions that we use for other purposes, such as hash indexes, hash joins, and hash aggregates. However, it would be better to use a different function, because otherwise a hash index built on a hash partition will have mostly-empty buckets; and similar problems can occur in other cases where hash functions are used. If we're installing new hash functions anyway, we might want to consider making them faster or more portable, or define operator families that are consistent across more types. Jeff Davis (whose corporate affiliation on PostgreSQL's contributors page appears to be out of date) was the first to point out these problems, and may do some work on them. I am thinking about them as well. Another frequently-requested feature is default partitioning. Rahila Syed (EnterpriseDB) and Jeevan Ladhe (EnterpriseDB) have written a patch to implement default partitioning for list-partitioned tables, and Beena Emerson has written a patch which extends this to range-partitioned tables. A default partition accepts all rows that don't map to any other partition. Note that a range-partitioned table may have multiple ranges not covered by any other partition, and a default partition will accept rows that fall into any of those ranges; this feature couldn't exist in the form proposed here on databases that don't allow "gaps" in the key space not covered by any partition. Maksim Milyutin (Postgres Pro) has started to tackle the important problem of defining indexes on partitioned tables. In v10, while individual partitions can have indexes, there is no such thing as an index on the partitioned table itself. However, it would be convenient for users to be able to define an index on the partitioned table and have it cascade down to all the partitions, creating a matching index on each one. That way, new partitions would automatically end up with the correct indexes, and generally a lot of typing would be saved. While foreign tables can be partitions in v10, they are currently subject to an unfortunate limitation: tuple inserts that route to a foreign table will fail, since this feature has not been implemented yet. Etsuro Fujita (NTT) is working to address this limitation by revising code originally written by Amit Langote. Ashutosh Bapat (EnterpriseDB) is continuing to work on partition-wise join, which was originally submitted for PostgreSQL 10 but wasn't ready to commit as of the feature freeze date. This feature allows a join between two compatibly-partitioned tables to be done partition-by-partition, which can sometimes be a lot faster. The initial version of this feature is likely to leave out a lot of important things, such as optimizations to save memory, and being able to interleave partitioned and unpartitioned joins. Both Ashutosh and I would like to see such optimizations included eventually; but even without them, this is a very powerful feature. This feature doesn't have EIBO has a hard prerequisite, but it will be faster and cleaner if EIBO is done first. Ashutosh also has a patch for an optimization of the existing constraint exclusion logic which could speed up partition pruning when no partitions at all need to be scanned. Antonin Houska (Cybertec) and Jeevan Chalke (EnterpriseDB) have both done work on partition-wise aggregate, which means aggregating the partitions of a table one-by-one instead of aggregating all at once. If the partition key matches the grouping key, every partition will produce a discrete set of groups, and this technique should almost always be a win. If not, it can still be used by leveraging the multi-step aggregation infrastructure that is currently used for parallel aggregate: first, perform a PartialAggregate step on each partition, and then collect the results and perform a FinalizeAggregate. This will tend to win when individual partitions contain many values with the same grouping key, but it will be inferior to the straightforward approach when each partition contains, say, only a single value in each output group. Ronan Dunklau (Dalibo) and Julien Rouhaud (Dalibo) have another interesting idea for which they've got a draft patch: if we need ordered data from a partitioned table, and the ordering matches the partitioning scheme, we could skip the MergeAppend that is normally needed to produce ordered data in such circumstances. Instead, we could just scan the partitions in the correct order and do a simple Append of the results. I suspect that EIBO and maybe partition-wise join will make this easier to implement. While those are all of the things I know about right now, I think it's nearly certain that many other patches for partitioning improvements will show up as we get deeper into the release cycle; remember, all of these patches showed up before v11 development even started (that's due to happen any hour now). I'm quite excited to see how partitioning gets better in v11, v12, and beyond, and I expect that I and my colleagues at EnterpriseDB will be continuing to put a lot of effort in this for quite a while. We'll also be doing our best to help review patches from others in the area of partitioning, and suggestions and feedback are most welcome.
... View more
(Originally published on 8/3/2016) Last week, a blog post by an Uber engineer explained why Uber chose to move from PostgreSQL to MySQL. This article was widely reported and discussed within the PostgreSQL community, with many users and developers expressing the clear sentiment that Uber had indeed touched on some areas where PostgreSQL has room for improvement. I share that sentiment. I believe that PostgreSQL is a very good database, but I also believe there are plenty of things about it that can be improved. When users - especially well-known names like Uber - explain what did and did not work in their environment, that helps the PostgreSQL community, and the companies which employ many of its active developers, figure out what things are most in need of improvement. I'm happy to see the PostgreSQL community, of which I am a member, reacting to this in such a thoughtful and considered way. Having read the blog post a few times now, I keep finding myself wanting just a bit more detail. I've come to the conclusion that one of the big problems is that Uber really wanted logical replication rather than physical replication, but logical replication is not yet available in PostgreSQL core. Uber mentions that they considered pglogical, but it was not available on the server version they were using, which is somewhat older. They do not mention whether they tried any of the out-of-core replication solutions which are available, such as Slony, Bucardo, or Londiste; or, on the proprietary side, EnterpriseDB's xDB Replication Server. It would be great, if they did try any of those, to hear about their experiences. Regardless, I think that virtually the entire PostgreSQL community is in agreement that an in-core logical replication solution is important; I first blogged about this back in 2011, and we made great progress with PostgreSQL 9.4, which introduced logical decoding, but the full solution is still not there. Logical replication would, I believe, address Uber's concerns about the size of the change stream which they would need to send over their wide area network, cross-version replication, propagation of physical corruption, and query cancellation on replicas. Some of these issues can be addressed in other ways. For example, query cancellation on replicas can be prevented by configuring hot_standby_feedback=on, and I can't tell from the blog post whether Uber's engineers were aware of this feature. Write-ahead log files can be compressed before transmitting them over a wide area network, either by using SSL (which can do compression as part of the protocol) or by using archive_command to apply an arbitrary compression command (e.g. gzip) to each 16MB segment before transmitting it. Again, this may have been tried and been found ineffective or insufficient, but it would be nice to hear more of the details. For major version upgrades, they mention that pg_upgrade took many hours to complete in their environment, which is not supposed to happen. A number of performance problems with pg_upgrade related to the handling of databases containing large number of objects, have been fixed in newer releases, so things might have been better for them on 9.4 or 9.5, but without more details it's hard to know - and it would be good to know, because problems that can be understood can be fixed. Perhaps the thorniest problem which Uber raises is that of write amplification caused by secondary index updates. As the Uber post explains, there are some basic differences between the way PostgreSQL organizes data and the way MySQL organizes data, which I've written about before. PostgreSQL performs some update as "HOT" updates and the rest as "non-HOT". HOT updates touch no indexes, while a non-HOT update must touch every index. Therefore, the percentage of updates which are HOT has a critical impact on update performance, and any update that touches an indexed column is always non-HOT. For this reason, many experienced PostgreSQL users are quite cautious about adding indexes, since it is quite easy to create a great deal of additional index maintenance - both index insertions and later VACUUM activity - if the percentage of non-HOT updates falls. For most users, being conservative in adding indexes is sufficient to avoid this problem, but Uber isn't the only company to have problems of this type. In InnoDB, or any other system where the primary key index is a clustered index and secondary indexes reference the primary key rather than the physical tuple position, secondary index updates are less frequent than they are in PostgreSQL. This approach is not without disadvantages - for example, it may be space-inefficient if the primary key is very wide and every secondary index access must also traverse the clustered index - but it's unquestionably got some benefits, especially for tables that are both heavily indexed and heavily updated. What makes this problem thorny is that the PostgreSQL tuple format is pretty deeply embedded throughout the system, and it presupposes the current model for handling updates. We might be able to do better - HOT, itself, was an optimization as compared with our old strategy of ALWAYS updating every index - but I believe, and have believed for a while, that PostgreSQL also needs to be open to the idea of alternate on-disk formats, including but not limited to index-organized formats. In the past, there has been considerable skepticism within the PostgreSQL community about this kind of development direction, not only because developing a new on-disk format is hard, but also because of fears that the alternate formats would proliferate, dividing the community's development resources and perhaps even leading to multiple forks. While I'd like to hear more about Uber's experience - exactly how much secondary index traffic did they experience and were all of those secondary indexes truly necessary in the first place? - I view this overall as another piece of evidence suggesting that we need to look very hard at making our storage layer pluggable so that it's easier to experiment with new designs in this area. We may be able to get a bit more mileage out of extending the HOT optimization, but I think to really solve the problem is going to require something altogether new. Even after as many years as I've been working on PostgreSQL, and currently that's about 8, it's always amazing how much more work there is to be done. We've come a very long way in that time, adding innumerable performance features, replication features, and others. But every time I think that we've just about overcome all of the really serious problems, a new generation of problems shows up, desperately needing developer attention. Once more unto the breach! [After writing this blog post but before publishing it, I became aware that Markus Winand had already written an excellent post on this topic, which is very much worth reading.]
... View more
We support quite a large amount of additional syntax both for compatibility with Oracle syntax and also in order to provide additional features not present in PostgreSQL, such as resource groups. Because our list of features is extensive, the changes required to provide those features are also extensive. However, we try very hard not to damage the stability of existing PostgreSQL code. Obviously, nothing is perfect, but most of the EPAS-specific bug reports we get are bugs in the features we've added rather than bugs in the original PostgreSQL code, which seems to imply that we've been pretty successful keeping the code we inherit from PostgreSQL stable.
... View more
4/10/2017 The list of new features coming in PostgreSQL 10 is extremely impressive. I've been involved in the PostgreSQL project since the 8.4 release cycle (2008-2009), and I've never seen anything like this. Many people have already blogged about these features elsewhere; my purpose here is just to bring together a list of the features that, in my opinion, are the biggest new things that we can expect to see in PostgreSQL 10. [Disclaimers: (1) Other people may have different opinions. (2) It is not impossible that some patches could be reverted prior to release. (3) The list below represents the work of the entire PostgreSQL community, not specifically me or EnterpriseDB, and I have no intention of taking credit for anyone else's work.] Headline Features Declarative Partitioning. In previous versions of PostgreSQL, PostgreSQL supported only table inheritance, which could be used to simulate table partitioning, but it was complicated to set up and the performance characteristics were not that great. In PostgreSQL 10, it's possible to do list or range partitioning using dedicated syntax, and INSERT performance has been greatly improved. There is still a lot more work to do in future releases to improve performance and add missing features, but even what we have in v10 is already a major step forward (IMHO, anyway). Logical Replication. PostgreSQL has had physical replication -- often called streaming replication -- since version 9.0, but this requires replicating the entire database, cannot tolerate writes in any form on the standby server, and is useless for replicating across versions or database systems. PostgreSQL has had logical decoding -- basically change capture -- since version 9.4, which has been embraced with enthusiasm, but it could not be used for replication without an add-on of some sort. PostgreSQL 10 adds logical replication which is very easy to configure and which works at table granularity, clearly a huge step forward. It will copy the initial data for you and then keep it up to date after that. Improved Parallel Query. While PostgreSQL 9.6 offers parallel query, this feature has been significantly improved in PostgreSQL 10, with new features like Parallel Bitmap Heap Scan, Parallel Index Scan, and others. Speedups of 2-4x are common with parallel query, and these enhancements should allow those speedups to happen for a wider variety of queries. SCRAM Authentication. PostgreSQL offers a remarkable variety of different authentication methods, including methods such as Kerberos, SSPI, and SSL certificate authentication, which are intended to be highly secure. However, sometimes users just want to use passwords managed by the PostgreSQL server itself. In existing releases, this can be done either using the password authentication type, which just sends the user-supplied password over the wire, or via the md5 authentication type, which sends a hashed and salted version of the password over the wire. In the latter approach, stealing the hashed password from the database or sniffing it on the wire is equivalent to stealing the password itself, even if you can't compute a preimage. PostgreSQL 10 introduces scram authentication, specifically SCRAM-SHA-256, which is much more secure. Neither the information which the server stores on disk nor the contents of an authentication exchange suffice for the server to impersonate the client. Of course, the substitution of SHA-256 for MD5 is also a substantial improvement. See also Michael Paquier's blog on this topic. One point to note is that, unless you are using libpq, you will not be able to use this feature unless your particular client driver has been updated with SCRAM support, so it may be a while before this feature is universally available. Executor Speedups. Substantial parts of PostgreSQL's executor have been rewritten to make expression and targetlist projection faster; just-in-time compilation will be added in a future release. Hash aggregation has been rewritten to use a more efficient hash table and store narrower tuples in it, and work has also been done to speed up queries that compute multiple aggregates and joins where one side can be proven unique. Grouping sets now support hash aggregation. While all PostgreSQL releases typically contain at least some performance improvements, the rewrite of expression and targetlist projection is a particularly large and significant improvement which will benefit many users. Durable Hash Indexes. Hash indexes in PostgreSQL have suffered from years of long neglect; the situation will be noticeably improved in v10. The most notable change is that changes to a hash index now write WAL, which means that they are crash-safe and that they are properly replicated to standbys. However, a good deal of other work has been done, including the necessary prerequisite step of revamping the bucket split algorithm to improve performance and concurrency, caching the metapage for better performance, adding page-at-a-time vacuuming, and expanding them more gradually. Amit Kapila even writes about a case where they outperformed btree indexes. While there's certainly more work to be done here, I'm excited about these improvements. ICU Collation Support. In current releases, PostgreSQL relies exclusively on the collations supplied by the operating system, but this sometimes causes problems: collation behavior often differs between operating systems, especially between Linux and Windows, and it isn't always easy to find a collation for one operating system whose behavior matches that of some collation available on another system. Furthermore, at least on Red Hat, glibc regularly whacks around the behavior of OS-native collations in minor releases, which effectively corrupts PostgreSQL's indexes, since the index order might no longer match the (revised) collation order. To me, changing the behavior of a widely-used system call in a maintenance release seems about as friendly as locking a family of angry racoons in someone's car, but the glibc maintainers evidently don't agree. (In fact, there's one discussion where it's suggested that you not use some of those interfaces at all.) libicu, on the other hand, says they care about this. But Wait, There's More! In my estimation, the features listed above are the most exciting things that users can expect in PostgreSQL 10, which is expected to be released in September. However, there are quite a few other significant features as well which could easily have qualified as headline features in a release less jam-packed than this one. Here are some of them: Extended Statistics (ndistinct, functional dependencies). If the query planner makes a bad row count estimate resulting in a terrible plan, how do you fix it? With extended statistics, you can tell the system to gather additional statistics according to parameters that you specify, which may help it get the plan right. FDW Aggregate Pushdown. In previous releases, SELECT COUNT(*) FROM foreign_tableoperated by fetching every row form the foreign table and counting them locally. That was terrible, so now it doesn't. Transition Tables. It is now possible to write a PL/pgsql AFTER STATEMENT trigger which can access all rows modified by the statement. This can be both faster and more convenient than writing an AFTER ROW trigger that is called once per row. Improved Wait Events. PostgreSQL 9.6 introduced wait event monitoring in pg_stat_activity, but only for a limited range of events. In PostgreSQL 10, you'll be able to see latch waits and I/O waits, even for auxiliary processes and unconnected background workers. New Integrity Checking Tools. You can now validate the integrity of your btree indexes using the new amcheck module. If you're a developer adding write-ahead logging to a new storage form, or a user who thinks the developers may have introduced a bug, you'll be pleased to be able to test with wal_consistency_checking. pg_dump now has better test coverage. Smarter Connection Handling. Connections through libpq can now specify multiple hosts, and you can even tell it to find you the server that is currently accepting write connections. Quorum-Based Synchronous Replication. You can now specify that a commit must be acknowledged by any K of N standby synchronous servers, improving flexibility and performance. Other Cool Stuff Many other things have also been significantly improved in this release. XMLTABLE makes querying XML data faster and easier. You can now interrogate the commit status of a transaction directly, and we've got better tracking of replication lag. psql now supports \if ... \elseif ... \else ... \endif to make scripting easier, and there are new functions and new roles to allow monitoring tools to run without superuser privileges. Encoding conversions are now faster, and so is sorting. You can compress the transaction log while streaming it. And there's more, but this blog post is too long already. If you're interested in reading even more about new features that will be coming with PostgreSQL 10, depesz blogs frequently on this topic, and so does Michael Paquier. Both have additional details on some of the features mentioned here, as well as others that may be of interest. This final note: we have had chronic problems with users erroneously believing that the pg_xlog or pg_clog directory is non-critical data, possibly because the directory names include the word "log". Those directories have been renamed to pg_wal and pg_xact, which we hope will be clearer. All SQL functions and utility names that formerly included the string "xlog", meaning the transaction log or write-ahead log, have been renamed to use "wal" instead. Conversely, the default log directory is now called log rather than pg_log so that it is looks less like an internal name. These changes will probably cause a bit of upgrade pain for some users, but we hope that they will also help users to avoid catastrophic mistakes. If you're still reading this, thanks! I think it's going to be a great release. This post originally appeared on Robert's personal blog.
... View more
3/16/2017 A recent Twitter poll asked What is your favorite upcoming feature of PostgreSQL V10? In this admittedly unscientific survey, "better parallelism" (37%) beat out "logical replication" (32%) and "native partitioning" (31%). I think it's fruitless to argue about which of those features is actually most important; the real point is that all of those are amazing features, and PostgreSQL 10 is on track to be an amazing release. There are a number of already-committed or likely-to-be-committed features which in any other release would qualify as headline features, but in this release they'll have to fight it out with the ones mentioned above. Still, I can't say that I'm unhappy that better parallelism ended up being the most popular choice in the poll mentioned above, and I'd just like to highlight a few of the current and proposed improvements that will or may appear in this release. Already Committed These improvements have been committed, and will appear in PostgreSQL 10 unless, due to some issue, it becomes necessary to revert them. Parallel Bitmap Heap Scan (Dilip Kumar). The driving table of a parallel query must be scanned with a parallel-aware scan type capable of partitioning the scan among the available workers; in PostgreSQL 9.6, the only type of parallel-aware scan was a parallel sequential scan, which means that you had to choose between using parallelism and using your indexes. With a parallel bitmap heap scan, one process scans the index and builds a data structure in shared memory indicating all of the heap pages that need to be scanned, and then all cooperating processes can perform the heap scan in parallel. It would be nice to be able to perform the index scan in parallel, too, but that will have to wait for a future release. Parallel Index Scan (Rahila Syed, Amit Kapila, Robert Haas), Parallel Index-Only Scan (Rafia Sabih). Infrastructure has been added to allow index access methods to support parallel scans, the btree access method has been modified to use this infrastructure, and the optimizer and executor have been taught to use these capabilities for index scans and index only scans. Therefore, it's now possible for the driving table to be scanned using an index-scan or an index-only scan. I suspect that, because of the way this is implemented, the maximum degree of useful parallelism will be less in this case than for a parallel bitmap heap scan or parallel sequential scan. It's also true that index scans tend to lose to bitmap scans when the number of rows being returned is large, and if the number of rows being returned from the driving table isn't large, you probably aren't getting a lot of benefit out of parallel query. Still, these changes have led to large improvements on certain queries. Gather Merge (Rushabh Lathia). The Gather node introduced in PostgreSQL 9.6 gathers results from all workers in an arbitrary order. That's fine if the data that the workers were producing had no particular ordering anyway, but if each worker is producing sorted output, then it would be nice to gather those results in a way that preserves the sort order. This is what Gather Merge does. It can speed up queries where it's useful for the results of the parallel portion of the plan to have a particular sort order, and where the parallel portion of the plan produces enough rows that performing an ordinary Gather followed by a Sort would be expensive. Parallel Merge Join (Dilip Kumar). In PostgreSQL 9.6, only hash joins and nested loops can be performed in the parallel portion of a plan. In PostgreSQL 10, merge joins can also be performed in the parallel portion of the plan. In some sense, these are not truly parallel hash joins, because while each participant sees only a subset of the rows on the outer side of the join, each participant must visit every row on the inner side of the join; otherwise, the join results might be incomplete. This means that the work on the outer side of the join is divided among the participants, but the work on the inner side of the join is duplicated by every participant. Better strategies are possible, and have been documented in the academic literature on parallel query, but this still sometimes improves on what was possible in PostgreSQL 9.6, especially because parallel index scan provides a way of getting result in sorted order without needing an explicit sort. Subplan-Related Improvements (Amit Kapila). In PostgreSQL 9.6, subplans were not passed from the leader to the workers, and therefore any table which appeared in the query plan with an associated subquery was not a candidate for parallelism. In PostgreSQL 10, this restriction has been slightly relaxed. Now, a table with an uncorrelated subplan can appear in the parallel portion of the plan. Unfortunately, a table with a correlated subplans or a reference to an InitPlan is still not able to be considered for parallelism; there is substantial further work to be done in this area. Pass Query Text To Workers (Rafia Sabih). In PostgreSQL 9.6, the query text associated with a parallel worker does not show up in pg_stat_activity; in PostgreSQL 10, it will. Also, if you're unlucky enough to have a parallel worker crash, the query text for the query which it was running will now be reported, just as for crashes in user-connected backends. These improvements are helpful for monitoring and troubleshooting. Still Pending These improvements have been submitted for possible inclusion in PostgreSQL 10, but are neither committed nor deferred to a future release as of this writing. Parallel CREATE INDEX (Peter Geoghegan). The server-side infrastructure for parallelism was built in such a way as to allow it to be applied to utility commands as well as queries, but we haven't got any examples yet. This proposed work speeds up btree index creation by about a factor of three. Better Parallel Hash Join (Thomas Munro). In PostgreSQL 9.6, the only way to perform a hash join in parallel is for each worker to build a private copy of the hash table. Then, all workers scan the probe table in parallel and each probes its own hash table. This is fine, and perhaps optimal, when the build table is small, because each worker has entirely contention-free access to its own hash table. However, it's a poor strategy when the hash table is large or expensive to build. The proposed patch fixes this by allowing workers to build a single, shared hash table, optionally in parallel. For the most part, the advantage of this patch is not that it makes the join run any faster, but that it allows both inputs to the join to be constructed in parallel rather than only one of them. Pass InitPlan Values to Workers (Amit Kapila). If an InitPlan is attached to a query above a Gather node and used below the Gather node, we could compute the value in the leader and disseminate it to all the workers. Currently, plans of this type are disallowed because we have no mechanism for passing the value down to the workers; this patch adds such a mechanism, and allows the corresponding plans. Parallel Append (Amit Khandekar). Currently, Append can be used in parallel queries only in very limited ways. A partitioned table can be scanned in parallel, but every partition must be scanned using some type of parallel plan. While the changes mentioned above make this much less of a restriction in PostgreSQL 10 than it was in PostgreSQL 9.6, it's still possible for the fastest method of scanning a table to be some kind of non-parallel plan, such as an index scan on a non-btree index. Furthermore, when an Append does appear in the parallel portion of a plan, the first child plan will be scanned to completion before beginning the second child plan, which will be completed before beginning the third child plan, and so forth. It will often be better to spread out the workers so that, for example, I/O from different disks can happen in parallel, and to reduce CPU contention. This patch aims to address these problems, as well as to allow UNION ALL queries to be parallelized in the obvious way. Improve Access to Parallel Query from Procedural Languages (Rafia Sabih). In PostgreSQL 9.6, certain queries in PL/pgsql procedures can access parallel query, but other procedural languages cannot. Work is underway to improve access to parallel queries across all built-in procedural languages, both by adding support for languages other than PL/pgsql, and also by expanding the range of situations in which parallel query can be used from within PL/pgsql. Allow Parallel Query at SERIALIZABLE (Thomas Munro). The serializable isolation level does not current support parallel query. Patches are proposed to lift this restriction. With the exception of Peter Geoghegan, all of the developers mentioned above are employed by EnterpriseDB. I'd like to take this opportunity to thank EnterpriseDB management for continuing to support this work in the PostgreSQL community; 100% of the work that EnterpriseDB has done in this area has been submitted to the PostgreSQL community using the PostgreSQL community process and under the PostgreSQL license. We expect to continue making such contributions in future releases. In the process of developing, testing, and benchmarking these improvements, my colleagues and I have learned quite a bit about what distinguishes a good parallel query idea from a bad one, which I think will be useful in setting direction for future parallel query work. I'll share some insights about that topic in a future blog post, as this one has already grown quite long. Please feel free to add comments below! I'm interested to hear what you like and dislike about the new parallel query improvements, and about what work you would like to see done in the future. Robert M. Haas is Vice President, Chief Database Architect, at EnterpriseDB. This post originally appeared on Robert's personal blog.
... View more
The list of new features coming in PostgreSQL 10 is extremely impressive. I've been involved in the PostgreSQL project since the 8.4 release cycle (2008-2009), and I've never seen anything like this. Many people have already blogged about these features elsewhere; my purpose here is just to bring together a list of the features that, in my opinion, are the biggest new things that we can expect to see in PostgreSQL 10.
[Disclaimers: (1) Other people may have different opinions. (2) It is not impossible that some patches could be reverted prior to release. (3) The list below represents the work of the entire PostgreSQL community, not specifically me or EnterpriseDB, and I have no intention of taking credit for anyone else's work.]
Declarative Partitioning. In previous versions of PostgreSQL, PostgreSQL supported only table inheritance, which could be used to simulate table partitioning, but it was complicated to set up and the performance characteristics were not that great. In PostgreSQL 10, it's possible to do list or range partitioning using dedicated syntax, and INSERT performance has been greatly improved. There is still a lot more work to do in future releases to improve performance and add missing features, but even what we have in v10 is already a major step forward (IMHO, anyway).
Logical Replication. PostgreSQL has had physical replication -- often called streaming replication -- since version 9.0, but this requires replicating the entire database, cannot tolerate writes in any form on the standby server, and is useless for replicating across versions or database systems. PostgreSQL has had logical decoding -- basically change capture -- since version 9.4, which has been embraced with enthusiasm, but it could not be used for replication without an add-on of some sort. PostgreSQL 10 adds logical replication which is very easy to configure and which works at table granularity, clearly a huge step forward. It will copy the initial data for you and then keep it up to date after that.
Improved Parallel Query. While PostgreSQL 9.6 offers parallel query, this feature has been significantly improved in PostgreSQL 10, with new features like Parallel Bitmap Heap Scan, Parallel Index Scan, and others. Speedups of 2-4x are common with parallel query, and these enhancements should allow those speedups to happen for a wider variety of queries.
SCRAM Authentication. PostgreSQL offers a remarkable variety of different authentication methods, including methods such as Kerberos, SSPI, and SSL certificate authentication, which are intended to be highly secure. However, sometimes users just want to use passwords managed by the PostgreSQL server itself. In existing releases, this can be done either using the password authentication type, which just sends the user-supplied password over the wire, or via the md5 authentication type, which sends a hashed and salted version of the password over the wire. In the latter approach, stealing the hashed password from the database or sniffing it on the wire is equivalent to stealing the password itself, even if you can't compute a preimage. PostgreSQL 10 introduces scram authentication, specifically SCRAM-SHA-256, which is much more secure. Neither the information which the server stores on disk nor the contents of an authentication exchange suffice for the server to impersonate the client. Of course, the substitution of SHA-256 for MD5 is also a substantial improvement. See also Michael Paquier's blog on this topic. One point to note is that, unless you are using libpq, you will not be able to use this feature unless your particular client driver has been updated with SCRAM support, so it may be a while before this feature is universally available.
Executor Speedups. Substantial parts of PostgreSQL's executor have been rewritten to make expression and targetlist projection faster; just-in-time compilation will be added in a future release. Hash aggregation has been rewritten to use a more efficient hash table and store narrower tuples in it, and work has also been done to speed up queries that compute multiple aggregates and joins where one side can be proven unique. Grouping sets now support hash aggregation. While all PostgreSQL releases typically contain at least some performance improvements, the rewrite of expression and targetlist projection is a particularly large and significant improvement which will benefit many users.
Durable Hash Indexes. Hash indexes in PostgreSQL have suffered from years of long neglect; the situation will be noticeably improved in v10. The most notable change is that changes to a hash index now write WAL, which means that they are crash-safe and that they are properly replicated to standbys. However, a good deal of other work has been done, including the necessary prerequisite step of revamping the bucket split algorithm to improve performance and concurrency, caching the metapage for better performance, adding page-at-a-time vacuuming, and expanding them more gradually. Amit Kapila even writes about a case where they outperformed btree indexes. While there's certainly more work to be done here, I'm excited about these improvements.
ICU Collation Support. In current releases, PostgreSQL relies exclusively on the collations supplied by the operating system, but this sometimes causes problems: collation behavior often differs between operating systems, especially between Linux and Windows, and it isn't always easy to find a collation for one operating system whose behavior matches that of some collation available on another system. Furthermore, at least on Red Hat, glibc regularly whacks around the behavior of OS-native collations in minor releases, which effectively corrupts PostgreSQL's indexes, since the index order might no longer match the (revised) collation order. To me, changing the behavior of a widely-used system call in a maintenance release seems about as friendly as locking a family of angry racoons in someone's car, but the glibc maintainers evidently don't agree. (In fact, there's one discussion where it's suggested that you not use some of those interfaces at all.) libicu, on the other hand, says they care about this.
But Wait, There's More!
In my estimation, the features listed above are the most exciting things that users can expect in PostgreSQL 10, which is expected to be released in September. However, there are quite a few other significant features as well which could easily have qualified as headline features in a release less jam-packed than this one. Here are some of them:
Extended Statistics (ndistinct, functional dependencies). If the query planner makes a bad row count estimate resulting in a terrible plan, how do you fix it? With extended statistics, you can tell the system to gather additional statistics according to parameters that you specify, which may help it get the plan right.
FDW Aggregate Pushdown. In previous releases, SELECT COUNT(*) FROM foreign_tableoperated by fetching every row form the foreign table and counting them locally. That was terrible, so now it doesn't.
Transition Tables. It is now possible to write a PL/pgsql AFTER STATEMENT trigger which can access all rows modified by the statement. This can be both faster and more convenient than writing an AFTER ROW trigger that is called once per row.
Improved Wait Events. PostgreSQL 9.6 introduced wait event monitoring in pg_stat_activity, but only for a limited range of events. In PostgreSQL 10, you'll be able to see latch waits and I/O waits, even for auxiliary processes and unconnected background workers.
New Integrity Checking Tools. You can now validate the integrity of your btree indexes using the new amcheck module. If you're a developer adding write-ahead logging to a new storage form, or a user who thinks the developers may have introduced a bug, you'll be pleased to be able to test with wal_consistency_checking. pg_dump now has better test coverage.
Smarter Connection Handling. Connections through libpq can now specify multiple hosts, and you can even tell it to find you the server that is currently accepting write connections.
Quorum-Based Synchronous Replication. You can now specify that a commit must be acknowledged by any K of N standby synchronous servers, improving flexibility and performance.
Other Cool Stuff
Many other things have also been significantly improved in this release. XMLTABLE makes querying XML data faster and easier. You can now interrogate the commit status of a transaction directly, and we've got better tracking of replication lag. psql now supports \if ... \elseif ... \else ... \endif to make scripting easier, and there are new functions and new roles to allow monitoring tools to run without superuser privileges. Encoding conversions are now faster, and so is sorting. You can compress the transaction log while streaming it. And there's more, but this blog post is too long already. If you're interested in reading even more about new features that will be coming with PostgreSQL 10, depesz blogs frequently on this topic, and so does Michael Paquier. Both have additional details on some of the features mentioned here, as well as others that may be of interest.
This final note: we have had chronic problems with users erroneously believing that the pg_xlog or pg_clog directory is non-critical data, possibly because the directory names include the word "log". Those directories have been renamed to pg_wal and pg_xact, which we hope will be clearer. All SQL functions and utility names that formerly included the string "xlog", meaning the transaction log or write-ahead log, have been renamed to use "wal" instead. Conversely, the default log directory is now called log rather than pg_log so that it is looks less like an internal name. These changes will probably cause a bit of upgrade pain for some users, but we hope that they will also help users to avoid catastrophic mistakes.
If you're still reading this, thanks! I think it's going to be a great release.
Robert H. Haas is Vice President, Chief Database Architect, at EnterpriseDB.
This post originally appeared on Robert's personal blog.
... View more
4/27/2017 In a talk which I gave at PGCONF.IN and, in a shorter version, at PGCONF.US, I had a few slides on who contributes to PostgreSQL development. Here, I'd like to present a slightly expanded version of the information which was in the talk. The information in this post considers calendar year 2016 and comes from two sources. First, I went through the PostgreSQL commit log for 2016, manually tagged each commit by principal author, and recorded the number of new lines of code added by that commit based on git diff --stat -w -M, options which are intended to suppress (more or less successfully) whitespace-only changes and changes due to file renames. I also manually eliminated a few large mechanical commits, principally translation updates. Second, Thom Brown extracted the authors of every email sent to the pgsql-hackers mailing list during 2016, and I then cleaned that up and normalized the names in an attempt to make sure that all emails sent by the same person were counted under one name. Note that, because this data is all for calendar year 2016, it includes the end of the PostgreSQL 9.6 development cycle and the beginning of the PostgreSQL 10 development cycle. I feel that this data, taken together, presents a reasonable view of who is contributing to PostgreSQL development. From the commit log data, we can see who is writing code, and also who is committing that code when it gets written. From the email counts, we can see who is participating in mailing list discussions, which captures - at least to some degree - the work of reviewing patches, providing feedback on designs, reporting problems, etc. Neither measure is perfect; notably, anyone who was frequently the second author on a patch might be under-represented in these numbers, and two people could have written the same number of emails yet one of them might have written much more detailed, thoughtful, and useful emails. Nonetheless, I believe these numbers do a fairly good job of capturing who did the work of moving PostgreSQL development forward during calendar year 2016. Note that this considers only core development. Many other people contribute by contributing to projects such as pgAdmin, pgpool, pgbouncer, and various PostgreSQL connectors; others contribute to user education, advocacy, web site maintenance, and other efforts. I think it would be useful to see statistics on those types of contributions as well, but I leave it to people more familiar with those areas to judge how such contributions would be best measured. Disclaimers aside, and before we get into the details, here are some quick overall statistics: In 2016, 141 people contributed at least 1 new line of code to PostgreSQL. 37 of those people account for 90% of the new lines of code contributed to PostgreSQL in 2016, and 14 of them account for 66% of the new lines of code contributed to PostgreSQL during 2016. In 2016, 18 committers committed at least one patch for which they were not the principal author. 90% of the lines of code for which the principal committer was not the author were committed by 6 committers, and 66% of the lines of code for which the principal committer was not the author were committed by 2 committers. In 2016, 528 people (modulo duplicate email addresses that I couldn't identify as belonging to the same person) sent at least 1 email to pgsql-hackers. 90% of those emails were sent by 78 people, and 66% of those emails were sent by 23 people. Now, here are the detailed charts. First, here are the 37 people who were the principal authors of 90% of lines of new code contributed during 2016. Non-committers are marked with an asterisk. "lines" shows the number of lines of code for which that person was the principal author, "pct_lines" shows that as a percentage of the total lines contributed, and "commits" is the number of commits across which those lines were spread. # | author | lines | pct_lines | commits
1 | Tom Lane | 62077 | 29.20 | 637
2 | Amit Langote [*] | 9889 | 4.65 | 30
3 | Robert Haas | 9685 | 4.55 | 108
4 | Stephen Frost | 9177 | 4.32 | 46
5 | Teodor Sigaev | 8345 | 3.92 | 28
6 | Michael Paquier [*] | 7778 | 3.66 | 106
7 | Andres Freund | 5913 | 2.78 | 61
8 | David Rowley [*] | 5582 | 2.63 | 26
9 | Alexander Korotkov [*] | 5174 | 2.43 | 11
10 | Peter Eisentraut | 4877 | 2.29 | 161
11 | Heikki Linnakangas | 4378 | 2.06 | 42
12 | Thomas Munro [*] | 3535 | 1.66 | 31
13 | Magnus Hagander | 3494 | 1.64 | 26
14 | Amit Kapila [*] | 3480 | 1.64 | 35
15 | Kevin Grittner | 3103 | 1.46 | 23
16 | Andreas Karlsson [*] | 3062 | 1.44 | 33
17 | Bruce Momjian | 3049 | 1.43 | 27
18 | Fabien Coelho [*] | 2768 | 1.30 | 22
19 | Shigeru Hanada [*] | 2752 | 1.29 | 3
20 | Alvaro Herrera | 2636 | 1.24 | 56
21 | Jeevan Chalke [*] | 2454 | 1.15 | 2
22 | Etsuro Fujita [*] | 2378 | 1.12 | 21
23 | Kyotaro Horiguchi [*] | 2171 | 1.02 | 19
24 | Masahiko Sawada [*] | 2129 | 1.00 | 20
25 | Peter Geoghegan [*] | 2121 | 1.00 | 25
26 | Tomas Vondra [*] | 2084 | 0.98 | 10
27 | Craig Ringer [*] | 2063 | 0.97 | 17
28 | Artur Zakirov [*] | 1962 | 0.92 | 13
29 | Andrew Gierth [*] | 1726 | 0.81 | 5
30 | Dean Rasheed | 1627 | 0.77 | 10
31 | Daniel Vérité [*] | 1530 | 0.72 | 4
32 | Emre Hasegeli [*] | 1497 | 0.70 | 5
33 | Joe Conway | 1471 | 0.69 | 10
34 | Noah Misch | 1430 | 0.67 | 31
35 | Jim Nasby [*] | 1404 | 0.66 | 11
36 | Petr Jelinek [*] | 1400 | 0.66 | 10
37 | Pavel Stehule [*] | 1254 | 0.59 | 6 Next, here are all of the committers who committed code for which they were not the principal author during 2016; in other words, these are committers that committed code written by non-committers (or possibly by another committer, but that's rare). "lines" is the number of new lines of code added, "pct_lines" is that same number as a percentage of the total, and "commits" is the number of commits across which those lines were spread. # | committer | lines | pct_lines | commits
1 | Robert Haas | 37726 | 40.03 | 241
2 | Tom Lane | 25293 | 26.84 | 204
3 | Alvaro Herrera | 7611 | 8.08 | 59
4 | Teodor Sigaev | 7252 | 7.70 | 32
5 | Heikki Linnakangas | 4191 | 4.45 | 33
6 | Peter Eisentraut | 3588 | 3.81 | 56
7 | Andres Freund | 2558 | 2.71 | 22
8 | Simon Riggs | 1886 | 2.00 | 21
9 | Fujii Masao | 1626 | 1.73 | 21
10 | Magnus Hagander | 638 | 0.68 | 30
11 | Noah Misch | 533 | 0.57 | 10
12 | Andrew Dunstan | 426 | 0.45 | 6
13 | Kevin Grittner | 401 | 0.43 | 8
14 | Stephen Frost | 381 | 0.40 | 6
15 | Dean Rasheed | 56 | 0.06 | 1
16 | Bruce Momjian | 49 | 0.05 | 10
17 | Joe Conway | 23 | 0.02 | 1
18 | Michael Meskes | 5 | 0.01 | 1 Finally, here are the 78 people who sent 90% of the emails to pgsql-hackers in 2016, with the number of emails sent by each and the same as a percentage of the total. # | author | emails | pct_emails
1 | Tom Lane | 2911 | 11.00
2 | Robert Haas | 2682 | 10.14
3 | Michael Paquier | 1679 | 6.35
4 | Andres Freund | 1344 | 5.08
5 | Alvaro Herrera | 913 | 3.45
6 | Amit Kapila | 789 | 2.98
7 | Craig Ringer | 680 | 2.57
8 | Peter Eisentraut | 631 | 2.38
9 | Pavel Stehule | 583 | 2.20
10 | Amit Langote | 562 | 2.12
11 | Peter Geoghegan | 551 | 2.08
12 | Kyotaro Horiguchi | 502 | 1.90
13 | Stephen Frost | 443 | 1.67
14 | Jim Nasby | 437 | 1.65
15 | Bruce Momjian | 364 | 1.38
16 | Tomas Vondra | 349 | 1.32
17 | Fabien Coelho | 330 | 1.25
18 | Magnus Hagander | 311 | 1.18
19 | Simon Riggs | 311 | 1.18
20 | Ashutosh Bapat | 293 | 1.11
21 | Petr Jelinek | 283 | 1.07
22 | David Steele | 270 | 1.02
23 | Heikki Linnakangas | 264 | 1.00
24 | Kevin Grittner | 262 | 0.99
25 | Thomas Munro | 261 | 0.99
26 | David G. Johnston | 259 | 0.98
27 | Etsuro Fujita | 252 | 0.95
28 | Haribabu Kommi | 217 | 0.82
29 | Noah Misch | 215 | 0.81
30 | Masahiko Sawada | 209 | 0.79
31 | David Rowley | 205 | 0.77
32 | Andrew Dunstan | 178 | 0.67
33 | Jeff Janes | 177 | 0.67
34 | Joshua D. Drake | 177 | 0.67
35 | Fujii Masao | 162 | 0.61
36 | Alexander Korotkov | 155 | 0.59
37 | Dilip Kumar | 154 | 0.58
38 | Joe Conway | 151 | 0.57
39 | Takayuki Tsunakawa | 133 | 0.50
40 | Tatsuo Ishii | 127 | 0.48
41 | Aleksander Alekseev | 126 | 0.48
42 | Andreas Karlsson | 126 | 0.48
43 | Greg Stark | 123 | 0.46
44 | Corey Huinker | 122 | 0.46
45 | David Fetter | 113 | 0.43
46 | Vitaly Burovoy | 107 | 0.40
47 | Merlin Moncure | 106 | 0.40
48 | Claudio Freire | 105 | 0.40
49 | Julien Rouhaud | 100 | 0.38
50 | Teodor Sigaev | 93 | 0.35
51 | Kouhei Kaigai | 91 | 0.34
52 | Josh Berkus | 89 | 0.34
53 | Artur Zakirov | 87 | 0.33
54 | Konstantin Knizhnik | 83 | 0.31
55 | Anastasia Lubennikova | 80 | 0.30
56 | Daniel Verite | 80 | 0.30
57 | Oleksandr Shulgin | 80 | 0.30
58 | Yury Zhuravlev | 79 | 0.30
59 | Mithun Cy | 76 | 0.29
60 | Christoph Berg | 75 | 0.28
61 | Andreas Seltenreich | 74 | 0.28
62 | Christian Ullrich | 74 | 0.28
63 | Pavan Deolasee | 71 | 0.27
64 | Vladimir Sitnikov | 71 | 0.27
65 | Dean Rasheed | 68 | 0.26
66 | Oleg Bartunov | 67 | 0.25
67 | Stas Kelvich | 67 | 0.25
68 | Jesper Pedersen | 63 | 0.24
69 | Thom Brown | 63 | 0.24
70 | Karl O. Pinc | 61 | 0.23
71 | Robbie Harwood | 58 | 0.22
72 | Andrew Gierth | 55 | 0.21
73 | Vik Fearing | 55 | 0.21
74 | Andrew Borodin | 53 | 0.20
75 | Victor Wagner | 52 | 0.20
76 | Chapman Flack | 50 | 0.19
77 | Fabrízio de Royes Mello | 50 | 0.19
78 | Rushabh Lathia | 50 | 0.19 Thanks to all who contributed to PostgreSQL development during 2016! A database dump of the data used to construct these reports is available for those who may find it useful. Robert Haas is Vice President, Chief Database Architect, at EnterpriseDB. This post originally appeared on his personal blog.
... View more