With the recent release of BART 2.0, EnterpriseDB is excited to introduce a new feature: block-level incremental backup. As you would expect, when you perform an incremental backup, BART only stores the data that has been modified since the "parent" backup - the parent may be an earlier full backup or an earlier incremental backup.
Most incremental backup tools for PostgreSQL store an entire file if any block within that file has changed. What is new and different about BART 2.0 is that an incremental backup stores only those blocks that have been modified. If you have a 1GB table, for example, and you modify 10% of the blocks within that table, BART will only store those modified blocks, giving you a space reduction of 90%. With other incremental backup tools, you'll end up storing the entire 1GB file, 90% of which is redundant.
In addition to reducing the space required for each incremental backup, BART gives you a big performance win too.
In order to identify those blocks which have been modified, BART scans each WAL segment as those segments are archived. Because the write-ahead log (WAL) keeps a complete and detailed history of all of the changes you make to a cluster, BART can scan the WAL stream to extract a list of those blocks that you have changed. Those modifications are recorded in a set of bitmaps (one bitmap for each relation file, one bit for each modified block) - we call these the modified block maps (or MBM files).
When you perform an incremental backup, BART combines the MBM (modified block map) files into a CBM (combined block map). This is a very quick operation - we simply OR the bitmaps together to come up with a single bitmap (per relation file) that indicates the set of blocks modified since the parent backup.
Once the CBM is complete, BART harvests the modified blocks from the cluster and stores them in the vault (the directory that holds all of your backups).
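To make the mechanics concrete, here is a minimal sketch (in Python, for illustration only - this is not BART's actual implementation) of the two steps just described: OR-ing per-file modified block maps into a combined block map, then seeking to just the blocks whose bits are set. The function names, and the assumption that each bitmap is a simple bytes object with one bit per block, are ours.

```python
BLOCK_SIZE = 8192  # PostgreSQL's default block size

def combine_mbms(mbms):
    """OR a list of modified block maps into a single combined block map."""
    cbm = bytearray(max(len(m) for m in mbms))
    for mbm in mbms:
        for i, byte in enumerate(mbm):
            cbm[i] |= byte
    return bytes(cbm)

def modified_block_numbers(cbm):
    """Yield the block numbers whose bits are set in the combined map."""
    for byte_index, byte in enumerate(cbm):
        for bit in range(8):
            if byte & (1 << bit):
                yield byte_index * 8 + bit

def harvest_blocks(relation_path, cbm):
    """Read only the modified blocks from a relation file."""
    harvested = {}
    with open(relation_path, "rb") as f:
        for blkno in modified_block_numbers(cbm):
            f.seek(blkno * BLOCK_SIZE)
            harvested[blkno] = f.read(BLOCK_SIZE)
    return harvested
```

The key point is that combining bitmaps is just a byte-wise OR, so building the CBM is cheap no matter how much activity the WAL recorded.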
There are several important optimizations that boost performance during an incremental backup:
1 - BART identifies modified blocks by scanning each WAL segment as it is archived so there is no penalty to pay during an incremental backup.
2 - As we mentioned before, BART only harvests those blocks that were modified since the parent backup.
3 - By consulting the CBM (combined block map), BART will harvest each modified block exactly once, regardless of how many times the block has been modified (compare that to WAL replay, where the database server replays every change, even if those changes are overwritten by a future modification).
4 - BART can make better use of your hardware because it runs multiple harvesters in parallel. You choose how many harvesters to run based on your hardware capabilities and the load you are willing to add to the database server (while the backup runs).
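The parallel-harvester idea in optimization 4 can be sketched as a worker pool that hands out relation files; this thread-based sketch is illustrative only (it is not BART's implementation), and `harvest_file` is a hypothetical stand-in for the real per-file harvest work.

```python
from concurrent.futures import ThreadPoolExecutor

def harvest_file(relation_file):
    """Hypothetical stand-in: copy one relation file's modified blocks to the vault."""
    return (relation_file, "harvested")

def parallel_harvest(relation_files, num_harvesters=4):
    """Distribute relation files across a pool of harvesters.

    Harvesting is I/O-bound (disk reads plus network writes to the vault),
    so multiple workers can keep the hardware busy concurrently."""
    with ThreadPoolExecutor(max_workers=num_harvesters) as pool:
        return list(pool.map(harvest_file, relation_files))
```

Tuning `num_harvesters` is the knob mentioned above: more workers finish the backup sooner, but add more concurrent load on the database server.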
When you restore an incremental backup, BART begins by restoring ancestor backups, starting with the most recent full backup. Once the full backup has been restored, BART consults the CBM file and overlays the modified blocks (from the vault) on top of the restored cluster. Finally, BART restores any WAL files generated while the incremental backup was running.
Again, BART includes several optimizations that make it faster to restore from an incremental backup, compared to a full backup + WAL replay:
1 - After restoring the most recent full backup, BART overlays only those blocks indicated by the CBM (that is, only overlays those blocks that changed since the previous backup).
2 - BART restores each modified block exactly once (per backup), regardless of how many times the block was modified (again, compare this to WAL replay).
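The overlay phase described above amounts to writing each harvested block back at its original offset in the restored file. A minimal sketch (ours, not BART's code - the names and the block-number-to-page dict are assumptions):

```python
BLOCK_SIZE = 8192  # PostgreSQL's default block size

def overlay_blocks(restored_path, harvested_blocks):
    """Overlay modified blocks (block number -> page bytes) onto a restored file.

    Each block is written exactly once, at offset block_number * BLOCK_SIZE,
    regardless of how many times it was modified between backups."""
    with open(restored_path, "r+b") as f:
        for blkno, page in harvested_blocks.items():
            f.seek(blkno * BLOCK_SIZE)
            f.write(page)
```

Because only the changed blocks are touched, the cost of the overlay is proportional to the amount of change, not to the size of the database.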
We've benchmarked several scenarios with BART to try to understand and quantify the performance benefits that we achieve with block-level incremental backups.
Our benchmarks run on a 2TB database, using two c4.8xlarge Amazon EC2 instances, each with several 5000 IOPS SSD drives. The database server lives on one host, the BART vault lives on the other. We have a 10Gbps network link between the two hosts. The benchmarks run several times to even out any "noise".
BART uses pg_basebackup to perform full backups - backing up the entire 2TB database takes approx 6 1/2 hours on the above hardware. (Note: we plan to implement a much faster full backup technique in an upcoming release; pg_basebackup is single-threaded and not particularly fast.)
After creating a full backup (which we'll call SUNDAY), we run a procedure that changes 5% of the blocks in each table and then create an incremental backup (MONDAY).
The MONDAY backup takes 42 minutes - it captures only those database blocks that have been modified since the full backup created on SUNDAY. Without incremental backup, pg_basebackup would have taken approx. 6 1/2 hours. Other backup tools would take considerably more time than BART because they would back up every block in each table (that is, 2TB instead of BART's 100GB).
The benchmark continues to change 5% (a different 5%) of the blocks in the database, generating incremental backups (TUESDAY, WEDNESDAY, ...) after each procedure. Each incremental backup requires approximately 42 minutes (and 100GB).
We also create differential backups during the benchmark. A differential backup is like an incremental backup except that the parent is the most recent full backup instead of the most recent incremental backup. Differential backups take longer to create but are still much faster than a full backup (and the differential backups require a fraction of the disk space).
We see similar performance improvements when restoring incremental and differential backups. Restoring the full 2TB backup (SUNDAY) takes approx 3 hours. Restoring a differential backup after 5 days' worth of changes (SUNDAY through FRIDAY, 25% of the database) required 4 hours, 57 minutes. Compare that to restoring the full backup and then replaying 5 days' worth of the WAL stream, at 8 hours, 38 minutes, and you can see the benefit.
We already have big plans for BART's future.
In the next feature release, we will add parallelism to the restore operation - the block overlay phase will utilize multiple worker processes to greatly improve performance. You've seen that BART 2.0 is already faster than a full restore + WAL replay; parallel restore will make BART even faster.
Because BART builds a "database" of modified blocks, we can combine those blocks in interesting ways to reduce your RTO (Recovery Time Objective), allowing you to recover from failures much more quickly. For example, an upcoming release will introduce synthetic backups that allow you to pre-assemble (or roll up) a full backup along with a chain of incrementals into a single backup, ready to restore.
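As a rough mental model for the roll-up (ours, not BART's actual on-disk format), a synthetic backup folds a chain of incrementals over the full backup, with the newest version of each block winning. If each backup is represented as a dict mapping block numbers to page contents:

```python
def roll_up(full_backup, incremental_chain):
    """Pre-assemble a synthetic full backup from a full backup plus a chain
    of incrementals (oldest first); later versions of a block win."""
    synthetic = dict(full_backup)
    for incremental in incremental_chain:
        synthetic.update(incremental)
    return synthetic
```

The result is a single, ready-to-restore backup, so recovery no longer has to walk the chain at restore time.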
We also plan to integrate BART with other complementary products and technologies, such as offline storage systems, filesystem snapshots, and Postgres Enterprise Manager (for configuration, automation, monitoring, and alerting). In the very near future, BART will offer a RESTful API that you can use to trigger backups and restores, alter and inspect BART configuration, and query and manage the vault.
And in an upcoming release, you will be able to place your vault (that's where the backups live) on any host/drive/device to optimize performance for your particular environment. This flexibility can improve performance and makes it easier to integrate with other tools that you may already use.