Bloat in PostgreSQL: a taxonomy
PostgreSQL's approach to transaction management uses MVCC (multi-version concurrency control). Postgres often maintains multiple physical versions of a single logical row. This is used to reconstruct the logical contents of tables at a specific point in time for SQL queries (zero or one row versions should be visible for each logical row). MVCC avoids having readers block writers and writers block readers, a frequent problem with database systems that use traditional 2PL. However, there is a cost to this approach: bloat must eventually be removed and reclaimed, typically by an autovacuum worker process.
Most Postgres DBAs are familiar with bloat, and almost as many will have some experience with tuning autovacuum to better manage it. There have been quite a few talks about the practical aspects of optimizing autovacuum and avoiding bloat; this talk isn't one of them. Instead, the goal of the talk is to show how bloat can accumulate, what that looks like at the page level and at the level of entire tables and indexes, and how that may impact production queries.
The talk covers:
-
How VACUUM processes each structure, and in what order.
-
How the HOT optimization works.
-
How Postgres manages free space.
-
The design of VACUUM. What space/bloat management tasks are prioritized by VACUUM, and why this makes sense.