PostgreSQL® Archiv - Page 2 of 3 - credativ®

On Thursday, 27 June, and Friday, 28 June 2024, I had the amazing opportunity to attend Swiss PGDay 2024. The conference was held at the OST Eastern Switzerland University of Applied Sciences, Campus Rapperswil, which is beautifully situated on the banks of Lake Zurich in a nice, green environment. With approximately 110 attendees, the event had mainly a B2B focus, although not exclusively. Despite the conference being seemingly smaller in scale compared to PostgreSQL events in larger countries, it actually reflected perfectly the scope relevant for Switzerland.

During the conference, I presented my talk “GIN, BTREE_GIN, GIST, BTREE_GIST, HASH & BTREE Indexes on JSONB Data“. The talk summarized the results of my long-term project at NetApp, including newer interesting findings compared to the presentation I gave in Prague at the beginning of June. As far as I could tell, my talk was well received by the audience, and I received very positive feedback.

At the very end on Friday, I also presented a lightning talk, “Can PostgreSQL Have a More Prominent Role in the AI Boom?” (my slides are at the end of the file). In this brief talk, I raised the question of whether it would be possible to implement AI functionality directly into PostgreSQL, including storing embedding models and trained neural networks within the database. Several people in the audience, involved with ML/AI, reacted positively on this proposal, acknowledging that PostgreSQL could indeed play a more significant role in ML and AI topics.

The conference featured two tracks of presentations, one in English and the other in German, allowing for a diverse range of topics and speakers. I would like to highlight some of them:

At the end of the first day, all participants were invited to a social event for networking and personal exchange, which was very well organized. I would like to acknowledge the hard work and dedication of all the organizers and thank them for their efforts. Swiss PGDay 2024 was truly a memorable and valuable experience, offering great learning opportunities. I am grateful for the chance to participate and contribute to the conference, and I look forward to future editions of this event. I am also very thankful to NetApp-credativ for making my participation in the conference possible.

Photos by organizers, Gülçin Yıldırım Jelínek and author:

 

 

   

Hashicorp Terraform is a well-known infrastructure automation tool mostly targeting cloud deployments. Instaclustr (a part of NetApp’s CloudOps division and credativ’s parent company) provides a managed service for various data stores, including PostgreSQL. Provisioning managed clusters is possible via the Instaclustr console, a REST API or through the Instaclustr Terraform Provider.

In this first part of a blog series, it is shown how a Postgres cluster can be provisioned using the Instaclustr Terraform Provider, including whitelisting the IP address and finally connecting to it via the psql command-line client.

Initial Setup

The general requirement is having an account for the Instaclustr Managed service. The web console is located at https://console2.instaclustr.com/. Next, a provisioning API key needs to be created if not available already, as explained here.

Terraform providers are usually defined and configured in a file called provider.tf. For the Instaclustr Terraform provider, this means adding it to the list of required providers and setting the API key mentioned above:

terraform {
  required_providers {
    instaclustr = {
      source  = "instaclustr/instaclustr"
      version = ">= 2.0.0, < 3.0.0"
    }
  }
}

variable "ic_username" {
  type = string
}
variable "ic_api_key" {
  type = string
}

provider "instaclustr" {
    terraform_key = "Instaclustr-Terraform ${var.ic_username}:${var.ic_api_key}"
}

Here, ic_username and ic_api_key are defined as variables. They should be set in a terraform.tfvars file in the same directory

ic_username       = "username"
ic_api_key        = "0db87a8bd1[...]"

As the final preparatory step, Terraform needs to be initialized, installing the provider:

$ terraform init

Initializing the backend...

Initializing provider plugins...
- Finding instaclustr/instaclustr versions matching ">= 2.0.0, < 3.0.0"...
- Installing instaclustr/instaclustr v2.0.136...
- Installed instaclustr/instaclustr v2.0.136 (self-signed, key ID 58D5F4E6CBB68583)

[...]

Terraform has been successfully initialized!

Defining Resources

Terraform resources define infrastructure objects, in our case a managed PostgreSQL cluster. Customarily, they are defined in a main.tf file, but any other file name can be chosen:

resource "instaclustr_postgresql_cluster_v2" "main" {
  name                    = "username-test1"
  postgresql_version      = "16.2.0"
  private_network_cluster = false
  sla_tier                = "NON_PRODUCTION"
  synchronous_mode_strict = false
  data_centre {
    name                         = "AWS_VPC_US_EAST_1"
    cloud_provider               = "AWS_VPC"
    region                       = "US_EAST_1"
    node_size                    = "PGS-DEV-t4g.small-5"
    number_of_nodes              = "2"
    network                      = "10.4.0.0/16"
    client_to_cluster_encryption = true
    intra_data_centre_replication {
      replication_mode           = "ASYNCHRONOUS"
    }
    inter_data_centre_replication {
      is_primary_data_centre = true
    }
  }
}

The above defines a 2-node cluster named username-test1 (and referred to internally as main by Terraform) in the AWS US_EAST_1 region with PGS-DEV-t4g.small-5 instance sizes (2 vCores, 2 GB RAM, 5 GB data disk) for the nodes. Test/developer instance sizes for the other cloud providers would be:

Cloud ProviderDefault RegionInstance SizeData DiskRAMCPU
AWS_VPCUS_EAST_1PGS-DEV-t4g.small-55 GB2 GB2 Cores
AWS_VPCUS_EAST_1PGS-DEV-t4g.medium-3030 GB4 GB2 Cores
AZURE_AZCENTRAL_USPGS-DEV-Standard_DS1_v2-5-an5 GB3.5 GB1 Core
AZURE_AZCENTRAL_USPGS-DEV-Standard_DS1_v2-30-an30 GB3.5 GB1 Core
GCPus-west1PGS-DEV-n2-standard-2-55 GB8 GB2 Cores
GCPus-west1PGS-DEV-n2-standard-2-3030 GB8 GB2 Core

Other instance sizes or regions can be looked up in the console or in the section node_size of the Instaclustr Terraform Provider documentation.

Running Terraform

Before letting Terraform provision the defined resources, it is best-practice to run terraform plan. This lets Terraform plan the provisioning as a dry-run, and makes it possible to review the expected actions before creating any actual infrastructure:

$ terraform plan

Terraform used the selected providers to generate the following execution plan. Resource actions are
indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # instaclustr_postgresql_cluster_v2.main will be created
  + resource "instaclustr_postgresql_cluster_v2" "main" {
      + current_cluster_operation_status = (known after apply)
      + default_user_password            = (sensitive value)
      + description                      = (known after apply)
      + id                               = (known after apply)
      + name                             = "username-test1"
      + pci_compliance_mode              = (known after apply)
      + postgresql_version               = "16.2.0"
      + private_network_cluster          = false
      + sla_tier                         = "NON_PRODUCTION"
      + status                           = (known after apply)
      + synchronous_mode_strict          = false

      + data_centre {
          + client_to_cluster_encryption     = true
          + cloud_provider                   = "AWS_VPC"
          + custom_subject_alternative_names = (known after apply)
          + id                               = (known after apply)
          + name                             = "AWS_VPC_US_EAST_1"
          + network                          = "10.4.0.0/16"
          + node_size                        = "PGS-DEV-t4g.small-5"
          + number_of_nodes                  = 2
          + provider_account_name            = (known after apply)
          + region                           = "US_EAST_1"
          + status                           = (known after apply)

          + intra_data_centre_replication {
              + replication_mode = "ASYNCHRONOUS"
            }
        }
    }

Plan: 1 to add, 0 to change, 0 to destroy.
[...]

When the planned output looks reasonable, it can be applied via terraform apply:

$ terraform apply

Terraform used the selected providers to generate the following execution plan. Resource actions are
indicated with the following symbols:
  + create
[...]
Plan: 1 to add, 0 to change, 0 to destroy.

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes

instaclustr_postgresql_cluster_v2.main: Creating...
instaclustr_postgresql_cluster_v2.main: Still creating... [10s elapsed]
[...]
instaclustr_postgresql_cluster_v2.main: Creation complete after 5m37s [id=704e1c20-bda6-410c-b95b-8d22ef3f5a04]

Apply complete! Resources: 1 added, 0 changed, 0 destroyed.

That is it! The PostgreSQL cluster is now up and running after barely 5 minutes.

IP Whitelisting

In order to access the PostgreSQL cluster, firewall rules need to be defined for IP-whitelisting. In general, any network address block can be defined, but in order to allow access from the host running Terraform, a firewall rule for the local public IP address can be set via a service like icanhazip.com, appending to main.tf:

data "http" "myip" {
  url = "https://ipecho.net/plain"
}

resource "instaclustr_cluster_network_firewall_rules_v2" "main" {
  cluster_id = resource.instaclustr_postgresql_cluster_v2.main.id

  firewall_rule {
    network = "${chomp(data.http.myip.response_body)}/32"
    type    = "POSTGRESQL"
  }
}

The usage of the http Terraform module also needs an update to the providers.tf file, adding it to the list of required providers:

terraform {
  required_providers {
    instaclustr = {
      source  = "instaclustr/instaclustr"
      version = ">= 2.0.0, < 3.0.0"
    }
    http = {
      source = "hashicorp/http"
      version = "3.4.3"
    }
  }
}

And a subsequent re-run of terraform init, followed by terraform apply:

$ terraform init

Initializing the backend...

Initializing provider plugins...
- Reusing previous version of instaclustr/instaclustr from the dependency lock file
- Finding hashicorp/http versions matching "3.4.3"...
- Using previously-installed instaclustr/instaclustr v2.0.136
- Installing hashicorp/http v3.4.3...
- Installed hashicorp/http v3.4.3 (signed by HashiCorp)

[...]
Terraform has been successfully initialized!
[...]

$ terraform apply

data.http.myip: Reading...
instaclustr_postgresql_cluster_v2.main: Refreshing state... [id=704e1c20-bda6-410c-b95b-8d22ef3f5a04]
data.http.myip: Read complete after 1s [id=https://ipv4.icanhazip.com]

Terraform used the selected providers to generate the following execution plan. Resource actions are
indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # instaclustr_cluster_network_firewall_rules_v2.main will be created
  + resource "instaclustr_cluster_network_firewall_rules_v2" "main" {
      + cluster_id = "704e1c20-bda6-410c-b95b-8d22ef3f5a04"
      + id         = (known after apply)
      + status     = (known after apply)

      + firewall_rule {
          + deferred_reason = (known after apply)
          + id              = (known after apply)
          + network         = "123.134.145.5/32"
          + type            = "POSTGRESQL"
        }
    }

Plan: 1 to add, 0 to change, 0 to destroy.

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes

instaclustr_cluster_network_firewall_rules_v2.main: Creating...
instaclustr_cluster_network_firewall_rules_v2.main: Creation complete after 2s [id=704e1c20-bda6-410c-b95b-8d22ef3f5a04]

Apply complete! Resources: 1 added, 0 changed, 0 destroyed.

Connecting to the Cluster

In order to connect to the newly-provisioned Postgres cluster, we need the public IP addresses of the nodes, and the password of the administrative database user, icpostgresql. Those are retrieved and stored in the Terraform state by the Instaclustr Terraform provider, by default in a local file terraform.tfstate. To secure the password, one can change the password after initial connection, secure the host Terraform is run from, or store the Terraform state remotely.

The combined connection string can be setup as an output variable in a outputs.tf file:

output "connstr" {
  value = format("host=%s user=icpostgresql password=%s dbname=postgres target_session_attrs=read-write",
          join(",", [for node in instaclustr_postgresql_cluster_v2.main.data_centre[0].nodes:
               format("%s", node.public_address)]),
                      instaclustr_postgresql_cluster_v2.main.default_user_password
                )
  sensitive = true
}

After another terraform apply to set the output variable, it is possible to connect to the PostgreSQL cluster without having to type or paste the default password via:

$ psql "$(terraform output -raw connstr)"
psql (16.3 (Debian 16.3-1.pgdg120+1), server 16.2 (Debian 16.2-1.pgdg110+2))
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, compression: off)
Type "help" for help.

postgres=> 

Conclusion

In this first part, the provisioning of an Instaclustr Managed PostgreSQL cluster with Terraform was demonstrated. In the next part of this blog series, we plan to present a Terraform module that makes it even easier to provision PostgreSQL clusters. We will also check out which input variables can be set to further customize the managed PostgreSQL cluster.

Instaclustr offers a 30-day free trial for its managed service which allows to provision clusters with development instance sizes, so you can signup and try the above yourself today!

On Wednesday, June 5th, I attended the Prague PostgreSQL Developers Day 2024. It is the premier PostgreSQL conference in the Czech Republic, and this year marked its 16th iteration. The event was held in the modern and convenient environment of the Czech Technical University and was attended by almost 270 participants.

During the conference, I presented my talk, “GIN, BTREE_GIN, GIST, and BTREE Indexes on JSONB Data.” This talk summarized the current findings of my project at NetApp (credativ), which I initiated to deepen our understanding of these indexing methods and their performance results. Our goal is to provide relevant and valuable solutions to our customers, who often struggle with implementing JSONB columns and operations into their applications and find the available information insufficient. Even existing AI models fall short as they rely on the same limited publicly available data. The project is focused on JSONB data; however, the results have already shown applicability beyond just this type of data. The positive reactions from the audience indicated that my presentation was well-received. The conference is a bilingual event, with presentations in both Czech and English. Being from Czechia, I delivered my talk in Czech but used English slides.

The conference also featured seven other insightful talks throughout the day. The first talk by Jan Karremans from Cybertec delved into the CloudNativePG operator, which implements PostgreSQL to run on Kubernetes. The second talk by Jakub Zemanek from initMax provided a detailed guide on configuring PostgreSQL accounts based on Active Directory using Ansible and the ldap2pg program. Gülçin Yıldırım Jelínek from EDB followed with a practical example of implementing vector storage of AI embeddings in PostgreSQL. I also really enjoyed the engaging and informative talk by Boriss Mejias from EDB, who explained different types of consistency and ways to ensure them in PostgreSQL using synchronous or asynchronous replication, and colored his talk with practical examples from a big musical event, where payments from multiple sources must be quickly and safely distributed throughout the system. Pavel Stehule offered a deep analysis of the pros and cons of the existing cost-based optimizer in PostgreSQL (slides in Czech lang). Pavlo Golub from Cybertec delivered an informative session on programming different PostgreSQL operations with the Go language. In the final talk, Ales Zeleny presented an in-depth analysis of the features of two extensions focused on monitoring query performance in PostgreSQL – pg_stat_statements and pg_stat_monitor.

The conference was really well-organized, and I extend my gratitude to Tomas Vondra and the other organizers for their hard work. Overall, it was very successful event, filled with valuable insights, engaging discussions, and networking opportunities. I look forward to applying the knowledge gained here and I am eager for future editions of the P2D2 conference.

Useful links:

  1. Prague PostgreSQL Developer Day 2024 talks – summaries with links to slides (English / Czech)
  2. Prague PostgreSQL Developer Day 2024 conference recap – Tomas Vondra (English)

Photos by Tomas Vondra (EDB):

 

Overview

Tables that are created and dropped on demand, whether they are temporary or regular, are frequently used by application developers in PostgreSQL to simplify the implementation of various functionalities and to expedite responses. Numerous articles on the internet describe the advantages of using such tables for storing search results, precalculating figures for reports, importing data from external files, and more. One can even define a TEMP TABLE with the condition ON COMMIT DROP, allowing the system to clean up automatically. However, like most things, this solution has potential drawbacks, because size matters. A solution that functions smoothly for dozens of parallel sessions may suddenly begin to cause unexpected issues if the application is used by hundreds or thousands of users simultaneously during peak hours. Frequently creating and dropping tables and related objects, can cause significant bloat of certain PostgreSQL system tables. This is a well-known problem that many articles mention, but they often lack detailed explanations and quantification of the impact. Several pg_catalog system tables can become significantly bloated. Table pg_attribute is the most affected, followed by pg_attrdef and pg_class.

What is the main issue with the bloating of system tables?

We already encountered this issue in the PostgreSQL logs of one of our clients. When the bloat of system tables became too extensive, PostgreSQL decided to reclaim free space during an autovacuum operation. This action caused exclusive locks on the table and blocked all other operations for several seconds. PostgreSQL was unable to read information about the structures of all relations. And as a result, even the simplest select operations had to be delayed until the operation was resolved. This is, of course, an extreme and rare scenario that can only occur under exceptionally high load. Nevertheless, it’s important to be aware of it and be able to assess if it could also happen to our database.

Example of reporting table in accounting software

Let’s examine the impact of these short-lived relations on PostgreSQL system tables using two different examples. The first is a comprehensive example of TEMP TABLE where we will explain all the details, and the second is for benchmarking purposes. Our first example involves an imaginary accounting software that generates a wide variety of reports, many of which require some precalculation of results. The use of temporary tables for these purposes is a fairly obvious design choice. We will discuss one such example — a temporary pivot table for a report storing monthly summaries for an entire year, with one row per client_id:
CREATE TEMP TABLE pivot_temp_table (
   id serial PRIMARY KEY,
   inserted_at timestamp DEFAULT current_timestamp,
   client_id INTEGER,
   name text NOT NULL,
   address text NOT NULL,
   loyalty_program BOOLEAN DEFAULT false,
   loyalty_program_start TIMESTAMP,
   orders_202301_count_of_orders INTEGER DEFAULT 0,
   orders_202301_total_price NUMERIC DEFAULT 0,
   ...
   orders_202312_count_of_orders INTEGER DEFAULT 0,
   orders_202312_total_price NUMERIC DEFAULT 0);
We also want to create some indexes because some results can be quite huge:
CREATE INDEX pivot_temp_table_idx1 ON pivot_temp_table (client_id);
CREATE INDEX pivot_temp_table_idx2 ON pivot_temp_table (name);
CREATE INDEX pivot_temp_table_idx3 ON pivot_temp_table (loyalty_program);
CREATE INDEX pivot_temp_table_idx4 ON pivot_temp_table (loyalty_program_start);
Summary of the created objects:
  • A temporary table, pivot_temp_table, with 31 columns, 27 of which have default values.
  • Some of the columns are of the TEXT data type, resulting in the automatic creation of a TOAST table.
  • The TOAST table requires an index on chunk_id and chunk_seq.
  • The ID is the primary key, meaning a unique index on ID was automatically created.
  • The ID is defined as SERIAL, leading to the automatic creation of a sequence, which is essentially another table with a special structure.
  • We also defined four additional indexes on our temporary table.

Let’s now examine how these relations are represented in PostgreSQL system tables.

Table pg_attribute

The pg_attribute table stores the attributes (columns) of all relations. PostgreSQL will insert a total of 62 rows into the pg_attribute table:
  • Each row in our pivot_temp_table contains six hidden columns (tableoid, cmax, xmax, cmin, xmin, ctid) and 31 ‘normal’ column. This totals to 37 rows inserted for the main temp table.
  • Indexes will add one row for each column used in the index, equating to five rows in our case.
  • A TOAST table was automatically created. It has six hidden columns and three normal columns (chunk_id, chunk_seq, chunk_data), and one index on chunk_id and chunk_seq, adding up to 11 rows in total.
  • A sequence for the ID was created, which is essentially another table with a predefined structure. It has six hidden columns and three normal columns (last_value, log_cnt, is_called), adding another nine rows.

Table pg_attrdef

The pg_attrdef table stores default values for columns. Our main table contains many default values, resulting in the creation of 27 rows in this table. We can examine their content using a query:
SELECT
   c.relname as table_name,
   o.rolname as table_owner,
   c.relkind as table_type,
   a.attname as column_name,
   a.attnum as column_number,
   a.atttypid::regtype as column_data_type,
   pg_get_expr(adbin, adrelid) as sql_command
FROM pg_attrdef ad
JOIN pg_attribute a ON ad.adrelid = a.attrelid AND ad.adnum = a.attnum
JOIN pg_class c ON c.oid = ad.adrelid
JOIN pg_authid o ON o.oid = c.relowner
WHERE c.relname = 'pivot_temp_table'
ORDER BY table_name, column_number;
Our output:
    table_name    | table_owner | table_type |         column_name           | column_number |     column_data_type        | sql_command
------------------+-------------+------------+-------------------------------+---------------+-----------------------------+----------------------------------------------
 pivot_temp_table | postgres    | r          | id                            | 1             | integer                     | nextval('pivot_temp_table_id_seq'::regclass)
 pivot_temp_table | postgres    | r          | inserted_at                   | 2             | timestamp without time zone | CURRENT_TIMESTAMP
 pivot_temp_table | postgres    | r          | loyalty_program               | 6             | boolean                     | false
 pivot_temp_table | postgres    | r          | orders_202301_count_of_orders | 8             | integer                     | 0
 pivot_temp_table | postgres    | r          | orders_202301_total_price     | 9             | numeric                     | 0
--> up to the column "orders_202312_total_price"

Table pg_class

The pg_class table stores primary information about relations. This example will create nine rows: one for the temp table, one for the toast table, one for the toast table index, one for the ID primary key unique index, one for the sequence, and four for the custom indexes.

Summary of this example

Our first example produced a seemingly small number of rows – 62 in pg_attribute, 27 in pg_attrdef, and 9 in pg_class. These are very low numbers, and if such a solution was used by only one company, we would hardly see any problems. But consider a scenario where a company hosts accounting software for small businesses and hundreds or even thousands of users use the app during peak hours. In such a situation, many temp tables and related objects would be created and dropped at a relatively quick pace. In the pg_attribute table, we could see anywhere from a few thousand to even hundreds of thousands of records inserted and then deleted over several hours. However, this is still a relatively small use case. Let’s now imagine and benchmark something even larger.

Example of online shop

Let’s conduct deeper analysis using a more relatable and heavier example. Imagine an online retailer selling clothing, shoes, and other accessories. When a user logs into the shop, the database automatically creates some user-specific tables. These are later deleted by a dedicated process after a certain period of user inactivity. These relations are created to speed up the system’s responses to a specific user. Repeated selects from the main tables would be much slower, even though the main tables are partitioned by days, these partitions can be enormous. For our example, we don’t need to discuss the layout of sessions, nor whether the tables are created as temporary or regular ones, as both have the same impact on PostgreSQL system tables. We will also omit all other aspects of real-life implementation. This example is purely theoretical, inspired by design patterns discussed on the internet, and is not based on any real system. It should not be understood as a design recommendation. In fact, as we will see, this example would more likely serve as an anti-pattern.
  1. The “session_events” table stores selected actions performed by the user during the session. Events are collected for each action the user takes on the website, so there are at least hundreds, but more often thousands of events recorded from one session. These are all sent in parallel into the main event table. However, the main table is enormous. Therefore, this user-specific table stores only some events, allowing for quick analysis of recent activities, etc. The table has 25 different columns, some of which are of the TEXT type and one column of the JSONB type – which means a TOAST table with one index was created. The table has a primary key of the serial type, indicating the order of actions – i.e., one unique index, one sequence, and one default value were created. There are no additional default values. The table also has three additional indexes for quicker access, each on one column. Their benefit could be questionable, but they are part of the implementation.
    • Summary of rows in system tables – pg_attribute – 55 rows, pg_class – 8 rows, pg_attrdef – 1 row
  2. The “last_visited” table stores a small subset of events from the “session_events” table to quickly show which articles the user has visited during this session. Developers chose to implement it this way for convenience. The table is small, containing only 10 columns, but at least one is of the TEXT type. Therefore, a TOAST table with one index was created. The table has a primary key of the TIMESTAMP type, therefore it has one unique index, one default value, but no sequence. There are no additional indexes.
    • Rows in system tables – pg_attribute – 28 rows, pg_class – 4 rows, pg_attrdef – 1 row
  3. The “last_purchases” table is populated at login from the main table that stores all purchases. This user-specific table contains the last 50 items purchased by the user in previous sessions and is used by the recommendation algorithm. This table contains fully denormalized data to simplify their processing and visualization, and therefore it has 35 columns. Many of these columns are of the TEXT type, so a TOAST table with one index was created. The primary key of this table is a combination of the purchase timestamp and the ordinal number of the item in the order, leading to the creation of one unique index but no default values or sequences. Over time, the developer created four indexes on this table for different sorting purposes, each on one column. The value of these indexes can be questioned, but they still exist.
    • Rows in system tables – pg_attribute – 57 rows, pg_class – 8 rows
  4. The “selected_but_not_purchased” table is populated at login from the corresponding main table. It displays the last 50 items still available in the shop that the user previously considered purchasing but later removed from the cart or didn’t finish ordering at all, and the content of the cart expired. This table is used by the recommendation algorithm and has proven to be a successful addition to the marketing strategy, increasing purchases by a certain percentage. The table has the same structure and related objects as “last_purchases”. Data are stored separately from purchases to avoid mistakes in data interpretation and also because this part of the algorithm was implemented much later.
    • Rows in system tables – pg_attribute – 57 rows, pg_class – 8 rows
  5. The “cart_items” table stores items selected for purchase in the current session but not yet bought. This table is synchronized with the main table, but a local copy in the session is also maintained. The table contains normalized data, therefore it has only 15 columns, some of which are of the TEXT type, leading to the creation of a TOAST table with one index. It has a primary key ID of the UUID type to avoid collisions across all users, resulting in the creation of one unique index and one default value, but no sequence. There are no additional indexes.
    • Rows in system tables – pg_attribute – 33 rows, pg_class – 4 rows, pg_attrdef – 1 row

The creation of all these user-specific tables results in the insertion of the following number of rows into PostgreSQL system tables – pg_attribute: 173 rows, pg_class: 32 rows, pg_attrdef: 3 rows.

Analysis of traffic

As the first step we provide an analysis of the business use case and traffic seasonality. Let’s imagine our retailer is active in several EU countries and targets mainly people from 15 to 35 years old. The online shop is relatively new, so it currently has 100,000 accounts. Based on white papers available on the internet, we can presume the following user activity:

Level of user’s activityRatio of users [%]Total count of usersFrequency of visits on page
very active10%10,0002x to 4x per week
normal activity30%30,000~1 time per week
low activity40%40,0001x to 2x per month
almost no activity20%20,000few times in year

Since this is an online shop, traffic is highly seasonal. Items are primarily purchased by individuals for personal use. Therefore, during the working day, they check the shop at very specific moments, such as during travel or lunchtime. The main peak in traffic during the working day is between 7pm and 9pm. Fridays usually have much lower traffic, and the weekend follows suit. The busiest days are generally at the end of the month, when people receive their salaries. The shop experiences the heaviest traffic on Thanksgiving Thursday and Black Friday. The usual practice in recent years is to close the shop for an hour or two and then reopen at a specific hour with reduced prices. Which translates into huge number of relations being created and later deleted at relatively short time. The duration of a user’s connection can range from just a few minutes up to half an hour. User-specific tables are created when user logs into shop. They are later deleted by a special process that uses a sophisticated algorithm to determine whether relations already expired or not. This process involves various criteria and runs at distinct intervals, so we can see a large number of relations deleted in one run. Let’s quantify these descriptions:

Traffic on different daysLogins per 30 minpg_attribute [rows]pg_class [rows]pg_attrdef [rows]
Numbers from analysis per 1 user1173323
Average traffic in the afternoon1,000173,00032,0003,000
Normal working day evening top traffic3,000519,00096,0009,000
Evening after salary low traffic8,0001,384,000256,00024,000
Evening after salary high traffic15,0002,595,000480,00045,000
Singles’ Day evening opening40,0006,920,0001,280,000120,000
Thanksgiving Thursday evening opening60,00010,380,0001,920,000180,000
Black Friday evening opening50,0008,650,0001,600,000150,000
Black Friday weekend highest traffic20,0003,460,000640,00060,000
Theoretical maximum – all users connected100,00017,300,0003,200,000300,000

Now we can see what scalability means. Our solution will definitely work reasonably on normal days. However, traffic in the evenings after people receive their salaries can be very heavy. Thanksgiving Thursday and Black Friday really test its limits. Between 1 and 2 million user-specific tables and related objects will be created and deleted during these evenings. And what happens if our shop becomes even more successful and the number of accounts grows to 500 000, 1 million or more? The solution would definitely hit the limits of vertical scaling at some points, and we would need to think about ways to scale it horizontally.

How to examine bloat

Analysis of traffic provided some theoretical numbers. But we need to check the real-time situation in our database. First, if we’re unsure about what’s happening in our system regarding the creation and deletion of relations, we can temporarily switch on extended logging. We can set ‘log_statements’ to at least ‘ddl’ to see all CREATE/ ALTER /DROP commands. To monitor long running vacuum actions we can set ‘log_autovacuum_min_duration’ to some reasonable low number like 2 seconds. These settings are both dynamic and do not require a restart. However, this change may increase disk IO on local servers due to the increased writes into PostgreSQL logs. On cloud databases or Kubernetes clusters, log messages are usually sent to a separate subsystem and stored independently of the database disk, so the impact should be minimal. To check existing bloats in PostgreSQL tables, we can use the ‘pgstattuple’ extension. This extension only creates new functions; it does not influence the performance of the database. It can only cause reads when we invoke some of its functions. By using its functions in combination with results from other PostgreSQL system objects, we can get a better picture of the bloat in the PostgreSQL system tables. The pg_relation_size function was added to double-check the numbers from the pgstattuple function.
WITH tablenames AS (SELECT tablename FROM (VALUES('pg_attribute'),('pg_attrdef'),('pg_class')) as t(tablename))
SELECT
   tablename,
   now() as checked_at,
   pg_relation_size(tablename) as relation_size,
   pg_relation_size(tablename) / (8*1024) as relation_pages,
   a.*,
   s.*
FROM tablenames t
JOIN LATERAL (SELECT * FROM pgstattuple(t.tablename)) s ON true
JOIN LATERAL (SELECT last_autovacuum, last_vacuum, last_autoanalyze, last_analyze, n_live_tup, n_dead_tup
FROM pg_stat_all_tables WHERE relname = t.tablename) a ON true
ORDER BY tablename
We will get output like this one (result is shown only for 1 table)
 tablename         | pg_attribute
 checked_at        | 2024-02-18 10:46:34.348105+00
 relation_size     | 44949504
 relation_pages    | 5487
 last_autovacuum   | 2024-02-16 20:07:15.7767+00
 last_vacuum       | 2024-02-16 20:55:50.685706+00
 last_autoanalyze  | 2024-02-16 20:07:15.798466+00
 last_analyze      | 2024-02-17 22:05:43.19133+00
 n_live_tup        | 3401
 n_dead_tup        | 188221
 table_len         | 44949504
 tuple_count       | 3401
 tuple_len         | 476732
 tuple_percent     | 1.06
 dead_tuple_count  | 107576
 dead_tuple_len    | 15060640
 dead_tuple_percent| 33.51
 free_space        | 28038420
 free_percent      | 62.38
If we attempt some calculations, we’ll find that the summary of numbers from the pgstattuple function does not match the total relation size. Also, the percentages usually don’t add up to 100%. We need to understand these values as estimates, but they still provide a good indication of the scope of the bloat. We can easily modify this query for monitoring purposes. We should certainly monitor at least the relation_size, n_live_tup, and n_dead_tup for these system tables. To run monitoring under a non-superuser account, this account must have been granted or inherited PostgreSQL predefined roles ‘pg_stat_scan_tables’ or ‘pg_monitor’. If we want to dig deeper into the problem and make some predictions, we can, for example, check how many tuples are stored per page in a specific table. With these numbers, we would be able to estimate possible bloat in critical moments. We can use a query like this one:
WITH pages AS (
   SELECT * FROM generate_series(0, (SELECT pg_relation_size('pg_attribute') / 8192) -1) as pagenum),
tuples_per_page AS (
   SELECT pagenum, nullif(sum((t_xmin is not null)::int), 0) as tuples_per_page
   FROM pages JOIN LATERAL (SELECT * FROM heap_page_items(get_raw_page('pg_attribute',pagenum))) a ON true
   GROUP BY pagenum)
SELECT
   count(*) as pages_total,
   min(tuples_per_page) as min_tuples_per_page,
   max(tuples_per_page) as max_tuples_per_page,
   round(avg(tuples_per_page),0) as avg_tuples_per_page,
   mode() within group (order by tuples_per_page) as mode_tuples_per_page
FROM tuples_per_page
Output will look like this:
 pages_total          | 5487
 min_tuples_per_page  | 1
 max_tuples_per_page  | 55
 avg_tuples_per_page  | 23
 mode_tuples_per_page | 28

Here, we can see that in our pg_attribute system table, we have an average of 23 tuples per page. So now we can calculate theoretical increase in size of this table for different traffic. Typical size of this table is usually only few hundreds of MBs. So theoretical bloat about 3 GB during Black Friday days is quite significant number for this table.

Loginspg_attribute rowsdata pagessize in MB
117380.06
1,000173,0007,52258.77
3,000519,00022,566176.30
15,0002,595,000112,827881.46
20,0003,460,000150,4351,175.27
60,00010,380,000451,3053,525.82
100,00017,300,000752,1745,876.36

Summary

We’ve presented a reporting example from accounting software and an example of user-specific tables from an online shop. While both are theoretical, the idea is to illustrate patterns. We also discussed the influence of high traffic seasonality on the number of inserts and deletes in system tables. We provided an example of an extremely increased load in an online shop on big sales days. We believe the results of the analysis warrant attention. It’s also important to remember that the already heavy situation in these peak moments can be even more challenging if our application is running on an instance with low disk IOPS. All these new objects would cause writes into WAL logs and synchronization to the disk. In the case of low disk throughput, there could be significant latency, and many operations could be substantially delayed. So, what’s the takeaway from this story? First of all, PostgreSQL autovacuum processes are designed to minimize the impact on the system. If the autovacuum settings on our database are well-tuned, in most cases, we won’t see any problems. However, if these settings are outdated, tailored for much lower traffic, and our system is under unusually heavy load for an extended period, creating and dropping thousands of tables and related objects in a relatively short time, PostgreSQL system tables can eventually become significantly bloated. This will already slow down system queries reading details about all other relations. And at some point, the system could decide to shrink these system tables, causing an Exclusive lock on some of these relations for seconds or even dozens of seconds. This could block a large number of selects and other operations on all tables. Based on analysis of traffic, we can conduct a similar analysis for other specific systems to understand when they will be most susceptible to such incidents. But having effective monitoring is absolutely essential.

Resources

  1. Understanding an outage: concurrency control & vacuuming in PostgreSQL
  2. Stackoverflow – temporary tables bloating pg_attribute
  3. Diagnosing table and index bloat
  4. What are the peak times for online shopping?
  5. PostgreSQL Tuple-Level Statistics With pgstattuple

The PostgreSQL 2024Q1 back-branch releases 16.2, 15.6, 14.11, 13.14 and 12.18 on February 8th 2024. Besides fixing a security issue (CVE-2024-0985) and the usual bugs, they are somewhat unique in that they address two performance problems by backporting fixes already introduced into the master branch before. In this blog post, we describe two quick benchmarks that show how the new point releases have improved. The benchmarks were done on a ThinkPad T14s Gen 3 which has a Intel i7-1280P CPU with 20 cores and 32 GB of RAM.

Scalability Improvements During Heavy Contention

The performance improvements in the 2024Q1 point releases concerns locking scalability improvements at high client counts, i.e., when the system is under heavy contention. Benchmarks had shown that the performance was getting worse dramatically for a pgbench run with more than 128 clients. The original commit to master (which subsequently was released with version 16) is from November 2022. It got introduced into the back-branches now as version 16 has seen some testing and the results were promising.

The benchmark we used is adapted from this post by the patch author and consists of a tight pgbench run simply executing SELECT txid_current() for five seconds each at increasing client count and measuring the transactions per second:

$ cat /tmp/txid.sql
SELECT txid_current();
$ for c in 1 2 4 8 16 32 64 96 128 192 256 384 512 768 1024 1536;
> do echo -n "$c ";pgbench -n -M prepared -f /tmp/txid.sql -c$c -j$c -T5 2>&1|grep '^tps'|awk '{print $3}';
> done

The following graph shows the average transactions per second (tps) over 3 runs with increasing client count (1-1536 clients), using the Debian 12 packages for version 15, comparing the 2023Q4 release (15.5, package postgresql-15_15.5-0+deb12u1) with the 2024Q1 release (15.6, package postgresql-15_15.6-0+deb12u1):

The tps numbers are basically the same up to 128 clients, whereas afterwards the 15.5 transaction counts drops from the peak of 650k 10-fold to 65k. The new 15.6 release maintains the transaction count much better and still maintains around 300k tps at the 1536 clients, which is a 4.5-fold increase of the 2024Q1 release compared to previously.

This benchmark is of course a best-case, artificial scenario, but it shows that the latest point release of Postgres can improve scalability dramatically for heavily contested locking scenarios.

JIT Memory Consumption Improvements

JIT (just-in-time compilation with LLVM) was introduced in version 11 of Postgres and made the default in version 13. For a long time now, it has been known that long-running PostgreSQL sessions that run JIT queries repeatedly leak memory. There have been several bug reports about this, including some more in the Debian bug tracker and probably elsewhere.

This has been diagnosed to be due to JIT inlining and a work-around is setting jit_inline_above_cost to -1 from the default value of 500000. However, this disables JIT inlining completely. The 2024Q1 back-branch releases contain a backport of a change that will go into version 17: after every 100 queries, the LLVM caches are dropped and recreated, plugging the memory leak.

To show how the memory consumption has improved, we use the test case from this bug report. The benchmark is prepared as followed:

CREATE TABLE IF NOT EXISTS public.leak_test
(
   id integer NOT NULL,
   CONSTRAINT leak_test_pkey PRIMARY KEY (id)
);

INSERT INTO leak_test(id)
   SELECT id
   FROM generate_series(1,100000) id
ON CONFLICT DO NOTHING;

Then, the process ID of the backend is noted and the SQL query mentioned in the bug report run 5000 times in a loop:

=> SELECT pg_backend_pid();
 pg_backend_pid
----------------
         623404

=> DO $$DECLARE loop_cnt integer;
-> BEGIN
->   loop_cnt := 5000;
->   LOOP
->     PERFORM
->       id,
->       (SELECT count(*) FROM leak_test x WHERE x.id=l.id) as x_result,
->       (SELECT count(*) FROM leak_test y WHERE y.id=l.id) as y_result
->       /* Leaks memory around 80 kB on each query, but only if two sub-queries are used. */
->     FROM leak_test l;
->     loop_cnt := loop_cnt - 1;
->     EXIT WHEN loop_cnt = 0;
->   END LOOP;
-> END$$;

During this the memory consumption of the Postgres backend is recorded via pidstat:

pidstat -r -hl -p 623404 2 | tee -a leak_test.log.15.6
Linux 6.1.0-18-amd64 (mbanck-lin-0.credativ.de)     15.02.2024  _x86_64_    (20 CPU)

# Time        UID       PID  minflt/s  majflt/s     VSZ     RSS   %MEM  Command
12:48:56      118    623404      0,00      0,00  381856   91504   0,28  postgres: 15/main: postgres postgres [local] SELECT
12:48:58      118    623404      0,00      0,00  381856   91504   0,28  postgres: 15/main: postgres postgres [local] SELECT
12:49:00      118    623404      0,00      0,00  381856   91504   0,28  postgres: 15/main: postgres postgres [local] SELECT
12:49:02      118    623404      0,00      0,00  381856   91504   0,28  postgres: 15/main: postgres postgres [local] SELECT
12:49:04      118    623404   7113,00      0,00  393632  109252   0,34  postgres: 15/main: postgres postgres [local] SELECT
12:49:06      118    623404  13219,00      0,00  394556  109508   0,34  postgres: 15/main: postgres postgres [local] SELECT
12:49:08      118    623404  14376,00      0,00  395384  108228   0,33  postgres: 15/main: postgres postgres [local] SELECT
[...]

The benchmark are again repeated for the 15.5 and 15.6 Debian 12 packages (which are both linked against LLVM-14) and the RSS memory consumption as reported by pidstat is plotted against time:

While the memory consumption of the 15.5 session increases linearly over time from 100 to 600 MB, it stays more or less constant at around 100 MB for 15.6. This is a great improvement that will make JIT much more usable for larger installations with long running sessions where so far the usual recommendation has been to disable JIT entirely.

Conclusion

The 2024Q1 patch release has important performance improvements for lock scalability and JIT memory consumption that we have demonstrated in this blog post. Furthermore, the patch release contains other important bug fixes and a security fix for CVE-2024-0985. This security issue is limited to materialized views and a admin user needs to be tricked into recreating a malicious materialized view on behalf of an attacker. But it has seen some german press coverage so quite a few of our customers were especially made aware of it and asked us to assist them with their minor upgrades. In general, Postgres patch releases are low-risk and unintrusive (just install the updated packages and restart the Postgres instances if the package did not do this itself) so that they should always be deployed as soon as possible.

On June 27, it was that time again: we were able to attend PGConf.DE 2023. This year, the event was held at the Haus der Technik in Essen, marking the conference’s return to the Ruhr region for the first time since 2013. In addition to the usual excellent talks, catering, and atmosphere, a new visitor record was set with approximately 250 participants, making this year’s edition the largest German PostgreSQL event to date. Instaclustr was also represented for the first time this year as a gold sponsor with a booth. This allowed interested parties to learn about Instaclustr Managed Services and other offerings. Admission was at 8 a.m., and after setting up the booth, some initial small talk, and a welcome speech, the first talks started right on time at 9:10 a.m. The conference offered three parallel talks, allowing attendees to choose speakers and topics of interest in each slot. One of the speakers was our senior consultant, Michael Banck. He gave a presentation on secure PostgreSQL operation in accordance with BSI basic protection.

Michael Banck before his presentation: Secure PostgreSQL operation in accordance with BSI basic protection

Topics such as performance optimization, high availability (with Patroni, for example), and technical operations were among those on offer. There were also some very entertaining reports from the world of PostgreSQL consultants. My personal highlight was Laurenz Albe’s talk on data corruption. He presented the topic in a very practical and clear way and also showed very clear rules and ways to best deal with such a situation. PGConf.DE 2023 is a great event for training and networking. We are already looking forward to next year!

Moodle is a popular Open Source online learning platform. Especially since the beginning of the COVID-19 pandemic the importance of Moodle for schools and universities has further increased. In some states in Germany all schools had to switch to Moodle and other platforms like BigBlueButton in the course of a few days. This leads to scalability problems if suddenly several tens of thousands of pupils need to access Moodle.
Besides scaling the Moodle application itself, the database needs to be considered as well. One of the database options for Moodle is PostgreSQL. In this blog post, we present load-balancing options for Moodle using PostgreSQL.

High-Availability via Patroni

An online learning platform can be considered critical infrastructure from the point of view of the educational system and should be made highly available, in particular the database. A good solution for PostgreSQL is Patroni, we reported on its Debian-integration in the past.

In short, Patroni uses a distributed consensus store (DCS) to elect a leader from a typically 3-node cluster or initiate a failover and elect a new leader in the case of a leader failure, without entering a split-brain scenario. In addition, Patroni provides a REST API used for communication among nodes and from the patronictl program, e.g. to change the Postgres configuration online on all nodes or to initiate a switchover.

Client-solutions for high availability

From Moodle’s perspective, however, it must additionally be ensured that it is connected to the leader, otherwise no write transactions are possible. Traditional high-availability solutions such as Pacemaker use virtual IPs (VIPs) here, which are pivoted to the new primary node in the event of a failover. For Patroni there is the vip-manager project instead, which monitors the leader key in the DCS and sets or removes cluster VIP locally. This is also integrated into Debian as well.

An alternative is to use client-side failover based on PostgreSQL’s libpq library. For this, all cluster members are listed in the connection string and the connection option target_session_attrs=read-write is added. Configured this way, if a connection is broken, the client will try to reach the other nodes until a new primary is found.

Another option is HAProxy, a highly scalable TCP/HTTP load balancer. By performing periodic health checks on Patroni‘s REST API of each node, it can determine the current leader and forward client queries to it.

Moodle database configuration

Moodle’s connection to a PostgreSQL database is configured in config.php, e.g. for a simple stand-alone database:

$CFG->dbtype    = 'pgsql';
$CFG->dblibrary = 'native';
$CFG->dbhost    = '192.168.1.1';
$CFG->dbname    = 'moodle';
$CFG->dbuser    = 'moodle';
$CFG->dbpass    = 'moodle';
$CFG->prefix    = 'mdl_';
$CFG->dboptions = array (
  'dbport' => '',
  'dbsocket' => ''
);

The default port 5432 is used here.

If streaming replication is used, the standbys can additionally be defined as readonly and assigned to an own database user (which only needs read permissions):

$CFG->dboptions = array (
[...]
  'readonly' => [
    'instance' => [
      [
      'dbhost' => '192.168.1.2',
      'dbport' =>  '',
      'dbuser' => 'moodle_safereads',
      'dbpass' => 'moodle'
      ],
      [
      'dbhost' => '192.168.1.3',
      'dbport' =>  '',
      'dbuser' => 'moodle_safereads',
      'dbpass' => 'moodle'
      ]
    ]
  ]
);

Failover/load balancing with libpq

If a highly available Postgres cluster is used with Patroni, the primary, as described above, can be switched to prevent loss of data or shutdown of the system, in case of a failover or switchover incident. Moodle does not provide a way to set generic database options here and thus setting target_session_attrs=read-write directly is not possible. Therefore we developed a patch for this and implemented it in the Moodle tracker. This allows the additional option 'dbfailover' => 1, in the $CFG->dboptions array, which adds the necessary connection option target_session_attrs=read-write. A customized config.php would look like this:

$CFG->dbtype    = 'pgsql';
$CFG->dblibrary = 'native';
$CFG->dbhost    = '192.168.1.1,192.168.1.2,192.168.1.3';
$CFG->dbname    = 'moodle';
$CFG->dbuser    = 'moodle';
$CFG->dbpass    = 'moodle';
$CFG->prefix    = 'mdl_';
$CFG->dboptions = array (
  'dbfailover' => 1,
  'dbport' => '',
  'dbsocket' => '',
  'readonly' => [
    'instance' => [
      [
      'dbhost' => '192.168.1.1',
      'dbport' =>  '',
      'dbuser' => 'moodle_safereads',
      'dbpass' => 'moodle'
      ],
      [
      'dbhost' => '192.168.1.2',
      'dbport' =>  '',
      'dbuser' => 'moodle_safereads',
      'dbpass' => 'moodle'
      ],
      [
      'dbhost' => '192.168.1.3',
      'dbport' =>  '',
      'dbuser' => 'moodle_safereads',
      'dbpass' => 'moodle'
      ]
    ]
  ]
);

Failover/load balancing with HAProxy

If HAProxy is to be used instead, then $CFG->dbhost must be set to the HAProxy host e.g. 127.0.0.1 in case HAProxy is running locally on the Moodle server(s). Moreover a second port (e.g. 65432) can be defined for read queries, which is configured as readonly in $CFG->dboptions, same as the streaming replication standby above. The config.php would then look like this:

$CFG->dbtype    = 'pgsql';
$CFG->dblibrary = 'native';
$CFG->dbhost    = '127.0.0.1';
$CFG->dbname    = 'moodle';
$CFG->dbuser    = 'moodle';
$CFG->dbpass    = 'moodle';
$CFG->prefix    = 'mdl_';
$CFG->dboptions = array (
  'dbport' => '',
  'dbsocket' => '',
  'readonly' => [
    'instance' => [
      'dbhost' => '127.0.0.1',
      'dbport' =>  '65432',
      'dbuser' => 'moodle_safereads',
      'dbpass' => 'moodle'
    ]
  ]
);

The HAProxy configuration file haproxy.cfg can look like the following example:

global
    maxconn 100

defaults
    log global
    mode tcp
    retries 2
    timeout client 30m
    timeout connect 4s
    timeout server 30m
    timeout check 5s

listen stats
    mode http
    bind *:7000
    stats enable
    stats uri /

listen postgres_write
    bind *:5432
    mode tcp
    option httpchk
    http-check expect status 200
    default-server inter 3s fall 3 rise 3 on-marked-down shutdown-sessions
    server pg1 192.168.1.1:5432 maxconn 100 check port 8008
    server pg2 192.168.1.2:5432 maxconn 100 check port 8008
    server pg3 192.168.1.3:5432 maxconn 100 check port 8008

HAProxy expects incoming write connections (postgres_write) on port 5432 and forwards them to port 5432 of the cluster members. The primary is determined by an HTTP check on port 8008 (the default Patroni REST API port); Patroni returns status 200 here for the primary and status 503 for standbys.

For read queries (postgres_read), it must be decided whether the primary should also serve read-only queries or not. If this is the case, a simple Postgres check (pgsql-check) can be used; however, this may lead to entries in the PostgreSQL log regarding incorrect or incomplete logins:

listen postgres_read
    bind *:65432
    mode tcp
    balance leastconn
    option pgsql-check user haproxy
    default-server inter 3s fall 3 rise 3 on-marked-down shutdown-sessions
    server pg1 192.168.1.1:5432 check
    server pg2 192.168.1.2:5432 check
    server pg3 192.168.1.3:5432 check

If you don’t want the primary to participate in the read scaling you can simply use the same HTTP check as in the postgres_write section, this time expecting HTTP status 503:

listen postgres_read
    bind *:65432
    mode tcp
    balance leastconn
    option httpchk
    http-check expect status 503
    default-server inter 3s fall 3 rise 3 on-marked-down shutdown-sessions
    server pg1 192.168.1.1:5432 check port 8008
    server pg2 192.168.1.2:5432 check port 8008
    server pg3 192.168.1.3:5432 check port 8008

Revised Ansible playbook

HAProxy support has also been implemented in version 0.3 of our Ansible playbooks for automated setup of a three-node PostgreSQL Patroni cluster on Debian. The new variable haproxy_primary_read_scale can be used to decide whether HAProxy should also issue requests on the read-only port to the primary node or only to the followers.

We are happy to help!

Whether it’s PostgreSQL, Patroni, HAProxy, Moodle, or any other open source software; with over 22+ years of development and service experience in the open source space, credativ GmbH can assist you with unparalleled and individually customizable support. We are there to help and assist you in all your open source infrastructure needs – if desired 24 hours a day, 365 days a year!

We look forward to hearing from you.

SQLreduce: Reduce verbose SQL queries to minimal examples

Developers often face very large SQL queries that raise some errors. SQLreduce is a tool to reduce that complexity to a minimal query.

SQLsmith generates random SQL queries

SQLsmith is a tool that generates random SQL queries and runs them against a PostgreSQL server (and other DBMS types). The idea is that by fuzz-testing the query parser and executor, corner-case bugs can be found that would otherwise go unnoticed in manual testing or with the fixed set of test cases in PostgreSQL’s regression test suite. It has proven to be an effective tool with over 100 bugs found in different areas in the PostgreSQL server and other products since 2015, including security bugs, ranging from executor bugs to segfaults in type and index method implementations. For example, in 2018, SQLsmith found that the following query triggered a segfault in PostgreSQL:

select
  case when pg_catalog.lastval() < pg_catalog.pg_stat_get_bgwriter_maxwritten_clean() then case when pg_catalog.circle_sub_pt(
          cast(cast(null as circle) as circle),
          cast((select location from public.emp limit 1 offset 13)
             as point)) ~ cast(nullif(case when cast(null as box) &> (select boxcol from public.brintest limit 1 offset 2)
                 then (select f1 from public.circle_tbl limit 1 offset 4)
               else (select f1 from public.circle_tbl limit 1 offset 4)
               end,
          case when (select pg_catalog.max(class) from public.f_star)
                 ~~ ref_0.c then cast(null as circle) else cast(null as circle) end
            ) as circle) then ref_0.a else ref_0.a end
       else case when pg_catalog.circle_sub_pt(
          cast(cast(null as circle) as circle),
          cast((select location from public.emp limit 1 offset 13)
             as point)) ~ cast(nullif(case when cast(null as box) &> (select boxcol from public.brintest limit 1 offset 2)
                 then (select f1 from public.circle_tbl limit 1 offset 4)
               else (select f1 from public.circle_tbl limit 1 offset 4)
               end,
          case when (select pg_catalog.max(class) from public.f_star)
                 ~~ ref_0.c then cast(null as circle) else cast(null as circle) end
            ) as circle) then ref_0.a else ref_0.a end
       end as c0,
  case when (select intervalcol from public.brintest limit 1 offset 1)
         >= cast(null as "interval") then case when ((select pg_catalog.max(roomno) from public.room)
             !~~ ref_0.c)
        and (cast(null as xid) <> 100) then ref_0.b else ref_0.b end
       else case when ((select pg_catalog.max(roomno) from public.room)
             !~~ ref_0.c)
        and (cast(null as xid) <> 100) then ref_0.b else ref_0.b end
       end as c1,
  ref_0.a as c2,
  (select a from public.idxpart1 limit 1 offset 5) as c3,
  ref_0.b as c4,
    pg_catalog.stddev(
      cast((select pg_catalog.sum(float4col) from public.brintest)
         as float4)) over (partition by ref_0.a,ref_0.b,ref_0.c order by ref_0.b) as c5,
  cast(nullif(ref_0.b, ref_0.a) as int4) as c6, ref_0.b as c7, ref_0.c as c8
from
  public.mlparted3 as ref_0
where true;

However, just like in this 40-line, 2.2kB example, the random queries generated by SQLsmith that trigger some error are most often very large and contain a lot of noise that does not contribute to the error. So far, manual inspection of the query and tedious editing was required to reduce the example to a minimal reproducer that developers can use to fix the problem.

Reduce complexity with SQLreduce

This issue is solved by SQLreduce. SQLreduce takes as input an arbitrary SQL query which is then run against a PostgreSQL server. Various simplification steps are applied, checking after each step that the simplified query still triggers the same error from PostgreSQL. The end result is a SQL query with minimal complexity.

SQLreduce is effective at reducing the queries from original error reports from SQLsmith to queries that match manually-reduced queries. For example, SQLreduce can effectively reduce the above monster query to just this:

SELECT pg_catalog.stddev(NULL) OVER () AS c5 FROM public.mlparted3 AS ref_0

Note that SQLreduce does not try to derive a query that is semantically identical to the original, or produces the same query result – the input is assumed to be faulty, and we are looking for the minimal query that produces the same error message from PostgreSQL when run against a database. If the input query happens to produce no error, the minimal query output by SQLreduce will just be SELECT.

How it works

We’ll use a simpler query to demonstrate how SQLreduce works and which steps are taken to remove noise from the input. The query is bogus and contains a bit of clutter that we want to remove:

$ psql -c 'select pg_database.reltuples / 1000 from pg_database, pg_class where 0 < pg_database.reltuples / 1000 order by 1 desc limit 10'
ERROR:  column pg_database.reltuples does not exist

Let’s pass the query to SQLreduce:

$ sqlreduce 'select pg_database.reltuples / 1000 from pg_database, pg_class where 0 < pg_database.reltuples / 1000 order by 1 desc limit 10'

SQLreduce starts by parsing the input using pglast and libpg_query which expose the original PostgreSQL parser as a library with Python bindings. The result is a parse tree that is the basis for the next steps. The parse tree looks like this:

selectStmt
├── targetList
│   └── /
│       ├── pg_database.reltuples
│       └── 1000
├── fromClause
│   ├── pg_database
│   └── pg_class
├── whereClause
│   └── <
│       ├── 0
│       └── /
│           ├── pg_database.reltuples
│           └── 1000
├── orderClause
│   └── 1
└── limitCount
    └── 10

Pglast also contains a query renderer that can render back the parse tree as SQL, shown as the regenerated query below. The input query is run against PostgreSQL to determine the result, in this case ERROR: column pg_database.reltuples does not exist.

Input query: select pg_database.reltuples / 1000 from pg_database, pg_class where 0 < pg_database.reltuples / 1000 order by 1 desc limit 10
Regenerated: SELECT pg_database.reltuples / 1000 FROM pg_database, pg_class WHERE 0 < ((pg_database.reltuples / 1000)) ORDER BY 1 DESC LIMIT 10
Query returns: ✔ ERROR:  column pg_database.reltuples does not exist

SQLreduce works by deriving new parse trees that are structurally simpler, generating SQL from that, and run these queries against the database. The first simplification steps work on the top level node, where SQLreduce tries to remove whole subtrees to quickly find a result. The first reduction tried is to remove LIMIT 10:

SELECT pg_database.reltuples / 1000 FROM pg_database, pg_class WHERE 0 < ((pg_database.reltuples / 1000)) ORDER BY 1 DESC ✔

The query result is still ERROR: column pg_database.reltuples does not exist, indicated by a ✔ check mark. Next, ORDER BY 1 is removed, again successfully:

SELECT pg_database.reltuples / 1000 FROM pg_database, pg_class WHERE 0 < ((pg_database.reltuples / 1000)) ✔

Now the entire target list is removed:

SELECT FROM pg_database, pg_class WHERE 0 < ((pg_database.reltuples / 1000)) ✔

This shorter query is still equivalent to the original regarding the error message returned when it is run against the database. Now the first unsuccessful reduction step is tried, removing the entire FROM clause:

SELECT WHERE 0 < ((pg_database.reltuples / 1000)) ✘ ERROR:  missing FROM-clause entry for table "pg_database"

That query is also faulty, but triggers a different error message, so the previous parse tree is kept for the next steps. Again a whole subtree is removed, now the WHERE clause:

SELECT FROM pg_database, pg_class ✘ no error

We have now reduced the input query so much that it doesn’t error out any more. The previous parse tree is still kept which now looks like this:

selectStmt
├── fromClause
│   ├── pg_database
│   └── pg_class
└── whereClause
    └── <
        ├── 0
        └── /
            ├── pg_database.reltuples
            └── 1000

Now SQLreduce starts digging into the tree. There are several entries in the FROM clause, so it tries to shorten the list. First, pg_database is removed, but that doesn’t work, so pg_class is removed:

SELECT FROM pg_class WHERE 0 < ((pg_database.reltuples / 1000)) ✘ ERROR:  missing FROM-clause entry for table "pg_database"
SELECT FROM pg_database WHERE 0 < ((pg_database.reltuples / 1000)) ✔

Since we have found a new minimal query, recursion restarts at top-level with another try to remove the WHERE clause. Since that doesn’t work, it tries to replace the expression with NULL, but that doesn’t work either.

SELECT FROM pg_database ✘ no error
SELECT FROM pg_database WHERE NULL ✘ no error

Now a new kind of step is tried: expression pull-up. We descend into WHERE clause, where we replace A < B first by A and then by B.

SELECT FROM pg_database WHERE 0 ✘ ERROR:  argument of WHERE must be type boolean, not type integer
SELECT FROM pg_database WHERE pg_database.reltuples / 1000 ✔
SELECT WHERE pg_database.reltuples / 1000 ✘ ERROR:  missing FROM-clause entry for table "pg_database"

The first try did not work, but the second one did. Since we simplified the query, we restart at top-level to check if the FROM clause can be removed, but it is still required.

From A / B, we can again pull up A:

SELECT FROM pg_database WHERE pg_database.reltuples ✔
SELECT WHERE pg_database.reltuples ✘ ERROR:  missing FROM-clause entry for table "pg_database"

SQLreduce has found the minimal query that still raises ERROR: column pg_database.reltuples does not exist with this parse tree:

selectStmt
├── fromClause
│   └── pg_database
└── whereClause
    └── pg_database.reltuples

At the end of the run, the query is printed along with some statistics:

Minimal query yielding the same error:
SELECT FROM pg_database WHERE pg_database.reltuples

Pretty-printed minimal query:
SELECT
FROM pg_database
WHERE pg_database.reltuples

Seen: 15 items, 915 Bytes
Iterations: 19
Runtime: 0.107 s, 139.7 q/s

This minimal query can now be inspected to fix the bug in PostgreSQL or in the application.

About credativ

The credativ GmbH is a manufacturer-independent consulting and service company located in Moenchengladbach, Germany. With over 22+ years of development and service experience in the open source space, credativ GmbH can assist you with unparalleled and individually customizable support. We are here to help and assist you in all your open source infrastructure needs.

Since the successful merger with Instaclustr in 2021, credativ GmbH has been the European headquarters of the Instaclustr Group, which helps organizations deliver applications at scale through its managed platform for open source technologies such as Apache Cassandra®, Apache Kafka®, Apache Spark™, Redis™, OpenSearch®, PostgreSQL®, and Cadence.
Instaclustr combines a complete data infrastructure environment with hands-on technology expertise to ensure ongoing performance and optimization. By removing the infrastructure complexity, we enable companies to focus internal development and operational resources on building cutting edge customer-facing applications at lower cost. Instaclustr customers include some of the largest and most innovative Fortune 500 companies.

pg_dirtyread

I recently updated pg_dirtyread to work with PostgreSQL 14. pg_dirtyread is a PostgreSQL extension that allows reading “dead” rows from tables, i.e., rows that have already been deleted or updated. Of course, this only works if the table has not yet been cleaned by a VACUUM command or autovacuum, PostgreSQL’s garbage collection.

Here is an example of pg_dirtyread in action:

# create table foo (id int, t text);
CREATE TABLE
# insert into foo values (1, 'Doc1');
INSERT 0 1
# insert into foo values (2, 'Doc2');
INSERT 0 1
# insert into foo values (3, 'Doc3');
INSERT 0 1

# select * from foo;
 id │ t
────┼──────
 1 │ Doc1
 2 │ Doc2
 3 │ Doc3
(3 rows)

# delete from foo where id < 3;
DELETE 2

# select * from foo;
 id │ t
────┼──────
  3 │ Doc3
(1 row)

Oops! The first two documents are gone.

Let’s now use pg_dirtyread to view the table:

# create extension pg_dirtyread;
CREATE EXTENSION

# select * from pg_dirtyread('foo') t(id int, t text);
  id │ t
────┼──────
  1 │ Doc1
  2 │ Doc2
  3 │ Doc3

All three documents are still present, but only one of them is visible.

pg_dirtyread can also display the system columns of PostgreSQL with information about the position and visibility of the rows. For the first two documents, xmax is set, which means that the row has been deleted:

# select * from pg_dirtyread('foo') t(ctid tid, xmin xid, xmax xid, id int, t text);

 ctid │ xmin │ xmax │ id │ t
───────┼──────┼──────┼────┼──────
 (0,1) │ 1577 │ 1580 │ 1 │ Doc1
 (0,2) │ 1578 │ 1580 │ 2 │ Doc2
 (0,3) │ 1579 │ 0 │ 3 │ Doc3
(3 rows)

Undelete

Caveat: I cannot promise that any of the ideas listed below will work in practice. There are some uncertainties and a good deal of complicated knowledge about PostgreSQL internals may be required to succeed. I would ask you to consider consulting your preferred PostgreSQL service provider for advice before attempting to restore data on a production system. Do not try this at work!

I always had plans to extend pg_dirtyread with an “Undelete” command to make deleted rows reappear, but unfortunately, I never got around to trying it out in the past. However, rows can already be restored by using the output of pg_dirtyread itself:

# insert into foo select * from pg_dirtyread('foo') t(id int, t text) where id = 1;

However, this is not a real “Undelete” or “undo” – it only inserts new rows from the data read from the table.

pg_surgery

Let’s welcome pg_surgery, a new PostgreSQL extension that comes with PostgreSQL 14. It contains two functions for “surgery on a broken relationship”. As a side effect, they can also make deleted tuples reappear.

As I have now discovered, one of the functions, heap_force_freeze(), works well with pg_dirtyread. It takes a list of ctids (row positions) that it marks as “frozen” but also as “not deleted”.

Let’s apply it to our test table using the ctids that pg_dirtyread can read:

# create extension pg_surgery;
CREATE EXTENSION

# select heap_force_freeze('foo', array_agg(ctid))
    from pg_dirtyread('foo') t(ctid tid, xmin xid, xmax xid, id int, t text) where id = 1;
 heap_force_freeze
───────────────────

(1 row)

And voilà, our deleted document is back:

# select * from foo;
  id │ t
────┼──────
  1 │ Doc1
  3 │ Doc3
(2 rows)

#select * from pg_dirtyread('foo') t(ctid tid, xmin xid, xmax xid, id int, t text);
  ctid │ xmin │ xmax │ id │ t
───────┼──────┼──────┼────┼──────
 (0,1) │ 2 │ 0 │ 1 │ Doc1
  (0,2) │ 1578 │ 1580 │ 2 │ Doc2
  (0,3) │ 1579 │ 0 │ 3 │ Doc3
(3 rows)

Disclaimer

Most importantly, none of the above methods will work if the data you just deleted has already been cleaned by VACUUM or Autovacuum. This actively deletes the recovered storage space. Restore your data from the backup.

Since both pg_dirtyread and pg_surgery operate outside of the normal PostgreSQL MVCC machinery, it is easy to create corrupt data with them. This includes duplicate rows, duplicate primary key values, indexes that are not synchronized with the tables, broken foreign key constraints, and others. You have been warned.

Furthermore, pg_dirtyread does not (yet) work if the deleted rows contain any toasted values. Possible other approaches are the use of pageinspect and pg_filedump to get the ctids of the deleted rows. Please make sure you have working backups and do not need any of the above methods.

We are Happy to Support You

The necessity of the above operations can be completely prevented in most cases by professional administration or setup of the PostgreSQL infrastructure by a competent service provider. If you are currently facing one of these problems and are looking for ways out of a predicament, please contact us by email at info@credativ.de or via the contact form.

With over 22+ years of development and service experience in the PostgreSQL and Open Source area, credativ GmbH can professionally accompany you with an unprecedented and individually configurable support and fully support you in all questions regarding your Open Source infrastructure.

 

This article was originally written by Christoph Berg

Also Worth Reading

Congratulations to the Debian Community

The Debian Project just released version 11 (aka “bullseye”) of their free operating system. In total, over 6,208 contributors worked on this release and were indispensable in making this launch happen. We would like to thank everyone involved for their combined efforts, hard work, and many hours pent in recent years building this new release that will benefit the entire open source community.

We would also like to acknowledge our in-house Debian developers who contributed to this effort. We really appreciate the work you do on behalf of the community and stand firmly behind your contributions.

What’s New in Debian 11 Bullseye

Debian 11 comes with a number of meaningful changes and enhancements. The new release includes over 13,370 new software packages, for a total of over 57,703 packages on release. Out of these, 35,532 packages have been updated to newer versions, including an update in the kernel from 4.19 in “buster” to 5.10 in bullseye.

Bullseye expands on the capabilities of driverless printing with Common Unix Printing System (CUPS) and driverless scanning with Scanner Access Now Easy (SANE). While it was possible to use CUPS for driverless printing with buster, bullseye comes with the package ipp-usb, which allows a USB device to be treated as a network device and thus extend driverless printing capabilities. SANE connects to this when set up correctly and connected to a USB port.

As in previous releases, Debian 11 comes with a Debian Edu / Skolelinux version. Debian Edu has been a complete solution for schools for many years. Debian Edu can provide the entire network for a school and then only users and machines need to be added after installation. This can also be easily managed via the web interface GOsa².

Debian 11 bullseye can be downloaded here.
https://www.debian.org/devel/debian-installer/index.en.html

For more information and greater technical detail on the new Debian 11 release, please refer to the official release notes on Debian.org

https://www.debian.org/releases/bullseye/amd64/release-notes/.

Contributions by Instaclustr Employees

Our Debian roots run deep here. credativ, which was acquired by Instaclustr in March 2021, has always been an active part of the Debian community and visited every DebConf since 2004. Debian also serves as the operating system at the heart of the Instaclustr Managed Platform.

For the release of Debian 11, our team has taken over various responsibilities in the community. Our contributions include:

Many of our colleagues have made significant contributions to the current release, including:

How to Upgrade

Given that Debian 11 bullseye is a major release, we suggest that everyone running on Debian 10 buster upgrade. The main steps for an upgrade include:

  1. Make sure to backup any data that should not get lost and prepare for recovery
  2. Remove non-Debian packages and clean up leftover files and old versions
  3. Upgrade to latest point release
  4. Check and prepare your APT source-list files by adding the relevant Internet sources or local mirrors
  5. Upgrade your packages and then upgrade your system

You can find a more detailed walkthrough of the upgrade process in the Debian documentation.

All existing credativ customers who are running a Debian-based installation are naturally covered by our service and support and are encouraged to reach out.

If you are interested in upgrading from your old Debian version, or if you have questions with regards to your Debian infrastructure, do not hesitate to drop us an email or contact us at info@credativ.de.

Or, you can get started in minutes with any one of these open source technologies like Apache Cassandra, Apache Kafka, Redis, and OpenSearch on the Instaclustr Managed Platform. Sign up for a free trial today.

PostgreSQL® is an extremely robust database, to which most of our clients also entrust their data. However, if errors do occur, they are usually due to the storage system, where individual bits or bytes flip, or entire blocks become corrupted. We demonstrate how to recover data from corrupt tables.

In case of an error, the user is confronted with messages that originate from the storage layer or other PostgreSQL® subsystems:

postgres=# select * from t;
ERROR:  missing chunk number 0 for toast value 192436 in pg_toast_192430

postgres=# select * from a;
ERROR:  invalid memory alloc request size 18446744073709551613

If only individual tuples are corrupt, one can partially help by reading them out individually, e.g., according to id, which often does not help further:

select * from t where id = 1;

It is more promising to address the tuples directly by their internal tuple ID, called in PostgreSQL® ctid:

select * from t where ctid = '(0,1)';

To read out all tuples, we use a loop in plpgsql:

for page in 0 .. pages-1 loop
  for item in 1 .. ??? loop
     select * from t where ctid = '('||page||','||item||')' into r;
     return next r;
  end loop;
end loop;

We still need the number of pages in the table, which we get from pg_relation_size(), and the number of tuples on the page, for which we utilize the pageinspect extension.

select pg_relation_size(relname) / current_setting('block_size')::int into pages;

for page in 0 .. pages-1 loop
  for item in select t_ctid from heap_page_items(get_raw_page(relname::text, page)) loop
    SELECT * FROM t WHERE ctid=item into r;
    if r is not null then
      return next r;
    end if;
  end loop;
end loop;

Now comes the most important part: Accessing the damaged tuples or pages causes errors that we must catch with a begin..exception..end block. We pass the error messages to the user as . Furthermore, the function should not only work for one table but also receive a parameter . The entire plpgsql function then looks like this:

create extension pageinspect;

create or replace function read_table(relname regclass)
returns setof record
as $$
declare
  pages int;
  page int;
  ctid tid;
  r record;
  sql_state text;
  error text;
begin
  select pg_relation_size(relname) / current_setting('block_size')::int into pages;

  for page in 0 .. pages-1 loop

    begin

      for ctid in select t_ctid from heap_page_items(get_raw_page(relname::text, page)) loop
        begin
          execute format('SELECT * FROM %s WHERE ctid=%L', relname, ctid) into r;
          if r is not null then
            return next r;
          end if;
        exception -- bad tuple
          when others then
            get stacked diagnostics sql_state := RETURNED_SQLSTATE;
            get stacked diagnostics error := MESSAGE_TEXT;
            raise notice 'Skipping ctid %: %: %', ctid, sql_state, error;
        end;
      end loop;

    exception -- bad page
      when others then
        get stacked diagnostics sql_state := RETURNED_SQLSTATE;
        get stacked diagnostics error := MESSAGE_TEXT;
        raise notice 'Skipping page %: %: %', page, sql_state, error;
    end;

  end loop;
end;
$$ language plpgsql;

Since the function returns “record”, the table signature must be provided during the call:

postgres =# select * from read_table('t') as t(t text);
NOTICE:  Skipping ctid (0,1): XX000: missing chunk number 0 for toast value 192436 in pg_toast_192430
       t
───────────────
 one
 two
 three
...

An alternative variant writes the read data directly into a new table:

postgres =# select rescue_table('t');
NOTICE:  t: page 0 of 1
NOTICE:  Skipping ctid (0,1): XX000: missing chunk number 0 for toast value 192436 in pg_toast_192430
                                    rescue_table
─────────────────────────────────────────────────────────────────────────────────────
 rescue_table t into t_rescue: 0 of 1 pages are bad, 1 bad tuples, 100 tuples copied
(1 row)

The table t_rescue was created automatically.

create extension pageinspect;

create or replace function rescue_table(relname regclass, savename name default null, "create" boolean default true)
returns text
as $$
declare
  pages int;
  page int;
  ctid tid;
  row_count bigint;
  good_tuples bigint := 0;
  bad_pages bigint := 0;
  bad_tuples bigint := 0;
  sql_state text;
  error text;
begin
  if savename is null then
    savename := relname || '_rescue';
  end if;
  if rescue_table.create then
    execute format('CREATE TABLE %s (LIKE %s)', savename, relname);
  end if;

  select pg_relation_size(relname) / current_setting('block_size')::int into pages;

  for page in 0 .. pages-1 loop
    if page % 10000 = 0 then
      raise notice '%: page % of %', relname, page, pages;
    end if;

    begin

      for ctid in select t_ctid from heap_page_items(get_raw_page(relname::text, page)) loop
        begin
          execute format('INSERT INTO %s SELECT * FROM %s WHERE ctid=%L', savename, relname, ctid);
          get diagnostics row_count = ROW_COUNT;
          good_tuples := good_tuples + row_count;
        exception -- bad tuple
          when others then
            get stacked diagnostics sql_state := RETURNED_SQLSTATE;
            get stacked diagnostics error := MESSAGE_TEXT;
            raise notice 'Skipping ctid %: %: %', ctid, sql_state, error;
            bad_tuples := bad_tuples + 1;
        end;
      end loop;

    exception -- bad page
      when others then
        get stacked diagnostics sql_state := RETURNED_SQLSTATE;
        get stacked diagnostics error := MESSAGE_TEXT;
        raise notice 'Skipping page %: %: %', page, sql_state, error;
        bad_pages := bad_pages + 1;
    end;

  end loop;

  error := format('rescue_table %s into %s: %s of %s pages are bad, %s bad tuples, %s tuples copied',
    relname, savename, bad_pages, pages, bad_tuples, good_tuples);
  raise log '%', error;
  return error;
end;
$$ language plpgsql;

The SQL scripts are also available in the pg_dirtyread git repository.

Support

Should you require assistance with data recovery or the general use of PostgreSQL®, our PostgreSQL® Competence Center is at your disposal – 24 hours a day, 365 days a year, if desired.

We look forward to hearing from you.

 

This article was originally written by Christoph Berg.

Patroni is a clustering solution for PostgreSQL® that is getting more and more popular in the cloud and Kubernetes sector due to its operator pattern and integration with Etcd or Consul. Some time ago we wrote a blog post about the integration of Patroni into Debian. Recently, the vip-manager project which is closely related to Patroni has been uploaded to Debian by us. We will present vip-manager and how we integrated it into Debian in the following.

To recap, Patroni uses a distributed consensus store (DCS) for leader-election and failover. The current cluster leader periodically updates its leader-key in the DCS. As soon the key cannot be updated by Patroni for whatever reason it becomes stale. A new leader election is then initiated among the remaining cluster nodes.

PostgreSQL Client-Solutions for High-Availability

From the user’s point of view it needs to be ensured that the application is always connected to the leader, as no write transactions are possible on the read-only standbys. Conventional high-availability solutions like Pacemaker utilize virtual IPs (VIPs) that are moved to the primary node in the case of a failover.

For Patroni, such a mechanism did not exist so far. Usually, HAProxy (or a similar solution) is used which does periodic health-checks on each node’s Patroni REST-API and routes the client requests to the current leader.

An alternative is client-based failover (which is available since PostgreSQL 10), where all cluster members are configured in the client connection string. After a connection failure the client tries each remaining cluster member in turn until it reaches a new primary.

vip-manager

A new and comfortable approach to client failover is vip-manager. It is a service written in Go that gets started on all cluster nodes and connects to the DCS. If the local node owns the leader-key, vip-manager starts the configured VIP. In case of a failover, vip-manager removes the VIP on the old leader and the corresponding service on the new leader starts it there. The clients are configured for the VIP and will always connect to the cluster leader.

Debian-Integration of vip-manager

For Debian, the pg_createconfig_patroni program from the Patroni package has been adapted so that it can now create a vip-manager configuration:

pg_createconfig_patroni 11 test --vip=10.0.3.2

Similar to Patroni, we start the service for each instance:

systemctl start vip-manager@11-test

The output of patronictl shows that pg1 is the leader:

+---------+--------+------------+--------+---------+----+-----------+
| Cluster | Member |    Host    |  Role  |  State  | TL | Lag in MB |
+---------+--------+------------+--------+---------+----+-----------+
| 11-test |  pg1   | 10.0.3.247 | Leader | running |  1 |           |
| 11-test |  pg2   | 10.0.3.94  |        | running |  1 |         0 |
| 11-test |  pg3   | 10.0.3.214 |        | running |  1 |         0 |
+---------+--------+------------+--------+---------+----+-----------+

In journal of ‘pg1’ it can be seen that the VIP has been configured:

Jan 19 14:53:38 pg1 vip-manager[9314]: 2020/01/19 14:53:38 IP address 10.0.3.2/24 state is false, desired true
Jan 19 14:53:38 pg1 vip-manager[9314]: 2020/01/19 14:53:38 Configuring address 10.0.3.2/24 on eth0
Jan 19 14:53:38 pg1 vip-manager[9314]: 2020/01/19 14:53:38 IP address 10.0.3.2/24 state is true, desired true

If LXC containers are used, one can also see the VIP in the output of lxc-ls -f:

NAME    STATE   AUTOSTART GROUPS IPV4                 IPV6 UNPRIVILEGED
pg1     RUNNING 0         -      10.0.3.2, 10.0.3.247 -    false
pg2     RUNNING 0         -      10.0.3.94            -    false
pg3     RUNNING 0         -      10.0.3.214           -    false

The vip-manager packages are available for Debian testing (bullseye) and unstable, as well as for the upcoming 20.04 LTS Ubuntu release (focal) in the official repositories. For Debian stable (buster), as well as for Ubuntu 19.04 and 19.10, packages are available at apt.postgresql.org maintained by credativ, along with the updated Patroni packages with vip-manager integration.

Switchover Behaviour

In case of a planned switchover, e.g. pg2 becomes the new leader:

# patronictl -c /etc/patroni/11-test.yml switchover --master pg1 --candidate pg2 --force
Current cluster topology
+---------+--------+------------+--------+---------+----+-----------+
| Cluster | Member |    Host    |  Role  |  State  | TL | Lag in MB |
+---------+--------+------------+--------+---------+----+-----------+
| 11-test |  pg1   | 10.0.3.247 | Leader | running |  1 |           |
| 11-test |  pg2   | 10.0.3.94  |        | running |  1 |         0 |
| 11-test |  pg3   | 10.0.3.214 |        | running |  1 |         0 |
+---------+--------+------------+--------+---------+----+-----------+
2020-01-19 15:35:32.52642 Successfully switched over to "pg2"
+---------+--------+------------+--------+---------+----+-----------+
| Cluster | Member |    Host    |  Role  |  State  | TL | Lag in MB |
+---------+--------+------------+--------+---------+----+-----------+
| 11-test |  pg1   | 10.0.3.247 |        | stopped |    |   unknown |
| 11-test |  pg2   | 10.0.3.94  | Leader | running |  1 |           |
| 11-test |  pg3   | 10.0.3.214 |        | running |  1 |         0 |
+---------+--------+------------+--------+---------+----+-----------+

The VIP has now been moved to the new leader:

NAME    STATE   AUTOSTART GROUPS IPV4                 IPV6 UNPRIVILEGED
pg1     RUNNING 0         -      10.0.3.247          -    false
pg2     RUNNING 0         -      10.0.3.2, 10.0.3.94 -    false
pg3     RUNNING 0         -      10.0.3.214          -    false

This can also be seen in the journals, both from the old leader:

Jan 19 15:35:31 pg1 patroni[9222]: 2020-01-19 15:35:31,634 INFO: manual failover: demoting myself
Jan 19 15:35:31 pg1 patroni[9222]: 2020-01-19 15:35:31,854 INFO: Leader key released
Jan 19 15:35:32 pg1 vip-manager[9314]: 2020/01/19 15:35:32 IP address 10.0.3.2/24 state is true, desired false
Jan 19 15:35:32 pg1 vip-manager[9314]: 2020/01/19 15:35:32 Removing address 10.0.3.2/24 on eth0
Jan 19 15:35:32 pg1 vip-manager[9314]: 2020/01/19 15:35:32 IP address 10.0.3.2/24 state is false, desired false

As well as from the new leader pg2:

Jan 19 15:35:31 pg2 patroni[9229]: 2020-01-19 15:35:31,881 INFO: promoted self to leader by acquiring session lock
Jan 19 15:35:31 pg2 vip-manager[9292]: 2020/01/19 15:35:31 IP address 10.0.3.2/24 state is false, desired true
Jan 19 15:35:31 pg2 vip-manager[9292]: 2020/01/19 15:35:31 Configuring address 10.0.3.2/24 on eth0
Jan 19 15:35:31 pg2 vip-manager[9292]: 2020/01/19 15:35:31 IP address 10.0.3.2/24 state is true, desired true
Jan 19 15:35:32 pg2 patroni[9229]: 2020-01-19 15:35:32,923 INFO: Lock owner: pg2; I am pg2

As one can see, the VIP is moved within one second.

Updated Ansible Playbook

Our Ansible-Playbook for the automated setup of a three-node cluster on Debian has also been updated and can now configure a VIP if so desired:

# ansible-playbook -i inventory -e vip=10.0.3.2 patroni.yml

Questions and Help

Do you have any questions or need help? Feel free to write to info@credativ.com.