PostgreSQL® Archiv - Page 3 of 4 - credativ®

On June 27, it was that time again: we were able to attend PGConf.DE 2023. This year, the event was held at the Haus der Technik in Essen, marking the conference’s return to the Ruhr region for the first time since 2013. In addition to the usual excellent talks, catering, and atmosphere, a new visitor record was set with approximately 250 participants, making this year’s edition the largest German PostgreSQL event to date. Instaclustr was also represented for the first time this year as a gold sponsor with a booth. This allowed interested parties to learn about Instaclustr Managed Services and other offerings. Admission was at 8 a.m., and after setting up the booth, some initial small talk, and a welcome speech, the first talks started right on time at 9:10 a.m. The conference offered three parallel talks, allowing attendees to choose speakers and topics of interest in each slot. One of the speakers was our senior consultant, Michael Banck. He gave a presentation on secure PostgreSQL operation in accordance with BSI basic protection.

Michael Banck before his presentation: Secure PostgreSQL operation in accordance with BSI basic protection

Topics such as performance optimization, high availability (with Patroni, for example), and technical operations were among those on offer. There were also some very entertaining reports from the world of PostgreSQL consultants. My personal highlight was Laurenz Albe’s talk on data corruption. He presented the topic in a very practical and clear way and also showed very clear rules and ways to best deal with such a situation. PGConf.DE 2023 is a great event for training and networking. We are already looking forward to next year!

Moodle is a popular Open Source online learning platform. Especially since the beginning of the COVID-19 pandemic the importance of Moodle for schools and universities has further increased. In some states in Germany all schools had to switch to Moodle and other platforms like BigBlueButton in the course of a few days. This leads to scalability problems if suddenly several tens of thousands of pupils need to access Moodle.
Besides scaling the Moodle application itself, the database needs to be considered as well. One of the database options for Moodle is PostgreSQL. In this blog post, we present load-balancing options for Moodle using PostgreSQL.

High-Availability via Patroni

An online learning platform can be considered critical infrastructure from the point of view of the educational system and should be made highly available, in particular the database. A good solution for PostgreSQL is Patroni, we reported on its Debian-integration in the past.

In short, Patroni uses a distributed consensus store (DCS) to elect a leader from a typically 3-node cluster or initiate a failover and elect a new leader in the case of a leader failure, without entering a split-brain scenario. In addition, Patroni provides a REST API used for communication among nodes and from the patronictl program, e.g. to change the Postgres configuration online on all nodes or to initiate a switchover.

Client-solutions for high availability

From Moodle’s perspective, however, it must additionally be ensured that it is connected to the leader, otherwise no write transactions are possible. Traditional high-availability solutions such as Pacemaker use virtual IPs (VIPs) here, which are pivoted to the new primary node in the event of a failover. For Patroni there is the vip-manager project instead, which monitors the leader key in the DCS and sets or removes cluster VIP locally. This is also integrated into Debian as well.

An alternative is to use client-side failover based on PostgreSQL’s libpq library. For this, all cluster members are listed in the connection string and the connection option target_session_attrs=read-write is added. Configured this way, if a connection is broken, the client will try to reach the other nodes until a new primary is found.

Another option is HAProxy, a highly scalable TCP/HTTP load balancer. By performing periodic health checks on Patroni‘s REST API of each node, it can determine the current leader and forward client queries to it.

Moodle database configuration

Moodle’s connection to a PostgreSQL database is configured in config.php, e.g. for a simple stand-alone database:

$CFG->dbtype    = 'pgsql';
$CFG->dblibrary = 'native';
$CFG->dbhost    = '192.168.1.1';
$CFG->dbname    = 'moodle';
$CFG->dbuser    = 'moodle';
$CFG->dbpass    = 'moodle';
$CFG->prefix    = 'mdl_';
$CFG->dboptions = array (
  'dbport' => '',
  'dbsocket' => ''
);

The default port 5432 is used here.

If streaming replication is used, the standbys can additionally be defined as readonly and assigned to an own database user (which only needs read permissions):

$CFG->dboptions = array (
[...]
  'readonly' => [
    'instance' => [
      [
      'dbhost' => '192.168.1.2',
      'dbport' =>  '',
      'dbuser' => 'moodle_safereads',
      'dbpass' => 'moodle'
      ],
      [
      'dbhost' => '192.168.1.3',
      'dbport' =>  '',
      'dbuser' => 'moodle_safereads',
      'dbpass' => 'moodle'
      ]
    ]
  ]
);

Failover/load balancing with libpq

If a highly available Postgres cluster is used with Patroni, the primary, as described above, can be switched to prevent loss of data or shutdown of the system, in case of a failover or switchover incident. Moodle does not provide a way to set generic database options here and thus setting target_session_attrs=read-write directly is not possible. Therefore we developed a patch for this and implemented it in the Moodle tracker. This allows the additional option 'dbfailover' => 1, in the $CFG->dboptions array, which adds the necessary connection option target_session_attrs=read-write. A customized config.php would look like this:

$CFG->dbtype    = 'pgsql';
$CFG->dblibrary = 'native';
$CFG->dbhost    = '192.168.1.1,192.168.1.2,192.168.1.3';
$CFG->dbname    = 'moodle';
$CFG->dbuser    = 'moodle';
$CFG->dbpass    = 'moodle';
$CFG->prefix    = 'mdl_';
$CFG->dboptions = array (
  'dbfailover' => 1,
  'dbport' => '',
  'dbsocket' => '',
  'readonly' => [
    'instance' => [
      [
      'dbhost' => '192.168.1.1',
      'dbport' =>  '',
      'dbuser' => 'moodle_safereads',
      'dbpass' => 'moodle'
      ],
      [
      'dbhost' => '192.168.1.2',
      'dbport' =>  '',
      'dbuser' => 'moodle_safereads',
      'dbpass' => 'moodle'
      ],
      [
      'dbhost' => '192.168.1.3',
      'dbport' =>  '',
      'dbuser' => 'moodle_safereads',
      'dbpass' => 'moodle'
      ]
    ]
  ]
);

Failover/load balancing with HAProxy

If HAProxy is to be used instead, then $CFG->dbhost must be set to the HAProxy host e.g. 127.0.0.1 in case HAProxy is running locally on the Moodle server(s). Moreover a second port (e.g. 65432) can be defined for read queries, which is configured as readonly in $CFG->dboptions, same as the streaming replication standby above. The config.php would then look like this:

$CFG->dbtype    = 'pgsql';
$CFG->dblibrary = 'native';
$CFG->dbhost    = '127.0.0.1';
$CFG->dbname    = 'moodle';
$CFG->dbuser    = 'moodle';
$CFG->dbpass    = 'moodle';
$CFG->prefix    = 'mdl_';
$CFG->dboptions = array (
  'dbport' => '',
  'dbsocket' => '',
  'readonly' => [
    'instance' => [
      'dbhost' => '127.0.0.1',
      'dbport' =>  '65432',
      'dbuser' => 'moodle_safereads',
      'dbpass' => 'moodle'
    ]
  ]
);

The HAProxy configuration file haproxy.cfg can look like the following example:

global
    maxconn 100

defaults
    log global
    mode tcp
    retries 2
    timeout client 30m
    timeout connect 4s
    timeout server 30m
    timeout check 5s

listen stats
    mode http
    bind *:7000
    stats enable
    stats uri /

listen postgres_write
    bind *:5432
    mode tcp
    option httpchk
    http-check expect status 200
    default-server inter 3s fall 3 rise 3 on-marked-down shutdown-sessions
    server pg1 192.168.1.1:5432 maxconn 100 check port 8008
    server pg2 192.168.1.2:5432 maxconn 100 check port 8008
    server pg3 192.168.1.3:5432 maxconn 100 check port 8008

HAProxy expects incoming write connections (postgres_write) on port 5432 and forwards them to port 5432 of the cluster members. The primary is determined by an HTTP check on port 8008 (the default Patroni REST API port); Patroni returns status 200 here for the primary and status 503 for standbys.

For read queries (postgres_read), it must be decided whether the primary should also serve read-only queries or not. If this is the case, a simple Postgres check (pgsql-check) can be used; however, this may lead to entries in the PostgreSQL log regarding incorrect or incomplete logins:

listen postgres_read
    bind *:65432
    mode tcp
    balance leastconn
    option pgsql-check user haproxy
    default-server inter 3s fall 3 rise 3 on-marked-down shutdown-sessions
    server pg1 192.168.1.1:5432 check
    server pg2 192.168.1.2:5432 check
    server pg3 192.168.1.3:5432 check

If you don’t want the primary to participate in the read scaling you can simply use the same HTTP check as in the postgres_write section, this time expecting HTTP status 503:

listen postgres_read
    bind *:65432
    mode tcp
    balance leastconn
    option httpchk
    http-check expect status 503
    default-server inter 3s fall 3 rise 3 on-marked-down shutdown-sessions
    server pg1 192.168.1.1:5432 check port 8008
    server pg2 192.168.1.2:5432 check port 8008
    server pg3 192.168.1.3:5432 check port 8008

Revised Ansible playbook

HAProxy support has also been implemented in version 0.3 of our Ansible playbooks for automated setup of a three-node PostgreSQL Patroni cluster on Debian. The new variable haproxy_primary_read_scale can be used to decide whether HAProxy should also issue requests on the read-only port to the primary node or only to the followers.

We are happy to help!

Whether it’s PostgreSQL, Patroni, HAProxy, Moodle, or any other open source software; with over 22+ years of development and service experience in the open source space, credativ GmbH can assist you with unparalleled and individually customizable support. We are there to help and assist you in all your open source infrastructure needs – if desired 24 hours a day, 365 days a year!

We look forward to hearing from you.

SQLreduce: Reduce verbose SQL queries to minimal examples

Developers often face very large SQL queries that raise some errors. SQLreduce is a tool to reduce that complexity to a minimal query.

SQLsmith generates random SQL queries

SQLsmith is a tool that generates random SQL queries and runs them against a PostgreSQL server (and other DBMS types). The idea is that by fuzz-testing the query parser and executor, corner-case bugs can be found that would otherwise go unnoticed in manual testing or with the fixed set of test cases in PostgreSQL’s regression test suite. It has proven to be an effective tool with over 100 bugs found in different areas in the PostgreSQL server and other products since 2015, including security bugs, ranging from executor bugs to segfaults in type and index method implementations. For example, in 2018, SQLsmith found that the following query triggered a segfault in PostgreSQL:

select
  case when pg_catalog.lastval() < pg_catalog.pg_stat_get_bgwriter_maxwritten_clean() then case when pg_catalog.circle_sub_pt(
          cast(cast(null as circle) as circle),
          cast((select location from public.emp limit 1 offset 13)
             as point)) ~ cast(nullif(case when cast(null as box) &> (select boxcol from public.brintest limit 1 offset 2)
                 then (select f1 from public.circle_tbl limit 1 offset 4)
               else (select f1 from public.circle_tbl limit 1 offset 4)
               end,
          case when (select pg_catalog.max(class) from public.f_star)
                 ~~ ref_0.c then cast(null as circle) else cast(null as circle) end
            ) as circle) then ref_0.a else ref_0.a end
       else case when pg_catalog.circle_sub_pt(
          cast(cast(null as circle) as circle),
          cast((select location from public.emp limit 1 offset 13)
             as point)) ~ cast(nullif(case when cast(null as box) &> (select boxcol from public.brintest limit 1 offset 2)
                 then (select f1 from public.circle_tbl limit 1 offset 4)
               else (select f1 from public.circle_tbl limit 1 offset 4)
               end,
          case when (select pg_catalog.max(class) from public.f_star)
                 ~~ ref_0.c then cast(null as circle) else cast(null as circle) end
            ) as circle) then ref_0.a else ref_0.a end
       end as c0,
  case when (select intervalcol from public.brintest limit 1 offset 1)
         >= cast(null as "interval") then case when ((select pg_catalog.max(roomno) from public.room)
             !~~ ref_0.c)
        and (cast(null as xid) <> 100) then ref_0.b else ref_0.b end
       else case when ((select pg_catalog.max(roomno) from public.room)
             !~~ ref_0.c)
        and (cast(null as xid) <> 100) then ref_0.b else ref_0.b end
       end as c1,
  ref_0.a as c2,
  (select a from public.idxpart1 limit 1 offset 5) as c3,
  ref_0.b as c4,
    pg_catalog.stddev(
      cast((select pg_catalog.sum(float4col) from public.brintest)
         as float4)) over (partition by ref_0.a,ref_0.b,ref_0.c order by ref_0.b) as c5,
  cast(nullif(ref_0.b, ref_0.a) as int4) as c6, ref_0.b as c7, ref_0.c as c8
from
  public.mlparted3 as ref_0
where true;

However, just like in this 40-line, 2.2kB example, the random queries generated by SQLsmith that trigger some error are most often very large and contain a lot of noise that does not contribute to the error. So far, manual inspection of the query and tedious editing was required to reduce the example to a minimal reproducer that developers can use to fix the problem.

Reduce complexity with SQLreduce

This issue is solved by SQLreduce. SQLreduce takes as input an arbitrary SQL query which is then run against a PostgreSQL server. Various simplification steps are applied, checking after each step that the simplified query still triggers the same error from PostgreSQL. The end result is a SQL query with minimal complexity.

SQLreduce is effective at reducing the queries from original error reports from SQLsmith to queries that match manually-reduced queries. For example, SQLreduce can effectively reduce the above monster query to just this:

SELECT pg_catalog.stddev(NULL) OVER () AS c5 FROM public.mlparted3 AS ref_0

Note that SQLreduce does not try to derive a query that is semantically identical to the original, or produces the same query result – the input is assumed to be faulty, and we are looking for the minimal query that produces the same error message from PostgreSQL when run against a database. If the input query happens to produce no error, the minimal query output by SQLreduce will just be SELECT.

How it works

We’ll use a simpler query to demonstrate how SQLreduce works and which steps are taken to remove noise from the input. The query is bogus and contains a bit of clutter that we want to remove:

$ psql -c 'select pg_database.reltuples / 1000 from pg_database, pg_class where 0 < pg_database.reltuples / 1000 order by 1 desc limit 10'
ERROR:  column pg_database.reltuples does not exist

Let’s pass the query to SQLreduce:

$ sqlreduce 'select pg_database.reltuples / 1000 from pg_database, pg_class where 0 < pg_database.reltuples / 1000 order by 1 desc limit 10'

SQLreduce starts by parsing the input using pglast and libpg_query which expose the original PostgreSQL parser as a library with Python bindings. The result is a parse tree that is the basis for the next steps. The parse tree looks like this:

selectStmt
├── targetList
│   └── /
│       ├── pg_database.reltuples
│       └── 1000
├── fromClause
│   ├── pg_database
│   └── pg_class
├── whereClause
│   └── <
│       ├── 0
│       └── /
│           ├── pg_database.reltuples
│           └── 1000
├── orderClause
│   └── 1
└── limitCount
    └── 10

Pglast also contains a query renderer that can render back the parse tree as SQL, shown as the regenerated query below. The input query is run against PostgreSQL to determine the result, in this case ERROR: column pg_database.reltuples does not exist.

Input query: select pg_database.reltuples / 1000 from pg_database, pg_class where 0 < pg_database.reltuples / 1000 order by 1 desc limit 10
Regenerated: SELECT pg_database.reltuples / 1000 FROM pg_database, pg_class WHERE 0 < ((pg_database.reltuples / 1000)) ORDER BY 1 DESC LIMIT 10
Query returns: ✔ ERROR:  column pg_database.reltuples does not exist

SQLreduce works by deriving new parse trees that are structurally simpler, generating SQL from that, and run these queries against the database. The first simplification steps work on the top level node, where SQLreduce tries to remove whole subtrees to quickly find a result. The first reduction tried is to remove LIMIT 10:

SELECT pg_database.reltuples / 1000 FROM pg_database, pg_class WHERE 0 < ((pg_database.reltuples / 1000)) ORDER BY 1 DESC ✔

The query result is still ERROR: column pg_database.reltuples does not exist, indicated by a ✔ check mark. Next, ORDER BY 1 is removed, again successfully:

SELECT pg_database.reltuples / 1000 FROM pg_database, pg_class WHERE 0 < ((pg_database.reltuples / 1000)) ✔

Now the entire target list is removed:

SELECT FROM pg_database, pg_class WHERE 0 < ((pg_database.reltuples / 1000)) ✔

This shorter query is still equivalent to the original regarding the error message returned when it is run against the database. Now the first unsuccessful reduction step is tried, removing the entire FROM clause:

SELECT WHERE 0 < ((pg_database.reltuples / 1000)) ✘ ERROR:  missing FROM-clause entry for table "pg_database"

That query is also faulty, but triggers a different error message, so the previous parse tree is kept for the next steps. Again a whole subtree is removed, now the WHERE clause:

SELECT FROM pg_database, pg_class ✘ no error

We have now reduced the input query so much that it doesn’t error out any more. The previous parse tree is still kept which now looks like this:

selectStmt
├── fromClause
│   ├── pg_database
│   └── pg_class
└── whereClause
    └── <
        ├── 0
        └── /
            ├── pg_database.reltuples
            └── 1000

Now SQLreduce starts digging into the tree. There are several entries in the FROM clause, so it tries to shorten the list. First, pg_database is removed, but that doesn’t work, so pg_class is removed:

SELECT FROM pg_class WHERE 0 < ((pg_database.reltuples / 1000)) ✘ ERROR:  missing FROM-clause entry for table "pg_database"
SELECT FROM pg_database WHERE 0 < ((pg_database.reltuples / 1000)) ✔

Since we have found a new minimal query, recursion restarts at top-level with another try to remove the WHERE clause. Since that doesn’t work, it tries to replace the expression with NULL, but that doesn’t work either.

SELECT FROM pg_database ✘ no error
SELECT FROM pg_database WHERE NULL ✘ no error

Now a new kind of step is tried: expression pull-up. We descend into WHERE clause, where we replace A < B first by A and then by B.

SELECT FROM pg_database WHERE 0 ✘ ERROR:  argument of WHERE must be type boolean, not type integer
SELECT FROM pg_database WHERE pg_database.reltuples / 1000 ✔
SELECT WHERE pg_database.reltuples / 1000 ✘ ERROR:  missing FROM-clause entry for table "pg_database"

The first try did not work, but the second one did. Since we simplified the query, we restart at top-level to check if the FROM clause can be removed, but it is still required.

From A / B, we can again pull up A:

SELECT FROM pg_database WHERE pg_database.reltuples ✔
SELECT WHERE pg_database.reltuples ✘ ERROR:  missing FROM-clause entry for table "pg_database"

SQLreduce has found the minimal query that still raises ERROR: column pg_database.reltuples does not exist with this parse tree:

selectStmt
├── fromClause
│   └── pg_database
└── whereClause
    └── pg_database.reltuples

At the end of the run, the query is printed along with some statistics:

Minimal query yielding the same error:
SELECT FROM pg_database WHERE pg_database.reltuples

Pretty-printed minimal query:
SELECT
FROM pg_database
WHERE pg_database.reltuples

Seen: 15 items, 915 Bytes
Iterations: 19
Runtime: 0.107 s, 139.7 q/s

This minimal query can now be inspected to fix the bug in PostgreSQL or in the application.

About credativ

The credativ GmbH is a manufacturer-independent consulting and service company located in Moenchengladbach, Germany. With over 22+ years of development and service experience in the open source space, credativ GmbH can assist you with unparalleled and individually customizable support. We are here to help and assist you in all your open source infrastructure needs.

This article was initially written by Christoph Berg.

pg_dirtyread

I recently updated pg_dirtyread to work with PostgreSQL 14. pg_dirtyread is a PostgreSQL extension that allows reading “dead” rows from tables, i.e., rows that have already been deleted or updated. Of course, this only works if the table has not yet been cleaned by a VACUUM command or autovacuum, PostgreSQL’s garbage collection.

Here is an example of pg_dirtyread in action:

# create table foo (id int, t text);
CREATE TABLE
# insert into foo values (1, 'Doc1');
INSERT 0 1
# insert into foo values (2, 'Doc2');
INSERT 0 1
# insert into foo values (3, 'Doc3');
INSERT 0 1

# select * from foo;
 id │ t
────┼──────
 1 │ Doc1
 2 │ Doc2
 3 │ Doc3
(3 rows)

# delete from foo where id < 3;
DELETE 2

# select * from foo;
 id │ t
────┼──────
  3 │ Doc3
(1 row)

Oops! The first two documents are gone.

Let’s now use pg_dirtyread to view the table:

# create extension pg_dirtyread;
CREATE EXTENSION

# select * from pg_dirtyread('foo') t(id int, t text);
  id │ t
────┼──────
  1 │ Doc1
  2 │ Doc2
  3 │ Doc3

All three documents are still present, but only one of them is visible.

pg_dirtyread can also display the system columns of PostgreSQL with information about the position and visibility of the rows. For the first two documents, xmax is set, which means that the row has been deleted:

# select * from pg_dirtyread('foo') t(ctid tid, xmin xid, xmax xid, id int, t text);

 ctid │ xmin │ xmax │ id │ t
───────┼──────┼──────┼────┼──────
 (0,1) │ 1577 │ 1580 │ 1 │ Doc1
 (0,2) │ 1578 │ 1580 │ 2 │ Doc2
 (0,3) │ 1579 │ 0 │ 3 │ Doc3
(3 rows)

Undelete

Caveat: I cannot promise that any of the ideas listed below will work in practice. There are some uncertainties and a good deal of complicated knowledge about PostgreSQL internals may be required to succeed. I would ask you to consider consulting your preferred PostgreSQL service provider for advice before attempting to restore data on a production system. Do not try this at work!

I always had plans to extend pg_dirtyread with an “Undelete” command to make deleted rows reappear, but unfortunately, I never got around to trying it out in the past. However, rows can already be restored by using the output of pg_dirtyread itself:

# insert into foo select * from pg_dirtyread('foo') t(id int, t text) where id = 1;

However, this is not a real “Undelete” or “undo” – it only inserts new rows from the data read from the table.

pg_surgery

Let’s welcome pg_surgery, a new PostgreSQL extension that comes with PostgreSQL 14. It contains two functions for “surgery on a broken relationship”. As a side effect, they can also make deleted tuples reappear.

As I have now discovered, one of the functions, heap_force_freeze(), works well with pg_dirtyread. It takes a list of ctids (row positions) that it marks as “frozen” but also as “not deleted”.

Let’s apply it to our test table using the ctids that pg_dirtyread can read:

# create extension pg_surgery;
CREATE EXTENSION

# select heap_force_freeze('foo', array_agg(ctid))
    from pg_dirtyread('foo') t(ctid tid, xmin xid, xmax xid, id int, t text) where id = 1;
 heap_force_freeze
───────────────────

(1 row)

And voilà, our deleted document is back:

# select * from foo;
  id │ t
────┼──────
  1 │ Doc1
  3 │ Doc3
(2 rows)

#select * from pg_dirtyread('foo') t(ctid tid, xmin xid, xmax xid, id int, t text);
  ctid │ xmin │ xmax │ id │ t
───────┼──────┼──────┼────┼──────
 (0,1) │ 2 │ 0 │ 1 │ Doc1
  (0,2) │ 1578 │ 1580 │ 2 │ Doc2
  (0,3) │ 1579 │ 0 │ 3 │ Doc3
(3 rows)

Disclaimer

Most importantly, none of the above methods will work if the data you just deleted has already been cleaned by VACUUM or Autovacuum. This actively deletes the recovered storage space. Restore your data from the backup.

Since both pg_dirtyread and pg_surgery operate outside of the normal PostgreSQL MVCC machinery, it is easy to create corrupt data with them. This includes duplicate rows, duplicate primary key values, indexes that are not synchronized with the tables, broken foreign key constraints, and others. You have been warned.

Furthermore, pg_dirtyread does not (yet) work if the deleted rows contain any toasted values. Possible other approaches are the use of pageinspect and pg_filedump to get the ctids of the deleted rows. Please make sure you have working backups and do not need any of the above methods.

We are Happy to Support You

The necessity of the above operations can be completely prevented in most cases by professional administration or setup of the PostgreSQL infrastructure by a competent service provider. If you are currently facing one of these problems and are looking for ways out of a predicament, please contact us by email at info@credativ.de or via the contact form.

With over 22+ years of development and service experience in the PostgreSQL and Open Source area, credativ GmbH can professionally accompany you with an unprecedented and individually configurable support and fully support you in all questions regarding your Open Source infrastructure.

 

This article was originally written by Christoph Berg

Also Worth Reading

Congratulations to the Debian Community

The Debian Project just released version 11 (aka “bullseye”) of their free operating system. In total, over 6,208 contributors worked on this release and were indispensable in making this launch happen. We would like to thank everyone involved for their combined efforts, hard work, and many hours pent in recent years building this new release that will benefit the entire open source community.

We would also like to acknowledge our in-house Debian developers who contributed to this effort. We really appreciate the work you do on behalf of the community and stand firmly behind your contributions.

What’s New in Debian 11 Bullseye

Debian 11 comes with a number of meaningful changes and enhancements. The new release includes over 13,370 new software packages, for a total of over 57,703 packages on release. Out of these, 35,532 packages have been updated to newer versions, including an update in the kernel from 4.19 in “buster” to 5.10 in bullseye.

Bullseye expands on the capabilities of driverless printing with Common Unix Printing System (CUPS) and driverless scanning with Scanner Access Now Easy (SANE). While it was possible to use CUPS for driverless printing with buster, bullseye comes with the package ipp-usb, which allows a USB device to be treated as a network device and thus extend driverless printing capabilities. SANE connects to this when set up correctly and connected to a USB port.

As in previous releases, Debian 11 comes with a Debian Edu / Skolelinux version. Debian Edu has been a complete solution for schools for many years. Debian Edu can provide the entire network for a school and then only users and machines need to be added after installation. This can also be easily managed via the web interface GOsa².

Debian 11 bullseye can be downloaded here.
https://www.debian.org/devel/debian-installer/index.en.html

For more information and greater technical detail on the new Debian 11 release, please refer to the official release notes on Debian.org

https://www.debian.org/releases/bullseye/amd64/release-notes/.

Contributions by Instaclustr Employees

Our Debian roots run deep here. credativ, which was acquired by Instaclustr in March 2021, has always been an active part of the Debian community and visited every DebConf since 2004. Debian also serves as the operating system at the heart of the Instaclustr Managed Platform.

For the release of Debian 11, our team has taken over various responsibilities in the community. Our contributions include:

Many of our colleagues have made significant contributions to the current release, including:

How to Upgrade

Given that Debian 11 bullseye is a major release, we suggest that everyone running on Debian 10 buster upgrade. The main steps for an upgrade include:

  1. Make sure to backup any data that should not get lost and prepare for recovery
  2. Remove non-Debian packages and clean up leftover files and old versions
  3. Upgrade to latest point release
  4. Check and prepare your APT source-list files by adding the relevant Internet sources or local mirrors
  5. Upgrade your packages and then upgrade your system

You can find a more detailed walkthrough of the upgrade process in the Debian documentation.

All existing credativ customers who are running a Debian-based installation are naturally covered by our service and support and are encouraged to reach out.

If you are interested in upgrading from your old Debian version, or if you have questions with regards to your Debian infrastructure, do not hesitate to drop us an email or contact us at info@credativ.de.

Or, you can get started in minutes with any one of these open source technologies like Apache Cassandra, Apache Kafka, Redis, and OpenSearch on the Instaclustr Managed Platform. Sign up for a free trial today.

The apt.postgresql.org repository originally started with the two architectures amd64 and i386 (64- and 32-bit x64). In September 2016, ppc64el (POWER) was added. Over time, there have been repeated requests as to whether we might also support “arm”, which usually meant Raspberry Pi. However, these are mostly only 32-bit, and the widespread “armhf” Raspbian port is unfortunately only ARM6, an older hardware version.

Through HUAWEI Cloud Services, an “arm64” build machine has now been made available to the PostgreSQL® community, which is a modern processor architecture that is also suitable for PostgreSQL® servers. We then set up the machine and expanded the apt.postgresql.org repository to include this architecture. For the supported distributions, we chose Debian buster (stable), bullseye (testing) and sid (unstable) as well as Ubuntu bionic (18.04) and focal (20.04).

The build machine is very powerful. We built all the packages for the new architecture in just a few days. Very few special arm-specific problems occurred, which speaks for the stability of the Linux port on this architecture.

There is nothing standing in the way of using PostgreSQL® on arm64, on Debian or Ubuntu.

In parallel, we have expanded the repository to include support for the new Ubuntu LTS version: focal (20.04).
This distribution can now be used with support for five years until April 2025.

The credativ PostgreSQL® Competence Center is of course always happy to answer questions about the use of PostgreSQL® on arm and other architectures on Debian, Ubuntu and other operating systems. Please contact us!

This article was originally written by Christoph Berg.

PostgreSQL® is an extremely robust database, to which most of our clients also entrust their data. However, if errors do occur, they are usually due to the storage system, where individual bits or bytes flip, or entire blocks become corrupted. We demonstrate how to recover data from corrupt tables.

In case of an error, the user is confronted with messages that originate from the storage layer or other PostgreSQL® subsystems:

postgres=# select * from t;
ERROR:  missing chunk number 0 for toast value 192436 in pg_toast_192430

postgres=# select * from a;
ERROR:  invalid memory alloc request size 18446744073709551613

If only individual tuples are corrupt, one can partially help by reading them out individually, e.g., according to id, which often does not help further:

select * from t where id = 1;

It is more promising to address the tuples directly by their internal tuple ID, called in PostgreSQL® ctid:

select * from t where ctid = '(0,1)';

To read out all tuples, we use a loop in plpgsql:

for page in 0 .. pages-1 loop
  for item in 1 .. ??? loop
     select * from t where ctid = '('||page||','||item||')' into r;
     return next r;
  end loop;
end loop;

We still need the number of pages in the table, which we get from pg_relation_size(), and the number of tuples on the page, for which we utilize the pageinspect extension.

select pg_relation_size(relname) / current_setting('block_size')::int into pages;

for page in 0 .. pages-1 loop
  for item in select t_ctid from heap_page_items(get_raw_page(relname::text, page)) loop
    SELECT * FROM t WHERE ctid=item into r;
    if r is not null then
      return next r;
    end if;
  end loop;
end loop;

Now comes the most important part: Accessing the damaged tuples or pages causes errors that we must catch with a begin..exception..end block. We pass the error messages to the user as . Furthermore, the function should not only work for one table but also receive a parameter . The entire plpgsql function then looks like this:

create extension pageinspect;

create or replace function read_table(relname regclass)
returns setof record
as $$
declare
  pages int;
  page int;
  ctid tid;
  r record;
  sql_state text;
  error text;
begin
  select pg_relation_size(relname) / current_setting('block_size')::int into pages;

  for page in 0 .. pages-1 loop

    begin

      for ctid in select t_ctid from heap_page_items(get_raw_page(relname::text, page)) loop
        begin
          execute format('SELECT * FROM %s WHERE ctid=%L', relname, ctid) into r;
          if r is not null then
            return next r;
          end if;
        exception -- bad tuple
          when others then
            get stacked diagnostics sql_state := RETURNED_SQLSTATE;
            get stacked diagnostics error := MESSAGE_TEXT;
            raise notice 'Skipping ctid %: %: %', ctid, sql_state, error;
        end;
      end loop;

    exception -- bad page
      when others then
        get stacked diagnostics sql_state := RETURNED_SQLSTATE;
        get stacked diagnostics error := MESSAGE_TEXT;
        raise notice 'Skipping page %: %: %', page, sql_state, error;
    end;

  end loop;
end;
$$ language plpgsql;

Since the function returns “record”, the table signature must be provided during the call:

postgres =# select * from read_table('t') as t(t text);
NOTICE:  Skipping ctid (0,1): XX000: missing chunk number 0 for toast value 192436 in pg_toast_192430
       t
───────────────
 one
 two
 three
...

An alternative variant writes the read data directly into a new table:

postgres =# select rescue_table('t');
NOTICE:  t: page 0 of 1
NOTICE:  Skipping ctid (0,1): XX000: missing chunk number 0 for toast value 192436 in pg_toast_192430
                                    rescue_table
─────────────────────────────────────────────────────────────────────────────────────
 rescue_table t into t_rescue: 0 of 1 pages are bad, 1 bad tuples, 100 tuples copied
(1 row)

The table t_rescue was created automatically.

create extension pageinspect;

create or replace function rescue_table(relname regclass, savename name default null, "create" boolean default true)
returns text
as $$
declare
  pages int;
  page int;
  ctid tid;
  row_count bigint;
  good_tuples bigint := 0;
  bad_pages bigint := 0;
  bad_tuples bigint := 0;
  sql_state text;
  error text;
begin
  if savename is null then
    savename := relname || '_rescue';
  end if;
  if rescue_table.create then
    execute format('CREATE TABLE %s (LIKE %s)', savename, relname);
  end if;

  select pg_relation_size(relname) / current_setting('block_size')::int into pages;

  for page in 0 .. pages-1 loop
    if page % 10000 = 0 then
      raise notice '%: page % of %', relname, page, pages;
    end if;

    begin

      for ctid in select t_ctid from heap_page_items(get_raw_page(relname::text, page)) loop
        begin
          execute format('INSERT INTO %s SELECT * FROM %s WHERE ctid=%L', savename, relname, ctid);
          get diagnostics row_count = ROW_COUNT;
          good_tuples := good_tuples + row_count;
        exception -- bad tuple
          when others then
            get stacked diagnostics sql_state := RETURNED_SQLSTATE;
            get stacked diagnostics error := MESSAGE_TEXT;
            raise notice 'Skipping ctid %: %: %', ctid, sql_state, error;
            bad_tuples := bad_tuples + 1;
        end;
      end loop;

    exception -- bad page
      when others then
        get stacked diagnostics sql_state := RETURNED_SQLSTATE;
        get stacked diagnostics error := MESSAGE_TEXT;
        raise notice 'Skipping page %: %: %', page, sql_state, error;
        bad_pages := bad_pages + 1;
    end;

  end loop;

  error := format('rescue_table %s into %s: %s of %s pages are bad, %s bad tuples, %s tuples copied',
    relname, savename, bad_pages, pages, bad_tuples, good_tuples);
  raise log '%', error;
  return error;
end;
$$ language plpgsql;

The SQL scripts are also available in the pg_dirtyread git repository.

Support

Should you require assistance with data recovery or the general use of PostgreSQL®, our PostgreSQL® Competence Center is at your disposal – 24 hours a day, 365 days a year, if desired.

We look forward to hearing from you.

 

This article was originally written by Christoph Berg.

Patroni is a clustering solution for PostgreSQL® that is getting more and more popular in the cloud and Kubernetes sector due to its operator pattern and integration with Etcd or Consul. Some time ago we wrote a blog post about the integration of Patroni into Debian. Recently, the vip-manager project which is closely related to Patroni has been uploaded to Debian by us. We will present vip-manager and how we integrated it into Debian in the following.

To recap, Patroni uses a distributed consensus store (DCS) for leader-election and failover. The current cluster leader periodically updates its leader-key in the DCS. As soon the key cannot be updated by Patroni for whatever reason it becomes stale. A new leader election is then initiated among the remaining cluster nodes.

PostgreSQL Client-Solutions for High-Availability

From the user’s point of view it needs to be ensured that the application is always connected to the leader, as no write transactions are possible on the read-only standbys. Conventional high-availability solutions like Pacemaker utilize virtual IPs (VIPs) that are moved to the primary node in the case of a failover.

For Patroni, such a mechanism did not exist so far. Usually, HAProxy (or a similar solution) is used which does periodic health-checks on each node’s Patroni REST-API and routes the client requests to the current leader.

An alternative is client-based failover (which is available since PostgreSQL 10), where all cluster members are configured in the client connection string. After a connection failure the client tries each remaining cluster member in turn until it reaches a new primary.

vip-manager

A new and comfortable approach to client failover is vip-manager. It is a service written in Go that gets started on all cluster nodes and connects to the DCS. If the local node owns the leader-key, vip-manager starts the configured VIP. In case of a failover, vip-manager removes the VIP on the old leader and the corresponding service on the new leader starts it there. The clients are configured for the VIP and will always connect to the cluster leader.

Debian-Integration of vip-manager

For Debian, the pg_createconfig_patroni program from the Patroni package has been adapted so that it can now create a vip-manager configuration:

pg_createconfig_patroni 11 test --vip=10.0.3.2

Similar to Patroni, we start the service for each instance:

systemctl start vip-manager@11-test

The output of patronictl shows that pg1 is the leader:

+---------+--------+------------+--------+---------+----+-----------+
| Cluster | Member |    Host    |  Role  |  State  | TL | Lag in MB |
+---------+--------+------------+--------+---------+----+-----------+
| 11-test |  pg1   | 10.0.3.247 | Leader | running |  1 |           |
| 11-test |  pg2   | 10.0.3.94  |        | running |  1 |         0 |
| 11-test |  pg3   | 10.0.3.214 |        | running |  1 |         0 |
+---------+--------+------------+--------+---------+----+-----------+

In journal of ‘pg1’ it can be seen that the VIP has been configured:

Jan 19 14:53:38 pg1 vip-manager[9314]: 2020/01/19 14:53:38 IP address 10.0.3.2/24 state is false, desired true
Jan 19 14:53:38 pg1 vip-manager[9314]: 2020/01/19 14:53:38 Configuring address 10.0.3.2/24 on eth0
Jan 19 14:53:38 pg1 vip-manager[9314]: 2020/01/19 14:53:38 IP address 10.0.3.2/24 state is true, desired true

If LXC containers are used, one can also see the VIP in the output of lxc-ls -f:

NAME    STATE   AUTOSTART GROUPS IPV4                 IPV6 UNPRIVILEGED
pg1     RUNNING 0         -      10.0.3.2, 10.0.3.247 -    false
pg2     RUNNING 0         -      10.0.3.94            -    false
pg3     RUNNING 0         -      10.0.3.214           -    false

The vip-manager packages are available for Debian testing (bullseye) and unstable, as well as for the upcoming 20.04 LTS Ubuntu release (focal) in the official repositories. For Debian stable (buster), as well as for Ubuntu 19.04 and 19.10, packages are available at apt.postgresql.org maintained by credativ, along with the updated Patroni packages with vip-manager integration.

Switchover Behaviour

In case of a planned switchover, e.g. pg2 becomes the new leader:

# patronictl -c /etc/patroni/11-test.yml switchover --master pg1 --candidate pg2 --force
Current cluster topology
+---------+--------+------------+--------+---------+----+-----------+
| Cluster | Member |    Host    |  Role  |  State  | TL | Lag in MB |
+---------+--------+------------+--------+---------+----+-----------+
| 11-test |  pg1   | 10.0.3.247 | Leader | running |  1 |           |
| 11-test |  pg2   | 10.0.3.94  |        | running |  1 |         0 |
| 11-test |  pg3   | 10.0.3.214 |        | running |  1 |         0 |
+---------+--------+------------+--------+---------+----+-----------+
2020-01-19 15:35:32.52642 Successfully switched over to "pg2"
+---------+--------+------------+--------+---------+----+-----------+
| Cluster | Member |    Host    |  Role  |  State  | TL | Lag in MB |
+---------+--------+------------+--------+---------+----+-----------+
| 11-test |  pg1   | 10.0.3.247 |        | stopped |    |   unknown |
| 11-test |  pg2   | 10.0.3.94  | Leader | running |  1 |           |
| 11-test |  pg3   | 10.0.3.214 |        | running |  1 |         0 |
+---------+--------+------------+--------+---------+----+-----------+

The VIP has now been moved to the new leader:

NAME    STATE   AUTOSTART GROUPS IPV4                 IPV6 UNPRIVILEGED
pg1     RUNNING 0         -      10.0.3.247          -    false
pg2     RUNNING 0         -      10.0.3.2, 10.0.3.94 -    false
pg3     RUNNING 0         -      10.0.3.214          -    false

This can also be seen in the journals, both from the old leader:

Jan 19 15:35:31 pg1 patroni[9222]: 2020-01-19 15:35:31,634 INFO: manual failover: demoting myself
Jan 19 15:35:31 pg1 patroni[9222]: 2020-01-19 15:35:31,854 INFO: Leader key released
Jan 19 15:35:32 pg1 vip-manager[9314]: 2020/01/19 15:35:32 IP address 10.0.3.2/24 state is true, desired false
Jan 19 15:35:32 pg1 vip-manager[9314]: 2020/01/19 15:35:32 Removing address 10.0.3.2/24 on eth0
Jan 19 15:35:32 pg1 vip-manager[9314]: 2020/01/19 15:35:32 IP address 10.0.3.2/24 state is false, desired false

As well as from the new leader pg2:

Jan 19 15:35:31 pg2 patroni[9229]: 2020-01-19 15:35:31,881 INFO: promoted self to leader by acquiring session lock
Jan 19 15:35:31 pg2 vip-manager[9292]: 2020/01/19 15:35:31 IP address 10.0.3.2/24 state is false, desired true
Jan 19 15:35:31 pg2 vip-manager[9292]: 2020/01/19 15:35:31 Configuring address 10.0.3.2/24 on eth0
Jan 19 15:35:31 pg2 vip-manager[9292]: 2020/01/19 15:35:31 IP address 10.0.3.2/24 state is true, desired true
Jan 19 15:35:32 pg2 patroni[9229]: 2020-01-19 15:35:32,923 INFO: Lock owner: pg2; I am pg2

As one can see, the VIP is moved within one second.

Updated Ansible Playbook

Our Ansible-Playbook for the automated setup of a three-node cluster on Debian has also been updated and can now configure a VIP if so desired:

# ansible-playbook -i inventory -e vip=10.0.3.2 patroni.yml

Questions and Help

Do you have any questions or need help? Feel free to write to info@credativ.com.

The PostgreSQL® Global Development Group (PGDG) has released version 12 of the popular free PostgreSQL® database. As our article for Beta 4 has already indicated, a number of new features, improvements and optimizations have been incorporated into the release. These include among others:

Optimized disk space utilization and speed for btree indexes

btree-Indexes, the default index type in PostgreSQL®, has experienced some optimizations in PostgreSQL® 12.

btree indexes used to store duplicates (multiple entries with the same key values) in an unsorted order. This has resulted in suboptimal use of physical representation in these indexes. An optimization now stores these multiple key values in the same order as they are physically stored in the table. This improves disk space utilization and the effort required to manage corresponding btree type indexes. In addition, indexes with multiple indexed columns use an improved physical representation so that their storage utilization is also improved. To take advantage of this in PostgreSQL® 12, however, if they were upgraded to the new version using pg_upgrade via a binary upgrade, these indexes must be recreated or re-indexed.

Insert operations in btree indexes are also accelerated by improved locking.

Improvements for pg_checksums

credativ has contributed an extension for pg_checksums that allows to enable or disable block checksums in stopped PostgreSQL® instances. Previously, this could only be done by recreating the physical data representation of the cluster using initdb.
pg_checksums now has the option to display a status history on the console with the --progress parameter. The corresponding code contributions come from the colleagues Michael Banck and Bernd Helmle.

Optimizer Inlining of Common Table Expressions

Up to and including PostgreSQL® 11, the PostgreSQL® Optimizer was unable to optimize common table expressions (also called CTE or WITH queries). If such an expression was used in a query, the CTE was always evaluated and materialized first before the rest of the query was processed. This resulted in expensive execution plans for more complex CTE expressions. The following generic example illustrates this. A join is given with a CTE expression that filters all even numbers from a numeric column:

WITH t_cte AS (SELECT id FROM foo WHERE id % 2 = 0) SELECT COUNT(*) FROM t_cte JOIN bar USING(id);

In PostgreSQL® 11, using a CTE always leads to a CTE scan that materializes the CTE expression first:

EXPLAIN (ANALYZE, BUFFERS) WITH t_cte AS (SELECT id FROM foo WHERE id % 2 = 0) SELECT COUNT(*) FROM t_cte JOIN bar USING(id) ;
                                                       QUERY PLAN                                                        
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Aggregate  (cost=2231.12..2231.14 rows=1 width=8) (actual time=48.684..48.684 rows=1 loops=1)
   Buffers: shared hit=488
   CTE t_cte
     ->  Seq Scan on foo  (cost=0.00..1943.00 rows=500 width=4) (actual time=0.055..17.146 rows=50000 loops=1)
           Filter: ((id % 2) = 0)
           Rows Removed by Filter: 50000
           Buffers: shared hit=443
   ->  Hash Join  (cost=270.00..286.88 rows=500 width=0) (actual time=7.297..47.966 rows=5000 loops=1)
         Hash Cond: (t_cte.id = bar.id)
         Buffers: shared hit=488
         ->  CTE Scan on t_cte  (cost=0.00..10.00 rows=500 width=4) (actual time=0.063..31.158 rows=50000 loops=1)
               Buffers: shared hit=443
         ->  Hash  (cost=145.00..145.00 rows=10000 width=4) (actual time=7.191..7.192 rows=10000 loops=1)
               Buckets: 16384  Batches: 1  Memory Usage: 480kB
               Buffers: shared hit=45
               ->  Seq Scan on bar  (cost=0.00..145.00 rows=10000 width=4) (actual time=0.029..3.031 rows=10000 loops=1)
                     Buffers: shared hit=45
 Planning Time: 0.832 ms
 Execution Time: 50.562 ms
(19 rows)

This plan first materializes the CTE with a sequential scan with a corresponding filter (id % 2 = 0). Here no functional index is used, therefore this scan is correspondingly more expensive. Then the result of the CTE is linked to the table bar by Hash Join with the corresponding Join condition. With PostgreSQL® 12, the optimizer now has the ability to inline these CTE expressions without prior materialization. The underlying optimized plan in PostgreSQL® 12 will look like this:

EXPLAIN (ANALYZE, BUFFERS) WITH t_cte AS (SELECT id FROM foo WHERE id % 2 = 0) SELECT COUNT(*) FROM t_cte JOIN bar USING(id) ;
                                                                QUERY PLAN                                                                 
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Aggregate  (cost=706.43..706.44 rows=1 width=8) (actual time=9.203..9.203 rows=1 loops=1)
   Buffers: shared hit=148
   ->  Merge Join  (cost=0.71..706.30 rows=50 width=0) (actual time=0.099..8.771 rows=5000 loops=1)
         Merge Cond: (foo.id = bar.id)
         Buffers: shared hit=148
         ->  Index Only Scan using foo_id_idx on foo  (cost=0.29..3550.29 rows=500 width=4) (actual time=0.053..3.490 rows=5001 loops=1)
               Filter: ((id % 2) = 0)
               Rows Removed by Filter: 5001
               Heap Fetches: 10002
               Buffers: shared hit=74
         ->  Index Only Scan using bar_id_idx on bar  (cost=0.29..318.29 rows=10000 width=4) (actual time=0.038..3.186 rows=10000 loops=1)
               Heap Fetches: 10000
               Buffers: shared hit=74
 Planning Time: 0.646 ms
 Execution Time: 9.268 ms
(15 rows)

The advantage of this method is that there is no initial materialization of the CTE expression. Instead, the query is executed directly with a Join. This works for all non-recursive CTE expressions without side effects (for example, CTEs with write statements) and those that are referenced only once per query. The old behavior of the optimizer can be forced with the WITH ... AS MATERIALIZED ... directive.

Generated Columns

Generated Columns in PostgreSQL® 12 are materialized columns, which calculate a result based on expressions using existing column values. These are stored with the corresponding result values in the tuple. The advantage is that there is no need to create triggers for subsequent calculation of column values. The following simple example illustrates the new functionality using a price table with net and gross prices:

CREATE TABLE preise(netto numeric,
                    brutto numeric GENERATED ALWAYS AS (netto * 1.19) STORED);
 
INSERT INTO preise VALUES(17.30);
INSERT INTO preise VALUES(225);
INSERT INTO preise VALUES(247);
INSERT INTO preise VALUES(19.15);
 
SELECT * FROM preise;
 netto │ brutto
───────┼─────────
 17.30 │ 20.5870
   225 │  267.75
   247 │  293.93
 19.15 │ 22.7885
(4 rows)

The column brutto is calculated directly from the net price. The keyword STORED is mandatory. Of course, indexes can also be created on Generated Columns, but they cannot be part of a primary key. Furthermore, the SQL expression must be unique, i.e. it must return the same result even if the input quantity is the same. Columns declared as Generated Columns cannot be used explicitly in INSERT or UPDATE operations. If a column list is absolutely necessary, the corresponding value can be indirectly referenced with the keyword DEFAULT.

Omission of explicit OID columns

Explicit OID columns have historically been a way to create unique column values so that a table row can be uniquely identified database-wide. However, for a long time PostgreSQL® has only created these explicitly and considered their basic functionality obsolete. With PostgreSQL® the possibility to create such columns explicitly is now finally abolished. This means that it will no longer be possible to specify the WITH OIDS directive for tables. System tables that have always referenced OID objects uniquely will now return OID values without explicitly specifying OID columns in the result set. Especially older software, which handled catalog queries carelessly, could get problems with a double column output.

Moving recovery.conf to postgresql.conf

Up to and including PostgreSQL® 11, database recovery and streaming replication instances were configured via a separate configuration file recovery.conf.

With PostgreSQL® 12, all configuration work done there now migrates to postgresql.conf. The recovery.conf file is no longer required. PostgreSQL® 12 refuses to start as soon as this file exists. Whether recovery or streaming standby is desired is now decided either by a recovery.signal file (for recovery) or by a standby.signal file (for standby systems). The latter has priority if both files are present. The old parameter standby_mode, which controlled this behavior since then, has been removed.

For automatic deployments of high-availability systems, this means a major change. However, it is now also possible to perform corresponding configuration work almost completely using the ALTER SYSTEM command.

REINDEX CONCURRENTLY

With PostgreSQL® 12 there is now a way to re-create indexes with as few locks as possible. This greatly simplifies one of the most common maintenance tasks in very write-intensive databases. Previously, a combination of CREATE INDEX CONCURRENTLY and DROP INDEX CONCURRENTLY had to be used. In doing so, it was also necessary to ensure that index names were reassigned accordingly, if required.

The release notes give an even more detailed overview of all new features and above all incompatibilities with previous PostgreSQL® versions.

Yesterday, the fourth beta of the upcoming PostgreSQL®-major version 12 was released.

Compared to its predecessor PostgreSQL® 11, there are many new features:

Of course, PostgreSQL® 12 will be tested using sqlsmith, the SQL “fuzzer” from our colleague Andreas Seltenreich. Numerous bugs in different PostgreSQL® versions were found with sqlsmith by using randomly generated SQL queries.

Debian and Ubuntu packages for PostgreSQL® 12 are going to be published on apt.postgresql.org with credativ’s help. This work will be handled by our colleague Christoph Berg.

The release of PostgreSQL® 12 is expected in the next weeks.

In this article we will look at the highly available operation of PostgreSQL® in a Kubernetes environment. A topic that is certainly of particular interest to many of our PostgreSQL® users.

Together with our partner company MayaData, we will demonstrate below the application possibilities and advantages of the extremely powerful open source project – OpenEBS.

OpenEBS is a freely available storage management system, whose development is supported and backed by MayaData.

We would like to thank Murat-Karslioglu from MayaData and our colleague Adrian Vondendriesch for this interesting and helpful article. This article simultaneously also appeared on OpenEBS.io.

PostgreSQL® anywhere — via Kubernetes with some help from OpenEBS and credativ engineering

by Murat Karslioglu, OpenEBS and Adrian Vondendriesch, credativ

Introduction

If you are already running Kubernetes on some form of cloud whether on-premises or as a service, you understand the ease-of-use, scalability and monitoring benefits of Kubernetes — and you may well be looking at how to apply those benefits to the operation of your databases.

PostgreSQL® remains a preferred relational database, and although setting up a highly available Postgres cluster from scratch might be challenging at first, we are seeing patterns emerging that allow PostgreSQL® to run as a first class citizen within Kubernetes, improving availability, reducing management time and overhead, and limiting cloud or data center lock-in.

There are many ways to run high availability with PostgreSQL®; for a list, see the PostgreSQL® Documentation. Some common cloud-native Postgres cluster deployment projects include Crunchy Data’s, Sorint.lab’s Stolon and Zalando’s Patroni/Spilo. Thus far we are seeing Zalando’s operator as a preferred solution in part because it seems to be simpler to understand and we’ve seen it operate well.

Some quick background on your authors:

  • OpenEBS is a broadly deployed OpenSource storage and storage management project sponsored by MayaData.
  • credativ is a leading open source support and engineering company with particular depth in PostgreSQL®.

In this blog, we’d like to briefly cover how using cloud-native or “container attached” storage can help in the deployment and ongoing operations of PostgreSQL® on Kubernetes. This is the first of a series of blogs we are considering — this one focuses more on why users are adopting this pattern and future ones will dive more into the specifics of how they are doing so.

At the end you can see how to use a Storage Class and a preferred operator to deploy PostgreSQL® with OpenEBS underlying

If you are curious about what container attached storage of CAS is you can read more from the Cloud Native Computing Foundation (CNCF) here.

Conceptually you can think of CAS as being the decomposition of previously monolithic storage software into containerized microservices that themselves run on Kubernetes. This gives all the advantages of running Kubernetes that already led you to run Kubernetes — now applied to the storage and data management layer as well. Of special note is that like Kubernetes, OpenEBS runs anywhere so the same advantages below apply whether on on-premises or on any of the many hosted Kubernetes services.

PostgreSQL® plus OpenEBS

®-with-OpenEBS-persistent-volumes.png”>Postgres-Operator (for cluster deployment)

  • Docker installed
  • Kubernetes 1.9+ cluster installed
  • kubectl installed
  • OpenEBS installed
  • Install OpenEBS

    1. If OpenEBS is not installed in your K8s cluster, this can be done from here. If OpenEBS is already installed, go to the next step.
    2. Connect to MayaOnline (Optional): Connecting the Kubernetes cluster to MayaOnline provides good visibility of storage resources. MayaOnline has various support options for enterprise customers.

    Configure cStor Pool

    1. If cStor Pool is not configured in your OpenEBS cluster, this can be done from here. As PostgreSQL® is a StatefulSet application, it requires a single storage replication factor. If you prefer additional redundancy you can always increase the replica count to 3.
      During cStor Pool creation, make sure that the maxPools parameter is set to >=3. If a cStor pool is already configured, go to the next step. Sample YAML named openebs-config.yaml for configuring cStor Pool is provided in the Configuration details below.

    openebs-config.yaml

    #Use the following YAMLs to create a cStor Storage Pool.
    # and associated storage class.
    apiVersion: openebs.io/v1alpha1
    kind: StoragePoolClaim
    metadata:
     name: cstor-disk
    spec:
     name: cstor-disk
     type: disk
     poolSpec:
     poolType: striped
     # NOTE — Appropriate disks need to be fetched using `kubectl get disks`
     #
     # `Disk` is a custom resource supported by OpenEBS with `node-disk-manager`
     # as the disk operator
    # Replace the following with actual disk CRs from your cluster `kubectl get disks`
    # Uncomment the below lines after updating the actual disk names.
     disks:
     diskList:
    # Replace the following with actual disk CRs from your cluster from `kubectl get disks`
    # — disk-184d99015253054c48c4aa3f17d137b1
    # — disk-2f6bced7ba9b2be230ca5138fd0b07f1
    # — disk-806d3e77dd2e38f188fdaf9c46020bdc
    # — disk-8b6fb58d0c4e0ff3ed74a5183556424d
    # — disk-bad1863742ce905e67978d082a721d61
    # — disk-d172a48ad8b0fb536b9984609b7ee653
     — -

    Create Storage Class

    1. You must configure a StorageClass to provision cStor volume on a cStor pool. In this solution, we are using a StorageClass to consume the cStor Pool which is created using external disks attached on the Nodes. The storage pool is created using the steps provided in the Configure StoragePool section. In this solution, PostgreSQL® is a deployment. Since it requires replication at the storage level the cStor volume replicaCount is 3. Sample YAML named openebs-sc-pg.yaml to consume cStor pool with cStorVolume Replica count as 3 is provided in the configuration details below.

    openebs-sc-pg.yaml

    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: openebs-postgres
      annotations:
        openebs.io/cas-type: cstor
        cas.openebs.io/config: |
          - name: StoragePoolClaim
            value: "cstor-disk"
          - name: ReplicaCount
            value: "3"       
    provisioner: openebs.io/provisioner-iscsi
    reclaimPolicy: Delete
    ---

    Launch and test Postgres Operator

    1. Clone Zalando’s Postgres Operator.
    git clone https://github.com/zalando/postgres-operator.git
    cd postgres-operator

    Use the OpenEBS storage class

    1. Edit manifest file and add openebs-postgres as the storage class.
    nano manifests/minimal-postgres-manifest.yaml

    After adding the storage class, it should look like the example below:

    apiVersion: "acid.zalan.do/v1"
    kind: postgresql
    metadata:
      name: acid-minimal-cluster
      namespace: default
    spec:
      teamId: "ACID"
      volume:
        size: 1Gi
        storageClass: openebs-postgres
      numberOfInstances: 2
      users:
        # database owner
        zalando:
        - superuser
        - createdb
     
    # role for application foo
        foo_user: []
     
    #databases: name->owner
      databases:
        foo: zalando
      postgresql:
        version: "10"
        parameters:
          shared_buffers: "32MB"
          max_connections: "10"
          log_statement: "all"

    Start the Operator

    1. Run the command below to start the operator
    kubectl create -f manifests/configmap.yaml # configuration
    kubectl create -f manifests/operator-service-account-rbac.yaml # identity and permissions
    kubectl create -f manifests/postgres-operator.yaml # deployment

    Create a Postgres cluster on OpenEBS

    Optional: The operator can run in a namespace other than default. For example, to use the test namespace, run the following before deploying the operator’s manifests:

    kubectl create namespace test
    kubectl config set-context $(kubectl config current-context) — namespace=test
    1. Run the command below to deploy from the example manifest:
    kubectl create -f manifests/minimal-postgres-manifest.yaml

    2. It only takes a few seconds to get the persistent volume (PV) for the pgdata-acid-minimal-cluster-0 up. Check PVs created by the operator using the kubectl get pv command:

    $ kubectl get pv
    NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
    pvc-8852ceef-48fe-11e9–9897–06b524f7f6ea 1Gi RWO Delete Bound default/pgdata-acid-minimal-cluster-0 openebs-postgres 8m44s
    pvc-bfdf7ebe-48fe-11e9–9897–06b524f7f6ea 1Gi RWO Delete Bound default/pgdata-acid-minimal-cluster-1 openebs-postgres 7m14s

    Connect to the Postgres master and test

    1. If it is not installed previously, install psql client:
    sudo apt-get install postgresql-client

    2. Run the command below and note the hostname and host port.

    kubectl get service — namespace default |grep acid-minimal-cluster

    3. Run the commands below to connect to your PostgreSQL® DB and test. Replace the [HostPort] below with the port number from the output of the above command:

    export PGHOST=$(kubectl get svc -n default -l application=spilo,spilo-role=master -o jsonpath="{.items[0].spec.clusterIP}")
    export PGPORT=[HostPort]
    export PGPASSWORD=$(kubectl get secret -n default postgres.acid-minimal-cluster.credentials -o ‘jsonpath={.data.password}’ | base64 -d)
    psql -U postgres -c ‘create table foo (id int)’

    Congrats you now have the Postgres-Operator and your first test database up and running with the help of cloud-native OpenEBS storage.

    Partnership and future direction

    As this blog indicates, the teams at MayaData / OpenEBS and credativ are increasingly working together to help organizations running PostgreSQL® and other stateful workloads. In future blogs, we’ll provide more hands-on tips.

    We are looking for feedback and suggestions on where to take this collaboration. Please provide feedback below or find us on Twitter or on the OpenEBS slack community.

    Patroni is a PostgreSQL high availability solution with a focus on containers and Kubernetes. Until recently, the available Debian packages had to be configured manually and did not integrate well with the rest of the distribution. For the upcoming Debian 10 “Buster” release, the Patroni packages have been integrated into Debian’s standard PostgreSQL framework by credativ. They now allow for an easy setup of Patroni clusters on Debian or Ubuntu.

    Patroni employs a “Distributed Consensus Store” (DCS) like Etcd, Consul or Zookeeper in order to reliably run a leader election and orchestrate automatic failover. It further allows for scheduled switchovers and easy cluster-wide changes to the configuration. Finally, it provides a REST interface that can be used together with HAProxy in order to build a load balancing solution. Due to these advantages Patroni has gradually replaced Pacemaker as the go-to open-source project for PostgreSQL high availability.

    However, many of our customers run PostgreSQL on Debian or Ubuntu systems and so far Patroni did not integrate well into those. For example, it does not use the postgresql-common framework and its instances were not displayed in pg_lsclusters output as usual.

    Integration into Debian

    In a collaboration with Patroni lead developer Alexander Kukushkin from Zalando the Debian Patroni package has been integrated into the postgresql-common framework to a large extent over the last months. This was due to changes both in Patroni itself as well as additional programs in the Debian package. The current Version 1.5.5 of Patroni contains all these changes and is now available in Debian “Buster” (testing) in order to setup Patroni clusters.

    The packages are also available on apt.postgresql.org and thus installable on Debian 9 “Stretch” and Ubuntu 18.04 “Bionic Beaver” LTS for any PostgreSQL version from 9.4 to 11.

    The most important part of the integration is the automatic generation of a suitable Patroni configuration with the pg_createconfig_patroni command. It is run similar to pg_createcluster with the desired PostgreSQL major version and the instance name as parameters:

    pg_createconfig_patroni 11 test
    

    This invocation creates a file /etc/patroni/11-test.yml, using the DCS configuration from /etc/patroni/dcs.yml which has to be adjusted according to the local setup. The rest of the configuration is taken from the template /etc/patroni/config.yml.in which is usable in itself but can be customized by the user according to their needs. Afterwards the Patroni instance is started via systemd similar to regular PostgreSQL instances:

    systemctl start patroni@11-test
    

    A simple 3-node Patroni cluster can be created and started with the following few commands, where the nodes pg1, pg2 and pg3 are considered to be hostnames and the local file dcs.yml contains the DCS configuration:

    
    for i in pg1 pg2 pg3; do ssh $i 'apt -y install postgresql-common'; done
    for i in pg1 pg2 pg3; do ssh $i 'sed -i "s/^#create_main_cluster = true/create_main_cluster = false/" /etc/postgresql-common/createcluster.conf'; done
    for i in pg1 pg2 pg3; do ssh $i 'apt -y install patroni postgresql'; done
    for i in pg1 pg2 pg3; do scp ./dcs.yml $i:/etc/patroni; done
    for i in pg1 pg2 pg3; do ssh @$i 'pg_createconfig_patroni 11 test' && systemctl start patroni@11-test'; done
    

    Afterwards, you can get the state of the Patroni cluster via

    ssh pg1 'patronictl -c /etc/patroni/11-patroni.yml list'
    +---------+--------+------------+--------+---------+----+-----------+
    | Cluster | Member |    Host    |  Role  |  State  | TL | Lag in MB |
    +---------+--------+------------+--------+---------+----+-----------+
    | 11-test |  pg1   | 10.0.3.111 | Leader | running |  1 |           |
    | 11-test |  pg2   | 10.0.3.41  |        | stopped |    |   unknown |
    | 11-test |  pg3   | 10.0.3.46  |        | stopped |    |   unknown |
    +---------+--------+------------+--------+---------+----+-----------+
    

    Leader election has happened and pg1 has become the primary. It created its instance with the Debian-specific pg_createcluster_patroni program that runs pg_createcluster in the background. Then the two other nodes clone from the leader using the pg_clonecluster_patroni program which sets up an instance using pg_createcluster and then runs pg_basebackup from the primary. After that, all nodes are up and running:

    +---------+--------+------------+--------+---------+----+-----------+
    | Cluster | Member |    Host    |  Role  |  State  | TL | Lag in MB |
    +---------+--------+------------+--------+---------+----+-----------+
    | 11-test |  pg1   | 10.0.3.111 | Leader | running |  1 |         0 |
    | 11-test |  pg2   | 10.0.3.41  |        | running |  1 |         0 |
    | 11-test |  pg3   | 10.0.3.46  |        | running |  1 |         0 |
    +---------+--------+------------+--------+---------+----+-----------+
    

    The well-known Debian postgresql-common commands work as well:

    ssh pg1 'pg_lsclusters'
    Ver Cluster Port Status Owner    Data directory                 Log file
    11  test    5432 online postgres /var/lib/postgresql/11/test    /var/log/postgresql/postgresql-11-test.log
    

    Failover Behaviour

    If the primary is abruptly shutdown, its leader token will expire after a while and Patroni will eventually initiate failover and a new leader election:

    +---------+--------+-----------+------+---------+----+-----------+
    | Cluster | Member |    Host   | Role |  State  | TL | Lag in MB |
    +---------+--------+-----------+------+---------+----+-----------+
    | 11-test |  pg2   | 10.0.3.41 |      | running |  1 |         0 |
    | 11-test |  pg3   | 10.0.3.46 |      | running |  1 |         0 |
    +---------+--------+-----------+------+---------+----+-----------+
    [...]
    +---------+--------+-----------+--------+---------+----+-----------+
    | Cluster | Member |    Host   |  Role  |  State  | TL | Lag in MB |
    +---------+--------+-----------+--------+---------+----+-----------+
    | 11-test |  pg2   | 10.0.3.41 | Leader | running |  2 |         0 |
    | 11-test |  pg3   | 10.0.3.46 |        | running |  1 |         0 |
    +---------+--------+-----------+--------+---------+----+-----------+
    [...]
    +---------+--------+-----------+--------+---------+----+-----------+
    | Cluster | Member |    Host   |  Role  |  State  | TL | Lag in MB |
    +---------+--------+-----------+--------+---------+----+-----------+
    | 11-test |  pg2   | 10.0.3.41 | Leader | running |  2 |         0 |
    | 11-test |  pg3   | 10.0.3.46 |        | running |  2 |         0 |
    +---------+--------+-----------+--------+---------+----+-----------+
    

    The old primary will rejoin the cluster as standby once it is restarted:

    +---------+--------+------------+--------+---------+----+-----------+
    | Cluster | Member |    Host    |  Role  |  State  | TL | Lag in MB |
    +---------+--------+------------+--------+---------+----+-----------+
    | 11-test |  pg1   | 10.0.3.111 |        | running |    |   unknown |
    | 11-test |  pg2   | 10.0.3.41  | Leader | running |  2 |         0 |
    | 11-test |  pg3   | 10.0.3.46  |        | running |  2 |         0 |
    +---------+--------+------------+--------+---------+----+-----------+
    [...]
    +---------+--------+------------+--------+---------+----+-----------+
    | Cluster | Member |    Host    |  Role  |  State  | TL | Lag in MB |
    +---------+--------+------------+--------+---------+----+-----------+
    | 11-test |  pg1   | 10.0.3.111 |        | running |  2 |         0 |
    | 11-test |  pg2   | 10.0.3.41  | Leader | running |  2 |         0 |
    | 11-test |  pg3   | 10.0.3.46  |        | running |  2 |         0 |
    +---------+--------+------------+--------+---------+----+-----------+
    

    If a clean rejoin is not possible due to additional transactions on the old timeline the old primary gets re-cloned from the current leader. In case the data is too large for a quick re-clone, pg_rewind can be used. In this case a password needs to be set for the postgres user and regular database connections (as opposed to replication connections) need to be allowed between the cluster nodes.

    Creation of additional Instances

    It is also possible to create further clusters with pg_createconfig_patroni, one can either assign a PostgreSQL port explicitly via the --port option, or let pg_createconfig_patroni assign the next free port as is known from pg_createcluster:

    for i in pg1 pg2 pg3; do ssh $i 'pg_createconfig_patroni 11 test2 && systemctl start patroni@11-test2'; done
    ssh pg1 'patronictl -c /etc/patroni/11-test2.yml list'
    +----------+--------+-----------------+--------+---------+----+-----------+
    | Cluster  | Member |       Host      |  Role  |  State  | TL | Lag in MB |
    +----------+--------+-----------------+--------+---------+----+-----------+
    | 11-test2 |  pg1   | 10.0.3.111:5433 | Leader | running |  1 |         0 |
    | 11-test2 |  pg2   |  10.0.3.41:5433 |        | running |  1 |         0 |
    | 11-test2 |  pg3   |  10.0.3.46:5433 |        | running |  1 |         0 |
    +----------+--------+-----------------+--------+---------+----+-----------+
    

    Ansible Playbook

    In order to easily deploy a 3-node Patroni cluster we have created an Ansible playbook on Github. It automates the installation and configuration of PostgreSQL and Patroni on the three nodes, as well as the DCS server on a fourth node.

    Questions and Help

    Do you have any questions or need help? Feel free to write to info@credativ.com.