A total of 153 participants had the opportunity to attend 21 talks and visit 15 different sponsors, covering virtually every topic related to PostgreSQL and its community.
Also present this time was the Sheldrick Wildlife Trust, which is dedicated to rescuing elephants and rhinos. Attendees could learn about the project, make donations, and participate in a raffle.
The talks ranged from topics such as benchmarking and crash recovery to big data.
Our colleague was there as well, presenting his talk “Postgres with many data: To MAXINT and beyond”.
As a special highlight, at the end of the day, before the networking event, and in addition to the almost obligatory lightning talks, there was a “Celebrity DB Deathmatch” where various community representatives came together for a very entertaining stage performance to find the best database in different disciplines. To everyone’s (admittedly not great) surprise, PostgreSQL was indeed able to excel in every category.
Additionally, our presence with our own booth gave us the opportunity to have many very interesting conversations and discussions with various community members, as well as sponsors and visitors in general.
For the first time since credativ regained its independence, the new managing director of credativ GmbH was also on site to see things for himself.
All in all, it was a (still) somewhat smaller, but as always very instructive and familiar event. We are already looking forward to the next one and thank the organizers and the entire team, both on site and behind the scenes.
This weekend, it was time once again for FrOScon 2025. With perfect summer weather – not too hot, not too cold – the event took place at the university in Sankt Augustin, as it does every year. But this year was different: FrOScon celebrated its 20th anniversary. I have a special connection to the conference. Not only is it the largest local open source conference, but I have also been a part of it for many years. I have given countless presentations, organised Debian stands, and run and organised developer tracks. In 2007, I even had the pleasure of being part of the organising team. In a way, FrOScon is like coming home. Everywhere you look, you see familiar faces you've known for many years, former and current colleagues, good friends from the open source scene. The only comparable event is Fosdem in Brussels – which is not surprising, as Fosdem is the great role model for FrOScon.
A journey through time – 20 years of FrOScon
I was already involved in the first FrOScon as an exhibitor for the Debian project, speaker and organiser of the Debian track (together with my colleague Martin Zobel-Helas, who is still involved). I could probably still give the presentation on techniques for combating spam today without changing a word, as the topic has lost none of its relevance. The first FrOScon closed its doors with a very respectable attendance of around 300 visitors. I was so enthusiastic that I got directly involved in the organisation the following year – this was also the year we introduced the famous bouncy castle ;). Over the course of 20 years, FrOScon has developed into one of the largest German OSS community events and now attracts around 1,500 visitors every year. Over the years, the conference has also attracted various prominent keynote speakers, including Andrew Tanenbaum and Richard Stallman, to name but a few. Then as now, I can say that FrOScon is the best conference in the West.
FrOScon 2025
Let's get to the actual content of this blog post – FrOScon 2025. We were represented this year with two presentations, and several colleagues attended as visitors. My colleague Patrick Lauer devoted his session in one of the rooms to the topic of "Postgres with many data". Thanks to c3voc, the presentation is also available as a stream for anyone who was unable to attend, so everyone can fill this knowledge gap at a later date.
I also gave a presentation again this year. The topic was Proxmox VE and external block devices. I talked about the different types of block devices, such as iSCSI, Fibre Channel and NVMe-oF, and how to use them. This presentation is also available in the c3voc video archive.

Conclusion
This FrOScon was once again a successful event for community members and interested professionals. It's a bit like a class reunion: you see lots of familiar faces and are excited to find out what everyone has been up to since you last met. Even if I'm hopefully retired by then, I would be delighted to give a presentation at FrOScon #40 ;).
I created a new project directory and started experimenting.
Environment and Objective
The goal was to set up PostgreSQL 18 Beta with OAuth2 authentication, and for the identity provider, I chose Keycloak.
I decided to use Docker Compose to quickly and easily spin up both services.
Docker Compose Configuration
Here is the docker-compose.yml that I used:
services:
  mock-oauth2:
    image: quay.io/keycloak/keycloak:24.0
    command: start --https-port=8080 --https-key-store-file=/etc/x509/https/localhost.p12 --https-key-store-password=changeit --import-realm --hostname-strict=false --hostname-url=https://mock-oauth2:8080
    ports:
      - "8080:8080"
    environment:
      KEYCLOAK_ADMIN: admin
      KEYCLOAK_ADMIN_PASSWORD: admin
    volumes:
      - ./certs:/etc/x509/https
      - ./keycloak-realm.json:/opt/keycloak/data/import/realm.json
    networks:
      - pgnet
  postgres18:
    build: .
    ports:
      - "5432:5432"
    environment:
      POSTGRES_PASSWORD: postgres
    volumes:
      - ./postgres/postgresql.conf:/etc/postgresql/postgresql.conf
      - ./postgres/pg_hba.conf:/etc/postgresql/pg_hba.conf
    command: ["-c", "config_file=/etc/postgresql/postgresql.conf"]
    networks:
      - pgnet
networks:
  pgnet:
    driver: bridge
PostgreSQL expects a connection to the OAuth2 issuer over HTTPS. The same URL must also be reachable from the host, for example, when using the Keycloak UI.
This means: the certificate must be valid and trusted both on the host and in the container. The hostname used (e.g., https://mock-oauth2:8080) must be correctly resolvable in both environments.
For this, I added the following line to my /etc/hosts file on the host:
127.0.0.1 mock-oauth2
This allowed both the PostgreSQL container and my host to reach the Keycloak service at https://mock-oauth2:8080.
TLS Certificate for Keycloak
Keycloak absolutely requires an HTTPS endpoint for the OAuth2 issuer URL to be accepted by PostgreSQL.
For this, I created a self-signed certificate and converted it into a .p12 keystore package that Keycloak can use.
The .p12 file is mounted into the container via the following volume mount:
volumes:
  - ./certs:/etc/x509/https
The certs directory contains the file localhost.p12, which I created from my self-signed key and certificate as follows:
openssl req -x509 -nodes -days 365 \
  -newkey rsa:2048 \
  -keyout server.key \
  -out server.crt \
  -subj "/CN=mock-oauth2" \
  -addext "subjectAltName = DNS:mock-oauth2,DNS:localhost,IP:127.0.0.1"
Keycloak Realm Configuration
I created a minimalist realm file for Keycloak. It contains a client named postgres and a user with corresponding credentials. Content of keycloak-realm.json:
{
  "realm": "pg",
  "enabled": true,
  "clients": [
    {
      "clientId": "postgres",
      "enabled": true,
      "publicClient": false,
      "redirectUris": ["*"],
      "protocol": "openid-connect",
      "secret": "postgres",
      "directAccessGrantsEnabled": true,
      "standardFlowEnabled": true
    }
  ],
  "users": [
    {
      "username": "postgres",
      "enabled": true,
      "credentials": [
        {
          "type": "password",
          "value": "postgres"
        }
      ]
    }
  ]
}
After importing the realm, Keycloak was ready and the default scope was visible in the UI.
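As a quick sanity check (not part of the original setup steps), the realm's OpenID discovery document can be fetched from the host; the -k flag skips verification of the self-signed certificate:
curl -k https://mock-oauth2:8080/realms/pg/.well-known/openid-configuration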
Installation of libpq-oauth and oauth_validator
I had to extend the official PostgreSQL image in order to install additional dependencies, namely the libpq-oauth package and the oauth_validator validator.
PostgreSQL 18 experimentally supports OAuth2 authentication. However, PostgreSQL does not include its own validator library. The official documentation states:
The PostgreSQL distribution does not include any libraries for
OAuth2 token validation. Users must provide their own solution
or compile it themselves.
— PostgreSQL Docs, oauth_validator_libraries
For testing, I used the open-source implementation from TantorLabs: https://github.com/TantorLabs/oauth_validator (the repository cloned in the Dockerfile below). This minimalist C library can be compiled and registered via oauth_validator_libraries in PostgreSQL.
Dockerfile Used
FROM postgres:18beta1
USER root
RUN apt-get update \
    && apt-get install -y libpq-oauth build-essential libkrb5-dev \
       libsasl2-dev libcurl4-openssl-dev postgresql-server-dev-18 git \
    && git clone https://github.com/TantorLabs/oauth_validator.git /tmp/oauth_validator \
    && cd /tmp/oauth_validator \
    && make && make install \
    && rm -rf /tmp/oauth_validator \
    && apt-get remove -y build-essential git \
    && apt-get autoremove -y && rm -rf /var/lib/apt/lists/*
I then used this image for the `postgres18` service in my Docker Compose setup.
Trusting the Keycloak CA with PostgreSQL
PostgreSQL must trust the certificate presented by Keycloak, otherwise the connection to the OAuth2 issuer will be rejected.
For this, I copied the `mock-oauth.crt` file into the PostgreSQL container and placed it in the typical CA path:
/usr/local/share/ca-certificates/
Then, inside the container, I executed the following command:
update-ca-certificates
After that, the certificate was accepted and PostgreSQL could successfully verify the connection to the HTTPS issuer.
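For reference, with the Compose setup above both steps can also be performed from the host, roughly like this (a sketch using Docker Compose v2; adjust the certificate file name to whatever you exported):
# copy the self-signed certificate into the container and register it
docker compose cp certs/server.crt postgres18:/usr/local/share/ca-certificates/mock-oauth.crt
docker compose exec -u root postgres18 update-ca-certificates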
PostgreSQL Configuration
In pg_hba.conf, I added the following line:
host all all all oauth scope="profile" issuer="https://mock-oauth2:8080/realms/pg" map="oauthmap"
In pg_ident.conf, I mapped the identity provided by the token to the PostgreSQL user:
oauthmap "postgresID" "postgres"
This mapping may need to be adjusted – depending on how your Keycloak client is configured and which field (e.g., preferred_username or sub) is passed in the token.
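In addition, the validator library built in the Dockerfile has to be registered in postgresql.conf. The exact library name depends on what the validator's make install places into the PostgreSQL library directory; the name below is an assumption and must be adjusted accordingly:
# postgresql.conf (excerpt) – hypothetical library name
oauth_validator_libraries = 'oauth_validator'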
Connection Test with OAuth2
To test the connection, I used the following `psql` command:
psql "host=localhost \
port=5432 \
dbname=postgres \
user=postgres \
oauth_issuer=https://mock-oauth2:8080/realms/pg \
oauth_client_id=postgres \
oauth_client_secret=changeme \
oauth_scope=profile"
After the call, a device code message appears, such as:
Visit `https://mock-oauth2:8080/realms/pg/device` and enter the code FBAD-XXYZ.
After logging in with the user credentials, `psql` successfully establishes a connection to PostgreSQL via OAuth2.
Insights and Tips
- PostgreSQL 18 requires HTTPS for the OAuth2 issuer URL – even in a local setup.
- pg_hba.conf is sensitive to formatting errors. I had to reload the configuration multiple times (SELECT pg_reload_conf();) and carefully analyze the logs.
- To trust a local certificate authority, it is sufficient to copy the .crt certificate into the container and register it with update-ca-certificates.
- Keycloak is well-suited for testing with OAuth2, but you may need to experiment with scopes, claims, and secrets until everything aligns with PostgreSQL.
Conclusion
This was an exciting, practical experiment with a promising new feature in PostgreSQL. OAuth2 integration brings PostgreSQL closer to modern identity management solutions and simplifies operation in environments with centralized authentication.
On Thursday, 26 June and Friday, 27 June 2025, my colleague Patrick Lauer and I had the amazing opportunity to attend Swiss PGDay 2025, held at the OST Eastern Switzerland University of Applied Sciences in Rapperswil. This two-day PostgreSQL conference featured two parallel tracks of presentations in English and German, bringing together users and experts primarily from across Switzerland. Our company, credativ, was among the supporters of this year’s conference.
During the event, Patrick delivered an engaging session titled “Postgres with many data: To MAXINT and beyond,” which built on past discussions about massive-scale Postgres usage. He highlighted the practical issues that arise when handling extremely large datasets in PostgreSQL – for instance, how even a simple SELECT COUNT(*) can become painfully slow, and how backups and restores can take days on very large datasets. He also shared strategies to manage performance effectively at these scales.
I presented a significantly updated version of my talk, “Building a Data Lakehouse with PostgreSQL: Dive into Formats, Tools, Techniques, and Strategies.” It covered modern data formats and frameworks such as Apache Iceberg, addressing key challenges in lakehouse architectures – from governance, privacy, and compliance, to data quality checks and AI/ML use cases. The talk emphasized PostgreSQL’s capability to play a central role in today’s data lakehouse and AI landscape. At the close of the conference, I delivered a brief lightning talk showcasing our new open-source migration tool, “credativ-pg-migrator.”
(c) photos by Gülçin Yıldırım Jelinek
The conference schedule was packed with many high-quality, insightful talks. We would particularly like to highlight:
* Bruce Momjian – “How Open Source and Democracy Drive Postgres”: In his keynote, Bruce Momjian outlined how PostgreSQL’s open-source development model and democratic governance have powered its success. He explained the differences between open-source and proprietary models, reviewed PostgreSQL’s governance history, and illustrated how democratic, open processes result in robust software and a promising future for Postgres.
* Gülçin Yıldırım Jelinek – “Anatomy of Table-Level Locks in PostgreSQL”: The session covered the fundamentals of PostgreSQL’s table-level locking mechanisms. She explained how different lock modes are acquired and queued during schema changes, helping attendees understand how to manage lock conflicts, minimize downtime, and avoid deadlocks during high-concurrency DDL operations.
* Aarno Aukia – “Operating PostgreSQL at Scale: Lessons from Hundreds of Instances in Regulated Private Clouds”: The speaker shared lessons from running extensive Postgres environments in highly regulated industries. He discussed architectural patterns, automation strategies, and “day-2 operations” practices that VSHN uses to meet stringent availability, compliance, and audit requirements, including secure multi-tenancy, declarative deployments, backups, monitoring, and lifecycle management in mission-critical cloud-native setups.
* Bertrand Hartwig-Peillon – “pgAssistant”: The author introduced pgAssistant, an open-source tool designed to help developers optimize PostgreSQL schemas and queries before production deployment. He demonstrated how pgAssistant combines deterministic analysis with an AI-driven approach to detect schema inconsistencies and suggest optimizations, effectively automating best practices and performance tuning within development workflows.
* Gianni Ciolli – “The Why and What of WAL”: Gianni Ciolli provided, in great Italian style, a concise history and overview of PostgreSQL’s Write-Ahead Log (WAL). He explained WAL’s central role in PostgreSQL for crash safety, backups, and replication, showcasing examples of WAL-enabled features like fast crash recovery, efficient hot backups, physical replication, and logical decoding.
* Daniel Krefl – “Hacking pgvector for performance”: The speaker presented an enhanced version of the pgvector extension for massive data processing, optimized by maintaining the vector index outside PostgreSQL memory and offloading computations, including GPU integration. He detailed the process of moving pgvector’s core logic externally for improved speed, demonstrating notable performance gains in the EU AERO project context. He also talked about the distributed PostgreSQL forks XC, XL and TBase, which are unfortunately stuck on the old version 10, and how he ported changes from these projects to version 16.
* Luigi Nardi – “A benchmark study on the impact of PostgreSQL server parameter tuning”: Luigi Nardi presented comprehensive benchmark results on tuning PostgreSQL configuration parameters. Highlighting that many users run with default settings, he demonstrated how significant performance improvements can be achieved through proper tuning across various workloads (OLTP, OLAP, etc.), providing actionable insights tailored to specific environments.
* Renzo Dani – “From Oracle to PostgreSQL: A HARD Journey and an Open-Source Awakening”: The author recounted his experiences migrating a complex enterprise application from Oracle to PostgreSQL, addressing significant challenges such as implicit type casting, function overloading differences, JDBC driver issues, and SQL validation problems. He also highlighted the benefits, including faster CI pipelines, more flexible deployments, and innovation opportunities provided by open-source Postgres, along with practical advice on migration tools, testing strategies, and managing trade-offs.
(c) photo by Swiss PostgreSQL User Group
At the end of the first day, all participants enjoyed a networking dinner. We both want to sincerely thank the Swiss PGDay organizers (Swiss PostgreSQL User Group) for an amazing event. Swiss PGDay 2025 was a memorable and valuable experience, offering great learning and networking opportunities. We are also very grateful to credativ for enabling our participation, and we look forward to future editions of this excellent conference.
Many companies these days are thinking about migrating their databases from legacy or proprietary system to PostgreSQL. The primary aim is to reduce costs, enhance capabilities, and ensure long-term sustainability. However, even just the idea of migrating to PostgreSQL can be overwhelming. Very often, knowledge about the legacy applications is limited or even lost. In some cases, vendor support is diminishing, and expert pools and community support are shrinking. Legacy databases are also often running on outdated hardware and old operating systems, posing further risks and limitations.
We are very pleased to be hosting the 5th PostgreSQL User Group NRW Meetup – in our new offices right next to Borussia Park in Mönchengladbach!
PGConf.DE 2025, the 9th Annual PostgreSQL Conference Germany, was held on May 8–9, 2025, at the Marriott Hotel near Potsdamer Platz in Berlin. The event interconnected PostgreSQL enthusiasts, developers, DBAs, and industry sponsors for two days of fascinating talks across four parallel tracks. It was the biggest event so far, with 347 attendees. The whole conference was very well organized, and therefore special thanks are due to all the organizers—in particular Andreas Scherbaum, the main organizer—for their efforts and hard work.
Our company, credativ GmbH, being independent again, participated as a gold sponsor. credativ’s CTO Alexander Wirt, Head of Sales & Marketing Peter Dreuw, and database team leader Tobias Kauder were available for attendees at the credativ booth. Many thanks to our team colleague Sascha Spettmann for delivering all the stands and billboards to the conference and back again.
In total, we held four talks at the conference. Michael Banck, technical leader of our database team, presented the German-language talk “PostgreSQL Performance Tuning.” He provided a deep and comprehensive overview of the most important performance-tuning parameters in PostgreSQL and explained how they influence the database’s behavior. His talk attracted a large audience and was very well received.
I had an absolutely unique opportunity to present three different talks in the English track. In my regular talk “PostgreSQL Connections Memory Usage: How Much, Why and When,” I presented the results of my research and tests on PostgreSQL connections’ memory usage. After explaining the most important aspects of Linux memory management and measurements of memory usage reported by standard commands, I detailed PostgreSQL connection memory usage during query execution based on numbers reported in smaps files. I intend to publish detailed blog posts about my findings soon. My other talk, “Building a Data Lakehouse with PostgreSQL,” was originally chosen as a reserve talk in case of a last-minute cancellation. Unfortunately, this indeed happened: the talk “Creating a Board Game Chatbot with Postgres, AI, and RAG” by Matt Cornillon had to be replaced. The speaker could not attend because his flight was unexpectedly canceled at very short notice.
In the sponsor track, credativ CTO Alexander Wirt and I presented a talk titled “Your Data Deserves the Best: Migration to PostgreSQL.” It featured our new migration tool, “credativ-pg-migrator.” It is capable of migrating data models (tables, data, indexes, constraints, and views) from Informix, IBM DB2 LUW, MS SQL Server, Sybase ASE, SQL Anywhere, and MySQL/MariaDB. In the case of Informix, it can also convert stored procedures, functions, and triggers into PostgreSQL PL/pgSQL. We will share more details about this tool in a separate blog post.
Since there were always three or four parallel talks, I had to carefully choose which sessions to attend. I greatly enjoyed the talk “András in Windowsland – a DBA’s (mis)adventures” by András Váczi from Cybertec. The speaker presented many useful tips for accessing and troubleshooting PostgreSQL on Windows. I also enjoyed the German talk “Modern VACUUM,” delivered by Christoph Berg from Cybertec. He provided valuable insights into the history and implementation details of the VACUUM command and autovacuum background processes. Another very interesting talk was the German presentation “Modernes SSL ideal einsetzen” by Peter Eisentraut from EDB. The talk covered selecting appropriate protocol versions and cryptographic cipher suites, managing keys and certificates, and configuring client/server settings to meet contemporary security standards. The talk “Comparing the Oracle and PostgreSQL transaction systems,” delivered by Laurenz Albe from Cybertec, received a lot of well-deserved attention. Key topics included Oracle’s undo/redo mechanism versus PostgreSQL’s MVCC approach, differences in isolation level defaults and anomalies, and how each database implements constraints and transactional DDL. Last but not least, I want to mention the talk “What is an SLRU anyway?” delivered by major PostgreSQL contributor Álvaro Herrera. He explained that SLRUs are essentially circular logs with an in-memory cache used for tracking information such as committed transactions or snapshot data and he highlighted the significance of PostgreSQL 17’s innovations which made SLRU cache sizes configurable. Unfortunately, the talks were not recorded, but slides for the majority of the talks are already available on the conference website.
The whole event was highly informative and provided excellent networking opportunities. We are very much looking forward to participating in the next PGConf.DE. In the meantime, stay tuned for all credativ news and follow us on social media – LinkedIn and Mastodon.
If you are interested in our PostgreSQL related services, click here!
The Prague PostgreSQL Developer Day (P2D2) is a well-established Czech PostgreSQL conference. This year’s 17th edition was exceptional, with 275 registered visitors and 16 talks across two tracks. Notably, several major PostgreSQL contributors and core members were present, underlining the event’s significance. Tomas Vondra, as in previous years, organized the conference. Bruce Momjian, Vice President and Postgres Evangelist at EDB, led a half-day workshop titled “Postgres & AI: From the Trenches to the Sky.” Robert Haas delivered a really interesting talk on Incremental Backup in PostgreSQL 17, a feature he developed.
I had the fantastic opportunity to contribute to this conference with my new talk, “Building a Data Lakehouse with PostgreSQL: Dive into Formats, Tools, Techniques, and Strategies.” Given the still evolving nature of this topic and the varying definitions of data lakehouses, I covered the most important areas, highlighting the functionalities and extensions PostgreSQL offers. I received very positive feedback on my talk and had several insightful discussions about the topic with different people.
Among the talks I attended, I particularly enjoyed Robert Haas’s presentation on Incremental Backups, the practical demonstration of PostgreSQL statistics in the talk “Statistics: How PostgreSQL Counts Without Counting” by Sadeq Dousti, and the very interesting presentation “Anatomy of Table-Level Locks in PostgreSQL” by Gülçin Yıldırım Jelínek. She explained in detail the hierarchy of different locking levels and the events leading to delays in operations due to locks. Other notable talks included “Replicating Schema Changes with PostgreSQL” by Esther Miñano and “When Autovacuum Met FinOps: A Cloud Romance” by Mayuresh Bagayatkar. You can find summaries of all talks and soon also the slides on the P2D2 website. I want to express my gratitude to all the organizers of this amazing event, especially Tomas Vondra and Gülçin Yıldırım Jelínek, who both do a great job in organizing PostgreSQL events not only in the Czech Republic.
The issue of table and index bloat due to failed inserts on unique constraints is well known and has been discussed in various articles across the internet. However, these discussions sometimes lack a clear, practical example with measurements to illustrate the impact. And despite the familiarity of this issue, we still frequently see this design pattern—or rather, anti-pattern—in real-world applications. Developers often rely on unique constraints to prevent duplicate values from being inserted into tables. While this approach is straightforward, versatile, and generally considered effective, in PostgreSQL, inserts that fail due to unique constraint violations unfortunately always lead to table and index bloat. And on high-traffic systems, this unnecessary bloat can significantly increase disk I/O and the frequency of autovacuum runs. In this article, we aim to highlight this problem once again and provide a straightforward example with measurements to illustrate it. We suggest simple improvement that can help mitigate this issue and reduce autovacuum workload and disk I/O.
Two Approaches to Duplicate Prevention
In PostgreSQL, there are two main ways to prevent duplicate values using unique constraints:
1. Standard Insert Command (INSERT INTO table)
The usual INSERT INTO table command attempts to insert data directly into the table. If the insert would result in a duplicate value, it fails with a “duplicate key value violates unique constraint” error. Since the command does not specify any duplicate checks, PostgreSQL internally immediately inserts the new row and only then begins updating indexes. When it encounters a unique index violation, it triggers the error and deletes the newly added row. The order of index updates is determined by their relation IDs, so the extent of index bloat depends on the order in which indexes were created. With repeated “unique constraint violation” errors, both the table and some indexes accumulate deleted records leading to bloat, and the resulting write operations increase disk I/O without achieving any useful outcome.
2. Conflict-Aware Insert (INSERT INTO table … ON CONFLICT DO NOTHING)
The INSERT INTO table ON CONFLICT DO NOTHING command behaves differently. Since it specifies that a conflict might occur, PostgreSQL first checks for potential duplicates before attempting to insert data. If a duplicate is found, PostgreSQL performs the specified action—in this case, “DO NOTHING”—and no error occurs. This clause was introduced in PostgreSQL 9.5, but some applications either still run on older PostgreSQL versions or retain legacy code when the database is upgraded. As a result, this conflict-handling option is often underutilized.
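As a minimal illustration of the difference, consider a hypothetical one-column table (not part of the test setup below):
CREATE TABLE t (id int PRIMARY KEY);
INSERT INTO t VALUES (1);
INSERT INTO t VALUES (1);
-- ERROR:  duplicate key value violates unique constraint "t_pkey"
-- the aborted insert still leaves a dead heap tuple (and index entry) behind
INSERT INTO t VALUES (1) ON CONFLICT DO NOTHING;
-- INSERT 0 0   (no error, no dead tuple)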
Testing Example
To be able to run the test, we must start PostgreSQL with “autovacuum=off”. Otherwise, with the instance mostly idle, autovacuum would immediately process the bloated objects and we would be unable to capture the statistics. We create a simple test setup with multiple indexes:
CREATE TABLE IF NOT EXISTS test_unique_constraints(
  id serial primary key,
  unique_text_key text,
  unique_integer_key integer,
  some_other_bigint_column bigint,
  some_other_text_column text);
CREATE INDEX test_unique_constraints_some_other_bigint_column_idx ON test_unique_constraints (some_other_bigint_column);
CREATE INDEX test_unique_constraints_some_other_text_column_idx ON test_unique_constraints (some_other_text_column);
CREATE INDEX test_unique_constraints_unique_text_key_unique_integer_key__idx ON test_unique_constraints (unique_text_key, unique_integer_key, some_other_bigint_column);
CREATE UNIQUE INDEX test_unique_constraints_unique_text_key_idx ON test_unique_constraints (unique_text_key);
CREATE UNIQUE INDEX test_unique_constraints_unique_integer_key_idx ON test_unique_constraints (unique_integer_key);
And now we populate this table with unique data:
DO $$
BEGIN
  FOR i IN 1..1000 LOOP
    INSERT INTO test_unique_constraints
      (unique_text_key, unique_integer_key, some_other_bigint_column, some_other_text_column)
    VALUES (i::text, i, i, i::text);
  END LOOP;
END;
$$;
In the second step, we use a simple Python script to connect to the database, attempt to insert conflicting data, and close the session after an error. First, it sends 10,000 INSERT statements that conflict with the “test_unique_constraints_unique_integer_key_idx” index, then another 10,000 INSERTs conflicting with “test_unique_constraints_unique_text_key_idx”. The entire test takes a few dozen seconds, after which we inspect all objects using the “pgstattuple” extension. The following query lists all objects in a single output:
WITH maintable AS (SELECT oid, relname FROM pg_class WHERE relname = 'test_unique_constraints')
SELECT m.oid as relid, m.relname as relation, s.*
  FROM maintable m
  JOIN LATERAL (SELECT * FROM pgstattuple(m.oid)) s ON true
UNION ALL
SELECT i.indexrelid as relid, indexrelid::regclass::text as relation, s.*
  FROM pg_index i
  JOIN LATERAL (SELECT * FROM pgstattuple(i.indexrelid)) s ON true
 WHERE i.indrelid::regclass::text = 'test_unique_constraints'
ORDER BY relid;
Observed Results
After running the whole test several times, we observe the following:
- The main table “test_unique_constraints” always has 1,000 live tuples and 20,000 additional dead records, i.e. approximately 85% of the tuples in the table are dead.
- The primary key index always shows 21,000 tuples, unaware that 20,000 of these records are marked as deleted in the main table.
- The other non-unique indexes show different results in different runs, ranging between 3,000 and 21,000 records. The numbers depend on the distribution of values generated for the underlying columns by the script. We tested both repeated and completely unique values. Repeated values resulted in fewer records in the indexes; completely unique values led to the full count of 21,000 records in these indexes.
- The unique indexes repeatedly showed tuple counts of only between 1,000 and 1,400 in all tests. The unique index on “unique_text_key” always shows some dead tuples in the output. A precise explanation of these numbers would require a deeper inspection of these relations and of the pgstattuple code, which is beyond the scope of this article. But some small amount of bloat is reported here as well.
- The numbers reported by the pgstattuple function raised questions about their accuracy, although the documentation suggests that the numbers should be precise at the tuple level.
- A subsequent manual VACUUM confirms 20,000 dead records in the main table, 54 pages removed from the primary key index, and up to several dozen pages removed from the other indexes – with different numbers in each run, depending on the total count of tuples in these relations, as described above.
- Each failed insert also increments the Transaction ID and thus increases the database’s transaction age.
Here is one example output from the query shown above after a test run that used unique values for all columns. As we can see, the bloat of non-unique indexes due to failed inserts can be significant.
relid | relation | table_len | tuple_count | tuple_len | tuple_percent | dead_tuple_count | dead_tuple_len | dead_tuple_percent | free_space | free_percent
-------+------------------------------------------------------------------+-----------+-------------+-----------+---------------+------------------+----------------+--------------------+------------+--------------
16418 | test_unique_constraints | 1269760 | 1000 | 51893 | 4.09 | 20000 | 1080000 | 85.06 | 5420 | 0.43
16424 | test_unique_constraints_pkey | 491520 | 21000 | 336000 | 68.36 | 0 | 0 | 0 | 51444 | 10.47
16426 | test_unique_constraints_some_other_bigint_column_idx | 581632 | 16396 | 326536 | 56.14 | 0 | 0 | 0 | 168732 | 29.01
16427 | test_unique_constraints_some_other_text_column_idx | 516096 | 16815 | 327176 | 63.39 | 0 | 0 | 0 | 101392 | 19.65
16428 | test_unique_constraints_unique_text_key_unique_integer_key__idx | 1015808 | 21000 | 584088 | 57.5 | 0 | 0 | 0 | 323548 | 31.85
16429 | test_unique_constraints_unique_text_key_idx | 57344 | 1263 | 20208 | 35.24 | 2 | 32 | 0.06 | 15360 | 26.79
16430 | test_unique_constraints_unique_integer_key_idx | 40960 | 1000 | 16000 | 39.06 | 0 | 0 | 0 | 4404 | 10.75
(7 rows)
In a second test, we modify the script to include the ON CONFLICT DO NOTHING clause in the INSERT command and repeat both runs. This time, the inserts do not result in errors; instead, they simply return “INSERT 0 0”, indicating that no records were inserted. Inspecting the Transaction ID after this test shows only a minimal increase, caused by background processes. The attempts to insert conflicting data did not increase the Transaction ID (XID), because PostgreSQL first starts only a virtual transaction to check for conflicts and, since a conflict was found, aborts it without ever assigning a new XID. The “pgstattuple” output confirms that all objects contain only live data, with no dead tuples this time.
Summary
As demonstrated, each failed insert bloats the underlying table and some indexes and increases the Transaction ID, because each failed insert occurs in a separate transaction. Consequently, autovacuum is forced to run more frequently, consuming valuable system resources. Applications still relying solely on plain INSERT commands without an ON CONFLICT clause should therefore consider reviewing this implementation. But as always, the final decision should be based on the specific conditions of each application.
Introduction
Running ANALYZE (either explicitly or via auto-analyze) is very important in order to have up-to-date data statistics for the Postgres query planner. In particular after in-place upgrades via pg_upgrade, ANALYZE needs to be run in order to have any query statistics at all. As ANALYZE samples only part of the blocks in a table, its I/O pattern looks more like random access than sequential reads. Version 14 of Postgres gained the possibility to use prefetching (if available, which is the case on Linux) to tell the operating system kernel which blocks it will look at next. This is controlled via the maintenance_io_concurrency configuration parameter, which is set to 10 by default (contrary to effective_io_concurrency, which is set to 1 by default).
Benchmark
In order to test and demonstrate the changes between versions 13 and 14, we have done some quick benchmarks using the current maintenance releases (13.16 and 14.13) on Debian 12 with packages from https://apt.postgresql.org. Hardware-wise, a ThinkPad T14s Gen 3 with an Intel i7-1280P CPU with 20 cores and 32 GB of RAM was used. The basis is a pgbench database, initialized with a scale factor of 1000:
$ pgbench -i -I dtg -s 1000 -d pgbench
This creates 100 million rows and leads to a database size of around 15 GB. In order to have ANALYZE do a bit more work, we increase default_statistics_target from the default of 100 to the same value as the pgbench scale factor (i.e., 1000). This results in ANALYZE scanning around 20% of all blocks. We then analyze the main pgbench table, pgbench_accounts:
$ vacuumdb -Z -v -d pgbench -t pgbench_accounts
INFO: analyzing "public.pgbench_accounts"
INFO: "pgbench_accounts": scanned 300000 of 1639345 pages,
containing 18300000 live rows and 0 dead rows;
300000 rows in sample, 100000045 estimated total rows
Between runs, the file system page cache is dropped via echo 3 | sudo tee /proc/sys/vm/drop_caches and all runs are repeated three times. The following table lists the run-times (in seconds) of the above vacuumdb command for various settings of maintenance_io_concurrency:
Version | 0 | 1 | 5 | 10 | 20 | 50 | 100 | 500 |
---|---|---|---|---|---|---|---|---|
13 | 19.557 | 21.610 | 19.623 | 21.060 | 21.463 | 20.533 | 20.230 | 20.537 |
14 | 24.707 | 29.840 | 8.740 | 5.777 | 4.067 | 3.353 | 3.007 | 2.763 |
Analysis
Two things are very clear from those numbers: First, the run-times do not change for version 13 – the value of maintenance_io_concurrency has no effect for this version. Second, once prefetching kicks in for version 14 (maintenance_io_concurrency of 5 or more), ANALYZE gets several times faster, up to a factor of 6-7x. The default maintenance_io_concurrency value of 10 is already 3-4x faster, and values larger than 50 show only minor further improvements, at least for this benchmark on this hardware. Also notable is that the run-times when prefetching is turned off (maintenance_io_concurrency=0) or only set to 1 are worse than for version 13, but as the default for maintenance_io_concurrency is 10, this should not affect anybody in practice.
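For completeness, the parameter can be adjusted at run time without a restart, for example like this (a sketch; pick a value appropriate for your storage):
ALTER SYSTEM SET maintenance_io_concurrency = 50;
SELECT pg_reload_conf();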
Conclusion
Enabling prefetching for ANALYZE in version 14 of PostgreSQL has made statistics sampling much faster. The default value of 10 for maintenance_io_concurrency is already quite good, but we advise increasing it to 20-50 (or higher) in case high-performing local NVMe storage is used. In a future quick benchmark, we plan to compare the ANALYZE performance for the major versions since 14. In particular, the upcoming 17 release promises some further improvements to ANALYZE due to the new streaming I/O interface.
On Thursday, 27 June, and Friday, 28 June 2024, I had the amazing opportunity to attend Swiss PGDay 2024. The conference was held at the OST Eastern Switzerland University of Applied Sciences, Campus Rapperswil, which is beautifully situated on the banks of Lake Zurich in a nice, green environment. With approximately 110 attendees, the event had mainly a B2B focus, although not exclusively. Despite the conference being seemingly smaller in scale compared to PostgreSQL events in larger countries, it actually reflected perfectly the scope relevant for Switzerland.
During the conference, I presented my talk “GIN, BTREE_GIN, GIST, BTREE_GIST, HASH & BTREE Indexes on JSONB Data“. The talk summarized the results of my long-term project at NetApp, including newer interesting findings compared to the presentation I gave in Prague at the beginning of June. As far as I could tell, my talk was well received by the audience, and I received very positive feedback.
At the very end on Friday, I also presented a lightning talk, “Can PostgreSQL Have a More Prominent Role in the AI Boom?” (my slides are at the end of this post). In this brief talk, I raised the question of whether it would be possible to implement AI functionality directly in PostgreSQL, including storing embedding models and trained neural networks within the database. Several people in the audience involved with ML/AI reacted positively to this proposal, acknowledging that PostgreSQL could indeed play a more significant role in ML and AI topics.
The conference featured two tracks of presentations, one in English and the other in German, allowing for a diverse range of topics and speakers. I would like to highlight some of them:
- Tomas Vondra presented “The Past and the Future of the Postgres Community“, explaining how work on PostgreSQL changes and fixes is organized in Commitfests and discussing future development ideas within the community.
- Laurenz Albe’s talk, “Sicherheitsattacken auf PostgreSQL“, highlighted several potential attack vectors in PostgreSQL, capturing significant attention with surprising examples.
- Chris Engelbert’s presentation, “PostgreSQL on Kubernetes: Dos and Don’ts“, addressed the main issues related to running PostgreSQL on Kubernetes and discussed solutions, including pros and cons of existing PostgreSQL Kubernetes operators.
- Maurizio De Giorgi and Ismael Posada Trobo discussed “Solving PostgreSQL Connection Scalability Issues: Insights from CERN’s GitLab Service“, detailing the challenges and solutions for scalability in CERN’s vast database environment.
- Dirk Krautschick’s talk, “Warum sich PostgreSQL-Fans auch für Kafka und Debezium interessieren sollten?“, showcased examples of using Debezium connectors and Kafka with PostgreSQL for various use cases, including data migrations.
- Patrick Stählin discussed “Wie wir einen Datenkorruptions-Bug mit der Hilfe der Community gefunden und gefixt haben,” addressing issues with free space map files after migration to PostgreSQL 16.
- Marion Baumgartner’s presentation, “Geodaten-Management mit PostGIS,” provided interesting details about processing geo-data in PostgreSQL using the PostGIS extension.
- Prof. Stefan Keller, one of the main organizers and a professor of Data Engineering at Rapperswil OST University, presented “PostgreSQL: A Reliable and Extensible Multi-Model SQL Database“, discussing the multi-model structure of PostgreSQL amid declining interest in NoSQL solutions.
- Luigi Nardi from DBTune presented “Lessons Learned from Autotuning PostgreSQL“, describing an AI-based performance tuning tool developed by his company.
- Kanhaiya Lal and Belma Canik delved into “Beyond Keywords: AI-powered Text Search with pgvector for PostgreSQL,” exploring the use of the pgvector extension to enhance full-text search capabilities in PostgreSQL.
- Gabriele Bartolini, the creator of the PostgreSQL Kubernetes Operator “CloudNativePG,” discussed the history and capabilities of this operator in his talk, “Unleashing the Power of PostgreSQL in Kubernetes“.
At the end of the first day, all participants were invited to a social event for networking and personal exchange, which was very well organized. I would like to acknowledge the hard work and dedication of all the organizers and thank them for their efforts. Swiss PGDay 2024 was truly a memorable and valuable experience, offering great learning opportunities. I am grateful for the chance to participate and contribute to the conference, and I look forward to future editions of this event. I am also very thankful to NetApp-credativ for making my participation in the conference possible.
Photos by organizers, Gülçin Yıldırım Jelínek and author:
Hashicorp Terraform is a well-known infrastructure automation tool mostly targeting cloud deployments. Instaclustr (a part of NetApp’s CloudOps division and credativ’s parent company) provides a managed service for various data stores, including PostgreSQL. Provisioning managed clusters is possible via the Instaclustr console, a REST API or through the Instaclustr Terraform Provider.
In this first part of a blog series, it is shown how a Postgres cluster can be provisioned using the Instaclustr Terraform Provider, including whitelisting the IP address and finally connecting to it via the psql command-line client.
Initial Setup
The general requirement is having an account for the Instaclustr Managed service. The web console is located at https://console2.instaclustr.com/. Next, a provisioning API key needs to be created if not available already, as explained here.
Terraform providers are usually defined and configured in a file called provider.tf. For the Instaclustr Terraform provider, this means adding it to the list of required providers and setting the API key mentioned above:
terraform {
  required_providers {
    instaclustr = {
      source  = "instaclustr/instaclustr"
      version = ">= 2.0.0, < 3.0.0"
    }
  }
}

variable "ic_username" {
  type = string
}

variable "ic_api_key" {
  type = string
}

provider "instaclustr" {
  terraform_key = "Instaclustr-Terraform ${var.ic_username}:${var.ic_api_key}"
}
Here, ic_username and ic_api_key are defined as variables. They should be set in a terraform.tfvars file in the same directory:
ic_username = "username"
ic_api_key = "0db87a8bd1[...]"
As the final preparatory step, Terraform needs to be initialized, installing the provider:
$ terraform init
Initializing the backend...
Initializing provider plugins...
- Finding instaclustr/instaclustr versions matching ">= 2.0.0, < 3.0.0"...
- Installing instaclustr/instaclustr v2.0.136...
- Installed instaclustr/instaclustr v2.0.136 (self-signed, key ID 58D5F4E6CBB68583)
[...]
Terraform has been successfully initialized!
Defining Resources
Terraform resources define infrastructure objects, in our case a managed PostgreSQL cluster. Customarily, they are defined in a main.tf file, but any other file name can be chosen:
resource "instaclustr_postgresql_cluster_v2" "main" {
name = "username-test1"
postgresql_version = "16.2.0"
private_network_cluster = false
sla_tier = "NON_PRODUCTION"
synchronous_mode_strict = false
data_centre {
name = "AWS_VPC_US_EAST_1"
cloud_provider = "AWS_VPC"
region = "US_EAST_1"
node_size = "PGS-DEV-t4g.small-5"
number_of_nodes = "2"
network = "10.4.0.0/16"
client_to_cluster_encryption = true
intra_data_centre_replication {
replication_mode = "ASYNCHRONOUS"
}
inter_data_centre_replication {
is_primary_data_centre = true
}
}
}
The above defines a 2-node cluster named username-test1 (referred to internally as main by Terraform) in the AWS US_EAST_1 region with PGS-DEV-t4g.small-5 instance sizes (2 vCores, 2 GB RAM, 5 GB data disk) for the nodes. Test/developer instance sizes for the various cloud providers would be:
Cloud Provider | Default Region | Instance Size | Data Disk | RAM | CPU |
---|---|---|---|---|---|
AWS_VPC | US_EAST_1 | PGS-DEV-t4g.small-5 | 5 GB | 2 GB | 2 Cores |
AWS_VPC | US_EAST_1 | PGS-DEV-t4g.medium-30 | 30 GB | 4 GB | 2 Cores |
AZURE_AZ | CENTRAL_US | PGS-DEV-Standard_DS1_v2-5-an | 5 GB | 3.5 GB | 1 Core |
AZURE_AZ | CENTRAL_US | PGS-DEV-Standard_DS1_v2-30-an | 30 GB | 3.5 GB | 1 Core |
GCP | us-west1 | PGS-DEV-n2-standard-2-5 | 5 GB | 8 GB | 2 Cores |
GCP | us-west1 | PGS-DEV-n2-standard-2-30 | 30 GB | 8 GB | 2 Core |
Other instance sizes or regions can be looked up in the console or in the node_size section of the Instaclustr Terraform Provider documentation.
Running Terraform
Before letting Terraform provision the defined resources, it is best practice to run terraform plan. This lets Terraform plan the provisioning as a dry-run and makes it possible to review the expected actions before creating any actual infrastructure:
$ terraform plan
Terraform used the selected providers to generate the following execution plan. Resource actions are
indicated with the following symbols:
+ create
Terraform will perform the following actions:
# instaclustr_postgresql_cluster_v2.main will be created
+ resource "instaclustr_postgresql_cluster_v2" "main" {
+ current_cluster_operation_status = (known after apply)
+ default_user_password = (sensitive value)
+ description = (known after apply)
+ id = (known after apply)
+ name = "username-test1"
+ pci_compliance_mode = (known after apply)
+ postgresql_version = "16.2.0"
+ private_network_cluster = false
+ sla_tier = "NON_PRODUCTION"
+ status = (known after apply)
+ synchronous_mode_strict = false
+ data_centre {
+ client_to_cluster_encryption = true
+ cloud_provider = "AWS_VPC"
+ custom_subject_alternative_names = (known after apply)
+ id = (known after apply)
+ name = "AWS_VPC_US_EAST_1"
+ network = "10.4.0.0/16"
+ node_size = "PGS-DEV-t4g.small-5"
+ number_of_nodes = 2
+ provider_account_name = (known after apply)
+ region = "US_EAST_1"
+ status = (known after apply)
+ intra_data_centre_replication {
+ replication_mode = "ASYNCHRONOUS"
}
}
}
Plan: 1 to add, 0 to change, 0 to destroy.
[...]
When the planned output looks reasonable, it can be applied via terraform apply:
$ terraform apply
Terraform used the selected providers to generate the following execution plan. Resource actions are
indicated with the following symbols:
+ create
[...]
Plan: 1 to add, 0 to change, 0 to destroy.
Do you want to perform these actions?
Terraform will perform the actions described above.
Only 'yes' will be accepted to approve.
Enter a value: yes
instaclustr_postgresql_cluster_v2.main: Creating...
instaclustr_postgresql_cluster_v2.main: Still creating... [10s elapsed]
[...]
instaclustr_postgresql_cluster_v2.main: Creation complete after 5m37s [id=704e1c20-bda6-410c-b95b-8d22ef3f5a04]
Apply complete! Resources: 1 added, 0 changed, 0 destroyed.
That is it! The PostgreSQL cluster is now up and running after barely 5 minutes.
IP Whitelisting
In order to access the PostgreSQL cluster, firewall rules need to be defined for IP whitelisting. In general, any network address block can be defined, but in order to allow access from the host running Terraform, a firewall rule for the local public IP address can be set via a service like icanhazip.com, appending to main.tf:
data "http" "myip" {
url = "https://ipecho.net/plain"
}
resource "instaclustr_cluster_network_firewall_rules_v2" "main" {
cluster_id = resource.instaclustr_postgresql_cluster_v2.main.id
firewall_rule {
network = "${chomp(data.http.myip.response_body)}/32"
type = "POSTGRESQL"
}
}
The usage of the http Terraform module also needs an update to the providers.tf file, adding it to the list of required providers:
terraform {
  required_providers {
    instaclustr = {
      source  = "instaclustr/instaclustr"
      version = ">= 2.0.0, < 3.0.0"
    }
    http = {
      source  = "hashicorp/http"
      version = "3.4.3"
    }
  }
}
And a subsequent re-run of terraform init, followed by terraform apply:
$ terraform init
Initializing the backend...
Initializing provider plugins...
- Reusing previous version of instaclustr/instaclustr from the dependency lock file
- Finding hashicorp/http versions matching "3.4.3"...
- Using previously-installed instaclustr/instaclustr v2.0.136
- Installing hashicorp/http v3.4.3...
- Installed hashicorp/http v3.4.3 (signed by HashiCorp)
[...]
Terraform has been successfully initialized!
[...]
$ terraform apply
data.http.myip: Reading...
instaclustr_postgresql_cluster_v2.main: Refreshing state... [id=704e1c20-bda6-410c-b95b-8d22ef3f5a04]
data.http.myip: Read complete after 1s [id=https://ipv4.icanhazip.com]
Terraform used the selected providers to generate the following execution plan. Resource actions are
indicated with the following symbols:
+ create
Terraform will perform the following actions:
# instaclustr_cluster_network_firewall_rules_v2.main will be created
+ resource "instaclustr_cluster_network_firewall_rules_v2" "main" {
+ cluster_id = "704e1c20-bda6-410c-b95b-8d22ef3f5a04"
+ id = (known after apply)
+ status = (known after apply)
+ firewall_rule {
+ deferred_reason = (known after apply)
+ id = (known after apply)
+ network = "123.134.145.5/32"
+ type = "POSTGRESQL"
}
}
Plan: 1 to add, 0 to change, 0 to destroy.
Do you want to perform these actions?
Terraform will perform the actions described above.
Only 'yes' will be accepted to approve.
Enter a value: yes
instaclustr_cluster_network_firewall_rules_v2.main: Creating...
instaclustr_cluster_network_firewall_rules_v2.main: Creation complete after 2s [id=704e1c20-bda6-410c-b95b-8d22ef3f5a04]
Apply complete! Resources: 1 added, 0 changed, 0 destroyed.
Connecting to the Cluster
In order to connect to the newly-provisioned Postgres cluster, we need the public IP addresses of the nodes and the password of the administrative database user, icpostgresql. Those are retrieved and stored in the Terraform state by the Instaclustr Terraform provider, by default in a local file terraform.tfstate. To secure the password, one can change the password after initial connection, secure the host Terraform is run from, or store the Terraform state remotely.
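For instance, rotating the default user's password right after the first connection could look like this (a sketch; note that the new password will then differ from the one recorded in the Terraform state and the Instaclustr console):
ALTER USER icpostgresql WITH PASSWORD 'a-new-strong-password';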
The combined connection string can be set up as an output variable in an outputs.tf file:
output "connstr" {
value = format("host=%s user=icpostgresql password=%s dbname=postgres target_session_attrs=read-write",
join(",", [for node in instaclustr_postgresql_cluster_v2.main.data_centre[0].nodes:
format("%s", node.public_address)]),
instaclustr_postgresql_cluster_v2.main.default_user_password
)
sensitive = true
}
After another terraform apply to set the output variable, it is possible to connect to the PostgreSQL cluster without having to type or paste the default password via:
$ psql "$(terraform output -raw connstr)"
psql (16.3 (Debian 16.3-1.pgdg120+1), server 16.2 (Debian 16.2-1.pgdg110+2))
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, compression: off)
Type "help" for help.
postgres=>
Conclusion
In this first part, the provisioning of an Instaclustr Managed PostgreSQL cluster with Terraform was demonstrated. In the next part of this blog series, we plan to present a Terraform module that makes it even easier to provision PostgreSQL clusters. We will also check out which input variables can be set to further customize the managed PostgreSQL cluster.
Instaclustr offers a 30-day free trial for its managed service which allows provisioning clusters with development instance sizes, so you can sign up and try the above yourself today!