You can’t optimize what you can’t measure
Slow queries are the arch-enemy of online business databases, and knowing how to diagnose and locate them is a required skill for any DBA.
This article introduces a general methodology for diagnosing slow queries with the Pigsty monitoring system.
## Slow Queries: The Damage
For a PostgreSQL database that actually serves online transaction processing, slow queries cause damage in several ways:
- Slow queries hog database connections, leaving none for normal queries; requests pile up and the database avalanches.
- Slow queries hold snapshots that still reference old tuple versions already cleaned up on the primary, so replaying those cleanup records blocks the streaming replication apply process on the replica, producing primary-replica replication lag.
- The slower the queries, the more likely they are to trample on each other, producing lock waits, deadlocks, transaction conflicts, and other problems.
- Slow queries waste system resources and push up the system's saturation watermark.
Therefore, a qualified DBA must know how to locate and handle slow queries in a timely manner.
Figure: Before and after slow query optimization, the system’s overall saturation dropped from 40% to 4%
## Slow Query Diagnosis: Traditional Methods
Traditionally, PostgreSQL offers two ways to obtain information about slow queries: the official extension `pg_stat_statements`, and the slow query log.

The slow query log, as the name suggests, records into the PostgreSQL log every query whose execution time exceeds the `log_min_duration_statement` parameter. It is indispensable for locating slow queries, especially for analyzing corner cases and individual slow statements. However, the slow query log has its limitations: in production, for performance reasons, usually only queries above a certain threshold are logged, so a lot of information never makes it into the log. That said, although the overhead is large, full statement logging remains the ultimate weapon for slow query analysis.
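As a minimal sketch (the 100 ms threshold below is purely illustrative, not a recommendation), the logging threshold can be adjusted online:

```sql
-- Log every statement slower than 100 ms (example threshold)
ALTER SYSTEM SET log_min_duration_statement = '100ms';
SELECT pg_reload_conf();  -- takes effect without a restart
-- Setting it to 0 logs every statement: the "full query log", at considerable overhead
```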
The more commonly used slow query diagnostic tool is probably `pg_stat_statements`. This very practical extension collects statistics about the queries running in the database, and enabling it is strongly recommended in virtually every deployment.

`pg_stat_statements` exposes its raw metrics as a system view. Each query class in the system (i.e., queries that are structurally identical once constants are replaced with placeholders) is assigned a query ID, along with metrics such as call count, total time, minimum, maximum, and mean execution time, the standard deviation of response time, the average number of rows returned per call, and time spent on block I/O.
A simple approach is to look at metrics like `mean_time` / `max_time`. From the system catalog you can indeed learn the historical average response time of a given query class, and for locating slow queries this may be basically sufficient. However, such figures are just a static snapshot of the system at the current moment, so the questions they can answer are limited. For example, if you want to see whether a query's performance improved after adding a new index, this approach quickly becomes cumbersome.
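For example, a quick look at the slowest query classes straight from the view might look like this (column names are from PostgreSQL 12 and earlier; on 13+ they are `mean_exec_time` / `max_exec_time`):

```sql
-- Top 10 query classes by mean execution time (PostgreSQL <= 12 column names)
SELECT queryid, calls, mean_time, max_time, rows, left(query, 60) AS query
FROM pg_stat_statements
ORDER BY mean_time DESC
LIMIT 10;
```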
`pg_stat_statements` must be listed in `shared_preload_libraries` and explicitly created in the database via `CREATE EXTENSION pg_stat_statements`. Once the extension is created, query statistics can be read from the `pg_stat_statements` view.
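A minimal setup sketch (the preload setting only takes effect after a server restart):

```sql
-- Requires a restart; append to any existing preload list rather than overwriting it
ALTER SYSTEM SET shared_preload_libraries = 'pg_stat_statements';
-- ... after restarting the server, enable the extension in the target database:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
```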
## Definition of Slow Queries
How slow does a query have to be to be considered a slow query?
This depends on the business and the actual query type; there is no universal standard.
As a rule of thumb, high-frequency CRUD point queries that take more than 1 ms can be treated as slow queries.
For occasional one-off or analytical queries, anything beyond 100 ms or 1 s usually qualifies.
## Slow Query Diagnosis: Pigsty
A monitoring system can answer questions about slow queries much more comprehensively. Its data consists of countless historical snapshots (for example, one sample every 5 seconds), so users can go back to any point in time and examine how a query's average response time changed across different periods.
The figure above shows the PG Query Detail panel in Pigsty, which displays detailed information about a single query class.
This is a typical slow query with an average response time of several seconds. After an index was added for it, the Query RT panel in the upper right shows the query's average response time dropping from several seconds to several milliseconds.
Users can use the insights provided by the monitoring system to quickly locate slow queries in the database, identify problems, and propose hypotheses. More importantly, users can immediately examine detailed metrics of tables and queries at different levels, apply solutions, and get real-time feedback, which is very helpful for emergency troubleshooting.
Sometimes the purpose of a monitoring system is not just to provide data and feedback; it can also act as a sedative for everyone's nerves: imagine a slow query causing a production database avalanche. If the boss or the customer has no transparent way to see the current state of handling, they will inevitably keep anxiously asking for updates, which further slows down the fix. A monitoring system also serves as a basis for precise accountability: changes in the monitoring metrics let you brag to bosses and customers with hard evidence.
## A Simulated Slow Query Case
Talk is cheap, show me the code
Assuming you already have a Pigsty sandbox demo environment, the following uses that sandbox to demonstrate the process of locating and handling a simulated slow query.
## Slow Queries: Simulation
Since there is no actual business system here, we simulate slow queries in a simple and quick way, using the `tpc-b`-like scenario that ships with `pgbench`.

Via `make ri / make ro / make rw`, initialize the pgbench use case on the `pg-test` cluster and apply read-write load to the cluster:
```bash
# 50 TPS write load
while true; do pgbench -nv -P1 -c20 --rate=50 -T10 postgres://test:test@pg-test:5433/test; done

# 1000 TPS read-only load
while true; do pgbench -nv -P1 -c40 --select-only --rate=1000 -T10 postgres://test:test@pg-test:5434/test; done
```
Now we have a simulated running business system. Let's create a slow query scenario with a simple and brutal method: execute the following command on the primary of the `pg-test` cluster to drop the primary key of the `pgbench_accounts` table:
```sql
ALTER TABLE pgbench_accounts DROP CONSTRAINT pgbench_accounts_pkey;
```
This command removes the primary key index on the `pgbench_accounts` table, so the related queries switch from index scans to sequential full-table scans and all become slow queries. Open PG Instance ➡️ Query ➡️ QPS; the result is shown in the figure below:
Figure 1: Average query response time soared from 1ms to 300ms, and QPS of a single replica instance dropped from 500 to 7.
At the same time, because slow queries pile up, the system instantly becomes overloaded and avalanches. Open the PG Cluster homepage to see the cluster load spike.
Figure 2: System load reached 200%, triggering machine overload and query response time too long alert rules.
## Slow Queries: Location
First, use the PG Cluster panel to locate the specific instance hosting the slow query; here we take `pg-test-2` as the example.

Then, use the PG Query panel to locate the specific slow query: number `-6041100154778468427`.
Figure 3: Discover abnormal slow queries from the query overview
This query shows:
- Response time significantly increased: from 17us to 280ms
- QPS significantly decreased: from 500 to 7
- The share of total query time spent on this query increased significantly
It can be determined that this query has become slow!
Next, use the PG Stat Statements panel or PG Query Detail to locate the specific statement of the slow query based on the query ID.
Figure 4: The query statement is located: `SELECT abalance FROM pgbench_accounts WHERE aid = $1`
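If you prefer to confirm it straight from the database, a lookup by query ID might look like this (again, `mean_time` becomes `mean_exec_time` on PostgreSQL 13+):

```sql
-- Look up the statement text and stats for the suspicious query ID
SELECT queryid, calls, mean_time, max_time, query
FROM pg_stat_statements
WHERE queryid = -6041100154778468427;
```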
## Slow Queries: Hypothesis
After knowing the slow query statement, the next step is to infer the cause of the slow query.
```sql
SELECT abalance FROM pgbench_accounts WHERE aid = $1;
```
This query uses `aid` as the filter condition on the `pgbench_accounts` table. For such a simple query to become slow, the most likely culprit is an index problem on the table. (Of course, anyone with a brain already knows the index is missing, because we dropped it ourselves!)
After analyzing the query, we can form a hypothesis: the query became slow because the `aid` column of the `pgbench_accounts` table lacks an index.
The next step is to verify the hypothesis.
First, using PG Table Catalog, we can examine table details, such as indexes built on the table.
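If you would rather check from psql, the standard catalog view gives the same answer:

```sql
-- List the indexes currently defined on pgbench_accounts
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'pgbench_accounts';
```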
Second, consult the PG Table Detail panel to check how the `pgbench_accounts` table is being accessed, and verify our hypothesis.
Figure 5: Access pattern of the `pgbench_accounts` table
Through observation, we found that index scans on the table dropped to zero, while sequential scans increased correspondingly. This confirms our hypothesis!
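The same signal is also available from the statistics collector, if you want to double-check without the dashboard:

```sql
-- Sequential vs. index scan counters for the table
SELECT relname, seq_scan, seq_tup_read, idx_scan, idx_tup_fetch
FROM pg_stat_user_tables
WHERE relname = 'pgbench_accounts';
```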
## Slow Queries: Solution
Once the hypothesis is confirmed, we can start working on a solution to the problem.
There are usually three ways to resolve a slow query: modify the table structure, modify the query, or modify the indexes.
Modifying the table structure or the query usually requires specific business and domain knowledge and must be analyzed case by case, whereas modifying indexes usually does not require much business-specific knowledge.
The problem here can be solved by adding an index: the `aid` column of the `pgbench_accounts` table lacks one, so let's try adding an index on `pgbench_accounts(aid)` and see whether that solves the problem.
```sql
CREATE UNIQUE INDEX ON pgbench_accounts (aid);
```
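On a real production system you would typically build the index without blocking writes; note that this concurrent form cannot run inside a transaction block:

```sql
-- Non-blocking variant for busy production tables
CREATE UNIQUE INDEX CONCURRENTLY ON pgbench_accounts (aid);
```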
After adding the index, something magical happened.
Figure 6: You can see that the query’s response time and QPS have returned to normal.
Figure 7: The system load has also returned to normal
## Slow Queries: Evaluation
As the final step in slow query handling, we usually need to record the operation process and evaluate the results.
Sometimes a simple optimization can produce dramatic effects. Maybe a problem that originally required spending hundreds of thousands on additional machines was solved by creating an index.
This kind of story can be expressed in a vivid and intuitive way through monitoring systems, earning KPIs and credit.
Figure: Before and after slow query optimization, the system’s overall saturation dropped from 40% to 4%
(Equivalent to saving X machines, XX million yuan, the boss was delighted, and you’re the next CTO!)
## Slow Queries: Summary
Through this tutorial, you have mastered the general methodology for slow query optimization:
1. Locate the problem
2. Propose a hypothesis
3. Verify the hypothesis
4. Develop a solution
5. Evaluate the effect
Monitoring systems can play an important role throughout the entire lifecycle of slow query handling. They can also express DBAs’ “experience” and “achievements” in a visualized, quantifiable, and replicable way.