GitLab PostgreSQL Data Recovery
Today, shit happened on a larger on-premise GitLab EE instance of one of our Onlime GmbH customers. GitLab's production.log started to fill up with PG::Error (FATAL: the database system is in recovery mode) errors which were somehow related to LFS operations. That definitely didn't sound cool and smelled like data corruption. The customer noticed it through failed CI jobs returning 500 Internal Server Errors and let me know immediately.
As we have that GitLab server running in an LXC container on a ZFS-based system (Proxmox VE), it was easy to pull a clone of the full system and play around with PostgreSQL data recovery before working on live data. I decided to go for a full data restore by dumping and loading it from scratch into a freshly initialized PostgreSQL data dir.
So these were the FATAL errors I found in /var/log/gitlab/gitlab-rails/production.log (this is an Omnibus install of GitLab EE):
Processing by Repositories::LfsStorageController#download as HTML
Parameters: {"repository_path"=>"group/project.git", "oid"=>"38387ae38..."}
PG::Error (FATAL: the database system is in recovery mode
):
Started GET "/group/project.git/gitlab-lfs/objects/372ea928b2..." for 192.168.x.x at 2021-09-01 09:19:31 +0200
ActiveRecord::ConnectionNotEstablished (FATAL: the database system is in recovery mode
):
The errors were always preceded by some LFS operation, so the two looked correlated.
UPDATE 2021-09-01 23:50 – In the meantime (after the full recovery below; shame on me, ignorant sysadmin!), I have figured out this was actually caused by an OOM issue in that LXC container, where random PostgreSQL processes were killed by the system's oom-killer:
# /var/log/gitlab/postgresql/current
2021-09-01_07:19:31.05398 LOG: server process (PID 10068) was terminated by signal 9: Killed

# /var/log/syslog
Sep 1 09:19:31 git kernel: [5068062.072130] oom_kill_process.cold.33+0xb/0x10
But I still keep this article to explain how to do a full PostgreSQL data recovery, even though we never had any data corruption. :)
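If you want to rule the oom-killer in or out quickly, a grep over the PostgreSQL log for signal-9 terminations is enough. A minimal sketch, run against a faked inline sample here; on a real Omnibus install, point LOG at /var/log/gitlab/postgresql/current instead:

```shell
# Count backend processes killed with signal 9 (the oom-killer's signature).
# The two log lines below are a faked sample for illustration.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
2021-09-01_07:19:31.05398 LOG: server process (PID 10068) was terminated by signal 9: Killed
2021-09-01_07:19:31.05401 LOG: terminating any other active server processes
EOF
grep -c 'terminated by signal 9' "$LOG"   # prints 1 for this sample
rm -f "$LOG"
```

A count greater than zero means PostgreSQL backends are being killed from outside, and "recovery mode" is just the database's reaction to that, not corruption.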
Check PostgreSQL Tools
First, check your PostgreSQL version and figure out where GitLab Omnibus install put all those commands:
$ gitlab-psql --version
psql (PostgreSQL) 12.6
$ gitlab-psql -c 'select version()'
PostgreSQL 12.6 on x86_64-pc-linux-gnu, ...
Check database sizes:
$ gitlab-psql -c 'SELECT pg_database.datname as "dbname", pg_database_size(pg_database.datname)/1024/1024 AS size_in_mb FROM pg_database ORDER by size_in_mb DESC'
dbname | size_in_mb
---------------------+------------
gitlabhq_production | 29151
template0 | 7
template1 | 7
postgres | 7
You'll find the following symlink:
/usr/bin/gitlab-psql -> /opt/gitlab/bin/gitlab-psql
The other commands are here:
/opt/gitlab/embedded/bin/pg_dump
/opt/gitlab/embedded/bin/pg_dumpall
But DON'T try to do a dump as root, always do it under user gitlab-psql:
$ su - gitlab-psql
(gitlab-psql)$ echo $PWD
/var/opt/gitlab/postgresql
(gitlab-psql)$ which pg_dumpall
/opt/gitlab/embedded/bin/pg_dumpall
# connect over socket
(gitlab-psql)$ pg_dumpall -h /var/opt/gitlab/postgresql
# or connect over port
(gitlab-psql)$ pg_dumpall -h localhost -p 9187
If you're unsure about the port your PostgreSQL is running on, check netstat -tulpen (no need to visit the Netherlands for that!) and grab it from there. If you don't find any running postgres process that binds to a port, try to connect to the socket directly with -h /var/opt/gitlab/postgresql.
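That decision can be scripted, too. A sketch that derives the right connection flags from a netstat-style listener line; the line is faked below, on a real host you'd feed in the output of netstat -tulpen filtered for postgres:

```shell
# Sketch: pick TCP flags if a listener exists, else fall back to the socket.
# A faked netstat -tulpen line stands in for real output here.
line='tcp 0 0 127.0.0.1:5432 0.0.0.0:* LISTEN 996 123456 10068/postgres'
port=$(echo "$line" | awk '{print $4}' | awk -F: '{print $NF}')
if [ -n "$port" ]; then
  echo "-h localhost -p $port"          # TCP listener found
else
  echo "-h /var/opt/gitlab/postgresql"  # no port, use the Unix socket dir
fi
```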
For the recovery below, consult the PostgreSQL 12 documentation, in particular the SQL Dump chapter on using pg_dumpall.
Data Recovery Migration
The Idea
- Stop all GitLab services, only keep PostgreSQL running
- Make a full data dump with pg_dumpall
- Stop PostgreSQL
- Move the current PostgreSQL data dir away, so we can start from scratch
- Re-initialize the PostgreSQL data dir with initdb
- Start PostgreSQL
- Restore the full data dump by feeding it to gitlab-psql
- Stop all GitLab services, reconfigure GitLab, and restart all services
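The whole plan can be sketched as a single script. This is just an outline of the steps above, not a battle-tested tool; it defaults to a dry run that only prints the commands, so you can review the plan before setting DRY_RUN=0:

```shell
#!/bin/sh
# Dry-run sketch of the migration plan. DRY_RUN=0 would execute for real;
# the default of 1 only prints each command.
set -eu
DRY_RUN=${DRY_RUN:-1}
run() { echo "+ $*"; [ "$DRY_RUN" = "1" ] || eval "$*"; }

run "gitlab-ctl stop"
run "gitlab-ctl start postgresql"
run "su - gitlab-psql -c 'pg_dumpall -h /var/opt/gitlab/postgresql | gzip > pg_dumpall.sql.gz'"
run "gitlab-ctl stop postgresql"
run "su - gitlab-psql -c 'mv data data.12.6.BKUP && initdb data'"
run "gitlab-ctl start postgresql"
run "su - gitlab-psql -c 'zcat pg_dumpall.sql.gz | gitlab-psql -h localhost -p 5432 -d postgres'"
run "gitlab-ctl stop postgresql"
run "gitlab-ctl reconfigure"
run "gitlab-ctl start"
```

Each step is explained in detail in the Dump / Restore section below; I ran them by hand, one at a time, checking the result after each.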
Dump / Restore
Stop all GitLab services and only keep PostgreSQL running:
$ gitlab-ctl stop
$ gitlab-ctl start postgresql
$ gitlab-ctl status
# You might also want to stop cron, so it won't interfere with your migration
$ systemctl stop cron && systemctl disable cron
# Don't forget to start and re-enable it again after migration!
To repeat (I found this quite confusing, as PostgreSQL does not always bind to a port): after starting postgresql, always use netstat -tulpen to check which port it is running on. If it's not running on a port, connect to the socket with -h /var/opt/gitlab/postgresql instead of -h localhost -p <PORT>, see examples below.
Now, make a full gzipped dump with pg_dumpall:
$ su - gitlab-psql
(gitlab-psql)$ pg_dumpall -h /var/opt/gitlab/postgresql | gzip > pg_dumpall.sql.gz
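Before touching the data dir, it's worth sanity-checking that the dump is complete: the file should pass a gzip integrity test, and pg_dumpall ends its output with a "database cluster dump complete" comment. A sketch, demonstrated on a stand-in file here; run the same two checks against your real pg_dumpall.sql.gz:

```shell
# Sketch: verify the gzipped dump before wiping anything.
# sample-dump.sql.gz is a faked stand-in; skip the printf on a real host
# and point the checks at pg_dumpall.sql.gz instead.
printf '%s\n' '--' '-- PostgreSQL database cluster dump complete' '--' | gzip > sample-dump.sql.gz
gzip -t sample-dump.sql.gz && echo "gzip integrity OK"
zcat sample-dump.sql.gz | tail -n 3 | grep -q 'cluster dump complete' && echo "dump footer OK"
rm -f sample-dump.sql.gz
```

A truncated dump (e.g. if PostgreSQL crashed mid-dump) would fail one of these checks, and you'd much rather find that out now than after re-initializing the data dir.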
Re-create the PostgreSQL data dir /var/opt/gitlab/postgresql/data from scratch with initdb:
$ gitlab-ctl stop postgresql
$ su - gitlab-psql
(gitlab-psql)$ mv data data.12.6.BKUP
(gitlab-psql)$ initdb data
$ gitlab-ctl start postgresql
($PWD of the gitlab-psql user is already /var/opt/gitlab/postgresql, so you don't need to use full paths in that context)
Restore dump:
$ su - gitlab-psql
(gitlab-psql)$ zcat pg_dumpall.sql.gz | gitlab-psql -h localhost -p 5432 -d postgres
NOTE from the PostgreSQL docs (SQL Dump: Using pg_dumpall): Actually, you can specify any existing database name to start from, but if you are loading into an empty cluster then postgres should usually be used.
Stop postgresql and reconfigure GitLab, then start all GitLab services:
$ gitlab-ctl stop postgresql
$ gitlab-ctl reconfigure
$ gitlab-ctl start
Monitor logs:
$ tail -f /var/log/gitlab/gitlab-rails/production.log /var/log/gitlab/postgresql/current
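Besides eyeballing the tail, a quick grep for the original FATAL message confirms the fix. A sketch against a faked healthy sample; on the real host, point LOG at production.log:

```shell
# Sketch: check the freshest log lines for the recovery-mode error.
# The healthy sample below is faked for illustration.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
Started GET "/group/project.git/info/refs" for 192.168.x.x
Completed 200 OK in 45ms
EOF
if tail -n 100 "$LOG" | grep -q 'database system is in recovery mode'; then
  echo "still failing"
else
  echo "no recovery-mode errors"
fi
rm -f "$LOG"
```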
All good!