GitLab PostgreSQL Data Recovery

September 1st, 2021 by Philip Iezzi 5 min read

Today, shit happened on a larger on-premise GitLab EE instance of one of our Onlime GmbH customers. GitLab's production.log started to fill up with PG::Error (FATAL: the database system is in recovery mode) errors that were somehow related to LFS operations. That definitely didn't sound cool and smelled like data corruption. The customer noticed it through failing CI jobs returning 500 Internal Server Errors and let me know immediately.

As we have that GitLab server running in an LXC container on a ZFS-based system (Proxmox VE), it was easy to pull a clone of the full system and play around with PostgreSQL data recovery before working on live data. I decided to go for a full data restore by dumping and loading it from scratch in a freshly initialized PostgreSQL data dir.
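
For reference, pulling such a clone on a ZFS-backed Proxmox host boils down to snapshotting and cloning the container's dataset. A minimal sketch, assuming a hypothetical dataset name rpool/data/subvol-101-disk-0 and a spare container ID 999 (your names will differ):

# take a read-only snapshot of the running container's rootfs
$ zfs snapshot rpool/data/subvol-101-disk-0@pre-recovery

# clone it into a writable dataset for a throwaway container
$ zfs clone rpool/data/subvol-101-disk-0@pre-recovery rpool/data/subvol-999-disk-0

Attach that clone to a separate, network-isolated container and you can rehearse the whole recovery without touching production.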

So these were the FATAL errors I found in /var/log/gitlab/gitlab-rails/production.log (this is an Omnibus install of GitLab EE):

Processing by Repositories::LfsStorageController#download as HTML
  Parameters: {"repository_path"=>"group/project.git", "oid"=>"38387ae38..."}
PG::Error (FATAL:  the database system is in recovery mode
):

Started GET "/group/project.git/gitlab-lfs/objects/372ea928b2..." for 192.168.x.x at 2021-09-01 09:19:31 +0200
ActiveRecord::ConnectionNotEstablished (FATAL:  the database system is in recovery mode
):

The errors were always preceded by some LFS operation, so that looked correlated.

UPDATE 2021-09-01 23:50 – In the meantime (after the full recovery below - shame on me, ignorant sysadmin!), I figured out this was actually caused by an OOM issue in that LXC container, where random PostgreSQL processes were killed by the system's oom-killer...

# /var/log/gitlab/postgresql/current
2021-09-01_07:19:31.05398 LOG:  server process (PID 10068) was terminated by signal 9: Killed

# /var/log/syslog
Sep  1 09:19:31 git kernel: [5068062.072130]  oom_kill_process.cold.33+0xb/0x10
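
If you run into the same pattern, it's worth confirming the OOM kills and checking the container's memory limit before blaming PostgreSQL. A quick sketch (the container ID 101 is hypothetical):

# inside the GitLab container: look for oom-killer activity around the crash time
$ dmesg -T | grep -iE 'oom|killed process'
$ grep -i oom /var/log/syslog

# on the Proxmox host: check (and, if needed, raise) the container's memory limit
$ pct config 101 | grep -E '^(memory|swap):'
$ pct set 101 --memory 8192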

But I'm keeping this article anyway to explain how to do a full PostgreSQL data recovery, even though it turned out we never had any data corruption. :)

Check PostgreSQL Tools

First, check your PostgreSQL version and figure out where the GitLab Omnibus install put all those commands:

$ gitlab-psql --version
psql (PostgreSQL) 12.6

$ gitlab-psql -c 'select version()'
PostgreSQL 12.6 on x86_64-pc-linux-gnu, ...

Check database sizes:

$ gitlab-psql -c 'SELECT pg_database.datname as "dbname", pg_database_size(pg_database.datname)/1024/1024 AS size_in_mb FROM pg_database ORDER by size_in_mb DESC'
       dbname        | size_in_mb 
---------------------+------------
 gitlabhq_production |      29151
 template0           |          7
 template1           |          7
 postgres            |          7
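
Since the symptoms smelled like corruption, you can also let PostgreSQL's bundled amcheck extension verify the B-tree indexes before committing to a full dump/restore. A sketch, assuming the extension ships with the Omnibus PostgreSQL build and that gitlab-psql connects with superuser rights (both true on a default Omnibus install, as far as I know):

$ gitlab-psql -d gitlabhq_production <<'SQL'
CREATE EXTENSION IF NOT EXISTS amcheck;
-- check all valid B-tree indexes; corruption shows up as an ERROR
SELECT bt_index_check(index => c.oid)
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
JOIN pg_am am   ON am.oid = c.relam
WHERE am.amname = 'btree' AND i.indisvalid;
SQL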

You'll find the following symlink:

/usr/bin/gitlab-psql -> /opt/gitlab/bin/gitlab-psql

The other commands are here:

/opt/gitlab/embedded/bin/pg_dump
/opt/gitlab/embedded/bin/pg_dumpall

But DON'T try to do a dump as root; always do it as user gitlab-psql:

$ su - gitlab-psql

(gitlab-psql)$ echo $PWD           
/var/opt/gitlab/postgresql
(gitlab-psql)$ which pg_dumpall
/opt/gitlab/embedded/bin/pg_dumpall

# connect over socket
(gitlab-psql)$ pg_dumpall -h /var/opt/gitlab/postgresql
# or connect over port
(gitlab-psql)$ pg_dumpall -h localhost -p 9187

If you're unsure about the port your PostgreSQL is running on, check netstat -tulpen (no need to visit the Netherlands for that!) and grab it from there. If you don't find any running postgres process bound to a port, connect to the socket directly with -h /var/opt/gitlab/postgresql.
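
Concretely, something along these lines (the socket path is the usual Omnibus location, but double-check on your system):

# is PostgreSQL listening on a TCP port at all?
$ netstat -tulpen | grep postgres

# if not, it's socket-only; the Omnibus socket usually lives here
$ ls -l /var/opt/gitlab/postgresql/.s.PGSQL.*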

For the recovery below, consult the PostgreSQL 12 documentation on SQL dumps (pg_dump / pg_dumpall).

Data Recovery Migration

The Idea

  1. Stop all GitLab services, only keep PostgreSQL running
  2. Make a full data dump with pg_dumpall
  3. Stop PostgreSQL
  4. Move current PostgreSQL data dir away, so we can start from scratch
  5. Re-initialize PostgreSQL data dir with initdb
  6. Start PostgreSQL
  7. Restore full data dump by feeding it to gitlab-psql
  8. Stop all GitLab services, reconfigure GitLab and restart all services

Dump / Restore

Stop all GitLab services and only keep PostgreSQL running:

$ gitlab-ctl stop
$ gitlab-ctl start postgresql
$ gitlab-ctl status

# You might also want to stop cron, so it won't interfere with your migration
$ systemctl stop cron && systemctl disable cron
# Don't forget to start and re-enable it again after migration!

To repeat (I found this quite confusing as PostgreSQL does not always bind to a port):

After starting postgresql, always use netstat -tulpen to check which port it is listening on. If it's not listening on any TCP port, connect to the socket with -h /var/opt/gitlab/postgresql instead of -h localhost -p <PORT>, see examples below.

Now, make a full gzipped dump with pg_dumpall:

$ su - gitlab-psql
(gitlab-psql)$ pg_dumpall -h /var/opt/gitlab/postgresql | gzip > pg_dumpall.sql.gz
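
Before touching the data dir, it doesn't hurt to sanity-check the dump, e.g.:

# verify gzip integrity and peek at the end of the dump --
# it should end with pg_dump's trailing "database dump complete" comment,
# not a truncated COPY block
(gitlab-psql)$ gzip -t pg_dumpall.sql.gz && echo "gzip OK"
(gitlab-psql)$ zcat pg_dumpall.sql.gz | tail -n 5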

Re-create the PostgreSQL data dir /var/opt/gitlab/postgresql/data from scratch with initdb:

$ gitlab-ctl stop postgresql

$ su - gitlab-psql
(gitlab-psql)$ mv data data.12.6.BKUP
(gitlab-psql)$ initdb data

$ gitlab-ctl start postgresql

(The $PWD of the gitlab-psql user is already /var/opt/gitlab/postgresql, so you don't need to use full paths in that context.)
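
At this point the cluster is freshly initialized and empty. You can verify it's reachable before restoring, using the same connection options as the restore command below:

# only postgres, template0 and template1 should exist in the fresh cluster
$ gitlab-psql -h localhost -p 5432 -d postgres -c '\l'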

Restore dump:

$ su - gitlab-psql
(gitlab-psql)$ zcat pg_dumpall.sql.gz | gitlab-psql -h localhost -p 5432 -d postgres

NOTE:

From the PostgreSQL docs (SQL Dump / Using pg_dumpall): "Actually, you can specify any existing database name to start from, but if you are loading into an empty cluster then postgres should usually be used."
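
Once the restore has gone through, re-run the size query from above against the new cluster and compare it to the numbers you noted earlier:

# gitlabhq_production should be back at roughly its previous size (~29 GB here)
$ gitlab-psql -h localhost -p 5432 -c 'SELECT pg_database.datname as "dbname", pg_database_size(pg_database.datname)/1024/1024 AS size_in_mb FROM pg_database ORDER by size_in_mb DESC'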

Stop postgresql and reconfigure GitLab, then start all GitLab services:

$ gitlab-ctl stop postgresql
$ gitlab-ctl reconfigure
$ gitlab-ctl start

Monitor logs:

$ tail -f /var/log/gitlab/gitlab-rails/production.log /var/log/gitlab/postgresql/current
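
Optionally, let GitLab check itself, and don't forget the cron reminder from above:

# GitLab's built-in sanity check (SANITIZE hides project names in the output)
$ gitlab-rake gitlab:check SANITIZE=true

# re-enable and start cron again (we disabled it before the migration)
$ systemctl enable cron && systemctl start cron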

All good!