
Restoration failed with PANIC: could not locate a valid checkpoint record

Roy
Adventurer


Hello Team,

Last week, I was working on a Backup Recovery Test of an EDB PPAS 9.5 cluster. Usually, I perform the recovery by:

1. stopping the running cluster
2. renaming the same
3. restoring the backups to a different FS
4. editing its postgresql.auto.conf with the new data_directory path
5. re-creating the tablespace links
6. Starting the cluster with the new data path

Old/Actual data directory : /xxdata/aaaa/bbb/postgres/data
Data directory for restoration : /back/aaabbb/postgres/BRT_aaabbb/data
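
Step 5 (re-creating the tablespace links) can be sketched as a small shell function. This is only an illustration under assumptions: it relies on tablespace links being symlinks under `pg_tblspc` in the data directory, and the function name `relink_tablespaces` and the old/new path prefixes are hypothetical, not taken from the original procedure.

```shell
# Hypothetical helper: re-point every symlink in <data_dir>/pg_tblspc
# whose target starts with <old_prefix> so that it starts with <new_prefix>.
relink_tablespaces() {
    local data_dir=$1 old_prefix=$2 new_prefix=$3
    local link target new_target
    for link in "$data_dir"/pg_tblspc/*; do
        [ -L "$link" ] || continue             # skip anything that is not a symlink
        target=$(readlink "$link")
        case $target in
            "$old_prefix"*)
                new_target=$new_prefix${target#"$old_prefix"}
                ln -sfn "$new_target" "$link"  # replace the link itself, not its target
                echo "$link -> $new_target"
                ;;
        esac
    done
}
```

It would be called with the restored data directory and the two prefixes, for example `relink_tablespaces /back/aaabbb/postgres/BRT_aaabbb/data /xxdata/aaaa/bbb/postgres /back/aaabbb/postgres/BRT_aaabbb`, assuming the tablespace directories were restored under the same relative layout.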

 

OS Version : Red Hat Enterprise Linux Server release 6.8 (Santiago)

PPAS version is EnterpriseDB 9.5.0.5 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-55), 64-bit

and the recovery.conf looks like this:

restore_command = 'cp /back/aaabbb/postgres/BRT_aaabbb/walarch/%f %p'  # location where I kept all the latest archived WALs, copied from the actual archive location, /xxdata/aaaa/bbb/postgres/pg_xlog_arch
recovery_target_timeline = 'latest'

This had worked in all my past runs, but this time it failed with the error below:

Userid@hostname:/back/aaabbb/postgres/BRT_aaabbb/pg_log]$ cat enterprisedb-2018-04-19_160720.log
2018-04-19 16:07:20 CEST LOG:  database system was interrupted; last known up at 2018-04-19 15:36:38 CEST
2018-04-19 16:07:21 CEST LOG:

        ** EnterpriseDB Dynamic Tuning Agent ********************************************
        *       System Utilization: 66 %                                                *
        *         Database Version: 9.5.0.5                                             *
        *            Database Size: 0.1    GB                                           *
        *                      RAM: 3.8    GB                                           *
        *            Shared Memory: 3828   MB                                           *
        *       Max DB Connections: 112                                                 *
        *               Autovacuum: on                                                  *
        *       Autovacuum Naptime: 60   Seconds                                        *
        *********************************************************************************

cp: cannot stat `/back/aaabbb/postgres/BRT_aaabbb/walarch/00000002.history': No such file or directory
2018-04-19 16:07:22 CEST LOG:  starting archive recovery
2018-04-19 16:07:22 CEST LOG:  invalid primary checkpoint record
2018-04-19 16:07:22 CEST LOG:  invalid secondary checkpoint record
2018-04-19 16:07:22 CEST PANIC:  could not locate a valid checkpoint record
2018-04-19 16:07:22 CEST LOG:  startup process (PID 20624) was terminated by signal 6: Aborted
2018-04-19 16:07:22 CEST LOG:  aborting startup due to startup process failure
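
One check that may help narrow this down is whether the restored data directory still contains a `backup_label` file and whether the WAL segment it names is present in the directory that restore_command copies from. The sketch below assumes the usual first line of `backup_label` (`START WAL LOCATION: ... (file <segment>)`); the function name `check_start_wal` and its return codes are hypothetical.

```shell
# Hypothetical pre-flight check before starting the restored cluster.
# Returns 0 if the start segment is present, 1 if it is missing,
# 2 if backup_label is absent, 3 if backup_label cannot be parsed.
check_start_wal() {
    local data_dir=$1 wal_dir=$2 label wal_file
    label=$data_dir/backup_label
    if [ ! -f "$label" ]; then
        echo "backup_label missing in $data_dir"
        return 2
    fi
    # The first line is expected to look like:
    # START WAL LOCATION: 0/6B000028 (file 00000001000000000000006B)
    wal_file=$(sed -n 's/^START WAL LOCATION: .*(file \([0-9A-F]\{24\}\)).*/\1/p' "$label" | head -n 1)
    if [ -z "$wal_file" ]; then
        echo "could not parse $label"
        return 3
    fi
    if [ -f "$wal_dir/$wal_file" ]; then
        echo "found $wal_file"
        return 0
    fi
    echo "missing $wal_file"
    return 1
}
```

For the paths in this post it would be run as `check_start_wal /back/aaabbb/postgres/BRT_aaabbb/data /back/aaabbb/postgres/BRT_aaabbb/walarch`. A missing `backup_label` or a missing start segment are only two possible explanations for the PANIC, not a confirmed diagnosis.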

Note: When I ran du -sh * in /xxdata/aaaa/bbb/postgres/, I got the warnings below:

Userid@hostname /xxdata/aaaa/bbb/postgres]$ du -sh *
103M    data
44K     data_tblspc
8.0K    etc
8.0K    index_tblspc
0       nfstest
60K     pg_log
du: WARNING: Circular directory structure.
This almost certainly means that you have a corrupted file system.
NOTIFY YOUR SYSTEM MANAGER.
The following directory is part of the cycle:
  `pg_xlog_arch/.snapshot/AAABBB-20180419-235916'

du: WARNING: Circular directory structure.
This almost certainly means that you have a corrupted file system.
NOTIFY YOUR SYSTEM MANAGER.
The following directory is part of the cycle:
  `pg_xlog_arch/.snapshotAAABBB-20180421-000109'

I requested our storage team to check for any possible storage corruption and they responded saying " is running most recent version of firmware 8.2.5P1
I was unable to find any record related to corrupt NAS filesystem. We never had similar issue before."
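
If the `.snapshot` entries are snapshot directories exposed by the storage system (an assumption at this point), one way to keep them out of the copied WAL set is to copy only regular files from the top level of the archive directory. The function name `copy_wal_archive` is hypothetical, and the name filter assumes WAL, `.backup` and `.history` files all start with an uppercase hex digit.

```shell
# Hypothetical helper: copy archived WAL files from <src> to <dst>
# without descending into .snapshot or any other subdirectory.
copy_wal_archive() {
    local src=$1 dst=$2
    mkdir -p "$dst"
    # -maxdepth 1 stops find from entering subdirectories such as .snapshot;
    # -type f keeps regular files only; the name filter skips files
    # like nfstest that are not WAL-related.
    find "$src" -maxdepth 1 -type f -name '[0-9A-F]*' -exec cp -p {} "$dst"/ \;
}

# With GNU du, the snapshot directories can be skipped in a similar way:
#   du -sh --exclude=.snapshot *
```

For example, `copy_wal_archive /xxdata/aaaa/bbb/postgres/pg_xlog_arch /back/aaabbb/postgres/BRT_aaabbb/walarch` would populate the directory that restore_command reads from.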

I have since started the actual cluster from its original path, and no errors were thrown in the error log.

 

Could you please help and guide me on this issue?

 

Appreciate your quick response.

 

Thanks,

Roy

3 REPLIES
Moderator

Re: Restoration failed with PANIC: could not locate a valid checkpoint record



du: WARNING: Circular directory structure.
This almost certainly means that you have a corrupted file system.
NOTIFY YOUR SYSTEM MANAGER.
The following directory is part of the cycle:
  `pg_xlog_arch/.snapshotAAABBB-20180421-000109'

Hi Roy,

 

It does seem like there's something awry with your filesystem. What is your archive_command in postgresql.conf? Who created the pg_xlog_arch folder? And is there a way to see if the corresponding .snapshotAAABBB-20180421-000109 item is a folder or a file? (The error message seems to suggest it's a folder, but I just want to be sure.) Do you have a script that creates .snapshot#### files?

Roy
Adventurer

Re: Restoration failed with PANIC: could not locate a valid checkpoint record

Hello Richy,

 

Thanks for the quick turnaround.

 

The archive parameters set in the running cluster are as follows:

 

max_wal_senders = '2'
wal_level = 'archive'
archive_mode = 'on'
archive_command = ' test ! -f /xxdata/aaa/bbb/postgres/pg_xlog_arch/%f && cp %p /xxdata/aaa/bbb/postgres/pg_xlog_arch/%f'
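
For readers less familiar with this archive_command: `%p` is the path of the WAL file to archive and `%f` is its file name, and the `test ! -f ... && cp ...` form refuses to overwrite a segment that is already archived. The same logic can be sketched as a shell function; the name `archive_wal` is hypothetical and this is not the command the cluster actually runs.

```shell
# Hypothetical equivalent of: test ! -f <dst_dir>/%f && cp %p <dst_dir>/%f
# A non-zero return tells the server that archiving did not succeed.
archive_wal() {
    local src=$1 dst_dir=$2 name
    name=${src##*/}                       # file name part, like %f
    if [ -f "$dst_dir/$name" ]; then
        echo "refusing to overwrite $dst_dir/$name" >&2
        return 1
    fi
    cp "$src" "$dst_dir/$name"
}
```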

 

The OS admin created the filesystems:

  25G  103M   25G   1% /xxdata/aaa/bbb/postgres
  2.0G   33M  2.0G   2% /xxdata/aaa/bbb/postgres/pg_xlog_arch

 

Inside postgres, this is the folder structure, where you can see a .snapshot folder:
[userid@hostname /xxdata/aaa/bbb/postgres]$ ls -alhtr
total 44K
drwx------  3 aaabbb aaabbb 4.0K Mar  5 13:44 data_tblspc
drwx------  3 aaabbb aaabbb 4.0K Mar  5 13:44 index_tblspc
drwx------  3 aaabbb aaabbb 4.0K Mar  5 13:44 temp_tblspc
drwxrwxrwx  5 root     root     4.0K Mar  6 15:00 ..
drwxr-xr-x 10 aaabbb aaabbb 4.0K Apr 19 16:22 .
drwxr-xr-x  2 aaabbb aaabbb 4.0K Apr 19 16:23 etc
drwx------ 20 aaabbb aaabbb 4.0K Apr 19 16:23 data
drwxr-xr-x  2 aaabbb aaabbb 8.0K Apr 25 00:00 pg_log
drwxr-xr-x  3 aaabbb aaabbb 4.0K Apr 25 01:00 pg_xlog_arch
drwxrwxrwx 19 root     root     4.0K Apr 25 01:02 .snapshot
-rw-r--r--  1 root     root        0 Apr 25 07:30 nfstest

The same .snapshot folder also appears in pg_xlog_arch:

[userid@hostname /xxdata/aaa/bbb/postgres/pg_xlog_arch]$ ls -alhtr
total 33M
drwxr-xr-x 10 aaabbb aaabbb 4.0K Apr 19 16:22 ..
-rw-------  1 aaabbb aaabbb  16M Apr 25 01:00 00000001000000000000006A
-rw-------  1 aaabbb aaabbb  16M Apr 25 01:00 00000001000000000000006B
drwxr-xr-x  3 aaabbb aaabbb 4.0K Apr 25 01:00 .
-rw-------  1 aaabbb aaabbb  307 Apr 25 01:00 00000001000000000000006B.00000028.backup
drwxrwxrwx 19 root     root     4.0K Apr 25 01:02 .snapshot
-rw-r--r--  1 root     root        0 Apr 25 07:30 nfstest

I think the .snapshot folder holds the FS snapshot backups.

 

Hope this gives more clarity. And hoping to hear from you soon.

 

Thanks,

Roy

Moderator

Re: Restoration failed with PANIC: could not locate a valid checkpoint record

 


@Roy wrote:

I think the .snapshot folder holds the FS snapshot backups.


Hmm, seems like the circular directory structure warnings might be unrelated to your inability to start up the restored cluster.  I found a few other posts online with similar errors--are you using NFS and/or NetApp?  Some links for you to check out:

 

Regarding your inability to start up the restored cluster, it seems like you're working with a standby that was promoted at some point (hence the creation and expectation of a 00000002.history file). How do you take your backups? I wonder if you may be missing some stuff...or maybe the (promoted) standby didn't have archiving turned on?

 

For more information about the .history file and timelines and promotion, you may want to read an article on PostgreSQL timelines.
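
To see which timelines are actually present in a WAL archive, one option is to list the distinct timeline IDs, which are the first eight hex digits of each WAL, `.backup` or `.history` file name. The sketch below uses a hypothetical function name, `list_timelines`. As far as I understand, with recovery_target_timeline = 'latest' the server probes for the next timeline's history file during startup, so a failed `cp` of 00000002.history may not by itself mean that a promotion ever happened.

```shell
# Hypothetical helper: print the distinct timeline IDs found in a WAL archive.
list_timelines() {
    local wal_dir=$1 f
    for f in "$wal_dir"/*; do
        f=${f##*/}                        # strip the directory part
        case $f in
            [0-9A-F][0-9A-F][0-9A-F][0-9A-F][0-9A-F][0-9A-F][0-9A-F][0-9A-F]*)
                printf '%.8s\n' "$f"      # first 8 characters = timeline ID
                ;;
        esac
    done | sort -u
}
```

Running `list_timelines /back/aaabbb/postgres/BRT_aaabbb/walarch` and seeing only `00000001` would suggest the archive contains a single timeline.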