Master and Slave Not Syncing after Failover Test

Level 3 Adventurer

Hi Guys,

 

I have configured EFM 3.2 with 3 servers (master, slave and witness).

 

To test failover, I stopped the network service on the master. Failover happened as expected: the slave came up as the new master, and the old master was no longer in the cluster.

 

EFM Status before Failover:

[root@ip-172-31-11-7 ~]# efm cluster-status efm-edb96
Cluster Status: efm-edb96

        Agent Type  Address              Agent  DB       VIP
        -----------------------------------------------------------------------
        Witness     172.31.11.7          UP     N/A
        Master      172.31.2.48          UP     UP
        Standby     172.31.4.185         UP     UP

Allowed node host list:
        172.31.11.7 172.31.4.185 172.31.2.48

Membership coordinator: 172.31.11.7

Standby priority host list:
        172.31.4.185

Promote Status:

        DB Type     Address              XLog Loc         Info
        --------------------------------------------------------------
        Master      172.31.2.48          0/2800BE10
        Standby     172.31.4.185         0/2800BE10

        Standby database(s) in sync with master. It is safe to promote.

 

EFM Status after Failover:

[root@ip-172-31-11-7 ~]# efm cluster-status efm-edb96
Cluster Status: efm-edb96

        Agent Type  Address              Agent  DB       VIP
        -----------------------------------------------------------------------
        Witness     172.31.11.7          UP     N/A
        Master      172.31.4.185         UP     UP

Allowed node host list:
        172.31.11.7 172.31.4.185 172.31.2.48

Membership coordinator: 172.31.11.7

Standby priority host list:
        (List is empty.)

Promote Status:

        DB Type     Address              XLog Loc         Info
        --------------------------------------------------------------
        Master      172.31.4.185         0/2800C050

        No standby databases were found.

 

Then I brought up the old master and changed recovery.conf (adding the trigger file and pointing the host to the new master) and postgresql.conf (hot_standby=on). After a restart of the EDB and EFM services, the old master came up as a standby as expected, but the two servers were not in sync.

 

[root@ip-172-31-11-7 ~]# efm cluster-status efm-edb96
Cluster Status: efm-edb96

        Agent Type  Address              Agent  DB       VIP
        -----------------------------------------------------------------------
        Witness     172.31.11.7          UP     N/A
        Standby     172.31.2.48          UP     UP
        Master      172.31.4.185         UP     UP

Allowed node host list:
        172.31.11.7 172.31.4.185 172.31.2.48

Membership coordinator: 172.31.11.7

Standby priority host list:
        172.31.2.48

Promote Status:

        DB Type     Address              XLog Loc         Info
        --------------------------------------------------------------
        Master      172.31.4.185         0/2800C2F0
        Standby     172.31.2.48          0/29000098

        One or more standby databases are not in sync with the master database.

 

Upon checking the logs in pg_log, I found the errors below:

 

2018-10-11 02:47:36 EDT FATAL:  could not start WAL streaming: ERROR:  requested starting point 0/29000000 on timeline 4 is not in this server's history
        DETAIL:  This server's history forked from timeline 4 at 0/2800BEF0.

usage: scp [-12346BCpqrv] [-c cipher] [-F ssh_config] [-i identity_file]
           [-l limit] [-o ssh_option] [-P port] [-S program]
           [[user@]host1:]file1 ... [[user@]host2:]file2
usage: scp [-12346BCpqrv] [-c cipher] [-F ssh_config] [-i identity_file]
           [-l limit] [-o ssh_option] [-P port] [-S program]
           [[user@]host1:]file1 ... [[user@]host2:]file2
usage: scp [-12346BCpqrv] [-c cipher] [-F ssh_config] [-i identity_file]
           [-l limit] [-o ssh_option] [-P port] [-S program]
           [[user@]host1:]file1 ... [[user@]host2:]file2
2018-10-11 02:47:36 EDT LOG:  new timeline 5 forked off current database system timeline 4 before current recovery point 0/29000098

 

Below files for reference:

recovery.done on New Master:

[root@ip-172-31-4-185 data]# more recovery.done
standby_mode = 'on'
primary_conninfo = 'user=edbrepuser password=password host=172.31.2.48 port=5442 sslmode=prefer sslcompression=1 krbsrvname=postgres target_session_attrs=any'
trigger_file='/EDB/edb-postgress/data/failover_trigger'
restore_command = 'scp enterprisedb@172.31.2.48:/EDB/edb-postgress/data/archive/%f'
recovery_target_timeline = 'latest'

recovery.conf on Old Master (New Standby):

[root@ip-172-31-2-48 data]# more recovery.conf
# EDB Failover Manager
# This generated recovery.conf file prevents the db server from accidentally
# being restarted as a master since a failover or promotion has occurred
#standby_mode = on
#restore_command = 'echo 2>"recovery suspended on failed server node"; exit 1'
standby_mode = 'on'
primary_conninfo = 'user=edbrepuser password=password host=172.31.4.185 port=5442 sslmode=prefer sslcompression=1 krbsrvname=postgres target_session_attrs=any'
trigger_file='/EDB/edb-postgress/data/failover_trigger'
restore_command = 'scp enterprisedb@172.31.4.185:/EDB/edb-postgress/data/archive/%f'
recovery_target_timeline = 'latest'

 

Is pg_basebackup needed to bring both servers back in sync, or is there an alternative?

 

Please help me understand this.

 

Regards,

Manisha

 

 

 

5 REPLIES
EDB Team Member

Re: Master and Slave Not Syncing after Failover Test

Hi Manisha,

 

This type of timeline issue occurs after a switchover or failover because archiving happens on only one server (the master), so you have to change the restore_command every time a switchover or failover happens.

 

We recommend either a shared mount point for the archives, or copying the archives to both the local (master) and the remote (standby) server.

 

With either setup, there is no need to change the restore_command in the recovery.conf file, and these timeline and missing-history-file issues will not occur after a failover.

 

If you have a test environment, please test the EFM failover/switchover after applying the suggested archiving method.

 

Please let us know in case of any issues/queries.

 

 Regards,

Sudhir

Level 3 Adventurer

Re: Master and Slave Not Syncing after Failover Test

@slonkar, Thank you very much for your response. 

 

As recommended, we would like to implement archival copy to local (Master) as well as remote (Standby) server.

We understand that archiving is currently happening only on the master server and not on the standby. Could you please guide me on how to implement the archive copy on the standby as well?

 

Please correct me if my understanding is wrong.

 

EDB Team Member

Re: Master and Slave Not Syncing after Failover Test

Hi Manisha,

 

To enable archiving to both the local (master) and remote (standby) server, you can set the archive_command as below:
archive_command = 'cp %p /<archive_directory>/%f && scp %p enterprisedb@<standby_ip>:/<archive_directory>/%f'

 

NOTE :
1) Please change the <archive_directory> and <standby_ip> as per your environment.
2) After changing the archive_command parameter in the postgresql.conf file, you need to reload the database cluster.
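
As a hedged sketch of the same idea, the one-line archive_command above could be moved into a small wrapper script, so that a retry after a failed remote copy is not blocked by the local copy that already succeeded. The script name archive_wal.sh, the function name archive_wal, and its argument order are assumptions made for illustration and are not part of EFM or PostgreSQL. The sketch also assumes passwordless SSH for the enterprisedb user between the two servers.

```shell
#!/bin/sh
# archive_wal.sh (hypothetical helper, not shipped with EFM or PostgreSQL).
# Possible use in postgresql.conf, with placeholders as in the reply above:
#   archive_command = '/path/to/archive_wal.sh %p %f /<archive_directory> enterprisedb@<standby_ip>:/<archive_directory>'

archive_wal() {
    src="$1"          # %p: path of the WAL file to archive
    name="$2"         # %f: file name only
    local_dir="$3"    # local archive directory
    remote="${4:-}"   # optional remote target, user@host:/directory

    if [ -f "$local_dir/$name" ]; then
        # Already archived on an earlier attempt: accept only an identical copy.
        cmp -s "$src" "$local_dir/$name" || return 1
    else
        cp "$src" "$local_dir/$name" || return 1
    fi

    if [ -n "$remote" ]; then
        # Remote copy; any failure makes the whole command fail so that
        # the server keeps the segment and retries later.
        scp -q "$src" "$remote/$name" || return 1
    fi
    return 0
}

# Run only when called with arguments, as archive_command would do.
if [ "$#" -ge 3 ]; then
    archive_wal "$@"
fi
```

The function returns zero only when every requested copy succeeded, which is what the server expects from archive_command before it recycles a segment.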

Please let us know in case of any further issues/queries.

 

Regards,

Sudhir

Level 3 Adventurer

Re: Master and Slave Not Syncing after Failover Test

Hi Sudhir,

 

We implemented the archive copy by changing the archive_command in postgresql.conf on both servers, but the issue remains the same.

 

Below error:

[root@ip-172-31-2-48 pg_log]# tail -500 enterprisedb-2018-10-12_055613.log
2018-10-12 05:56:13 EDT LOG:

        ** EnterpriseDB Dynamic Tuning Agent ********************************************
        *       System Utilization: 66 %                                                *
        *         Database Version: 9.6.10.17                                           *
        *            Database Size: 0.1    GB                                           *
        *                      RAM: 1.0    GB                                           *
        *            Shared Memory: 992    MB                                           *
        *       Max DB Connections: 112                                                 *
        *               Autovacuum: on                                                  *
        *       Autovacuum Naptime: 60   Seconds                                        *
        *********************************************************************************

2018-10-12 05:56:13 EDT LOG:  database system was shut down in recovery at 2018-10-12 05:51:03 EDT
usage: scp [-12346BCpqrv] [-c cipher] [-F ssh_config] [-i identity_file]
           [-l limit] [-o ssh_option] [-P port] [-S program]
           [[user@]host1:]file1 ... [[user@]host2:]file2
usage: scp [-12346BCpqrv] [-c cipher] [-F ssh_config] [-i identity_file]
           [-l limit] [-o ssh_option] [-P port] [-S program]
           [[user@]host1:]file1 ... [[user@]host2:]file2
2018-10-12 05:56:13 EDT LOG:  entering standby mode
usage: scp [-12346BCpqrv] [-c cipher] [-F ssh_config] [-i identity_file]
           [-l limit] [-o ssh_option] [-P port] [-S program]
           [[user@]host1:]file1 ... [[user@]host2:]file2
usage: scp [-12346BCpqrv] [-c cipher] [-F ssh_config] [-i identity_file]
           [-l limit] [-o ssh_option] [-P port] [-S program]
           [[user@]host1:]file1 ... [[user@]host2:]file2
usage: scp [-12346BCpqrv] [-c cipher] [-F ssh_config] [-i identity_file]
           [-l limit] [-o ssh_option] [-P port] [-S program]
           [[user@]host1:]file1 ... [[user@]host2:]file2
2018-10-12 05:56:13 EDT FATAL:  requested timeline 5 is not a child of this server's history
2018-10-12 05:56:13 EDT DETAIL:  Latest checkpoint is at 0/29000028 on timeline 4, but in the history of the requested timeline, the server forked off from that timeline at 0/2800BEF0.
2018-10-12 05:56:13 EDT LOG:  startup process (PID 4579) exited with exit code 1
2018-10-12 05:56:13 EDT LOG:  aborting startup due to startup process failure
2018-10-12 05:56:13 EDT LOG:  database system is shut down
[root@ip-172-31-2-48 pg_log]# more startup.log
WARNING --> PERL_INSTALL_PATH is not set in /EDB/edb-postgress/etc/sysconfig/plLanguages.config file
WARNING --> PYTHON_INSTALL_PATH is not set in /EDB/edb-postgress/etc/sysconfig/plLanguages.config file
WARNING --> TCL_INSTALL_PATH is not set in /EDB/edb-postgress/etc/sysconfig/plLanguages.config file
2018-10-12 05:56:13 EDT LOG:  redirecting log output to logging collector process
2018-10-12 05:56:13 EDT HINT:  Future log output will appear in directory "pg_log".

 

So we tried to reload the cluster (i.e. streaming replication) with the archive command as suggested, but we are facing an error in pg_basebackup.

 

Error:

[root@ip-172-31-2-48 bin]# ./pg_basebackup -R -D /EDB/edb-postgress/data -h 172.31.4.185 -p5442 -U edbrepuser -P
Password:
787331/787331 kB (100%), 1/1 tablespace
NOTICE:  pg_stop_backup cleanup done, waiting for required WAL segments to be archived
WARNING:  pg_stop_backup still waiting for all required WAL segments to be archived (60 seconds elapsed)
HINT:  Check that your archive_command is executing properly.  pg_stop_backup can be canceled safely, but the database backup will not be usable without all the WAL segments.
WARNING:  pg_stop_backup still waiting for all required WAL segments to be archived (120 seconds elapsed)
HINT:  Check that your archive_command is executing properly.  pg_stop_backup can be canceled safely, but the database backup will not be usable without all the WAL segments.
WARNING:  pg_stop_backup still waiting for all required WAL segments to be archived (240 seconds elapsed)
HINT:  Check that your archive_command is executing properly.  pg_stop_backup can be canceled safely, but the database backup will not be usable without all the WAL segments.
WARNING:  pg_stop_backup still waiting for all required WAL segments to be archived (480 seconds elapsed)
HINT:  Check that your archive_command is executing properly.  pg_stop_backup can be canceled safely, but the database backup will not be usable without all the WAL segments.
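
One way to follow the HINT above is to look at the archiver's status files on the server the backup is taken from. In 9.6, each WAL segment that is still waiting for archive_command to succeed has a .ready file under pg_xlog/archive_status in the data directory. The sketch below counts those files. The function name pending_wal_count is hypothetical.

```shell
#!/bin/sh
# pending_wal_count (hypothetical helper): print how many WAL segments are
# still waiting to be archived, based on the .ready files in
# <data_dir>/pg_xlog/archive_status (the 9.x directory layout).

pending_wal_count() {
    status_dir="$1/pg_xlog/archive_status"
    if [ ! -d "$status_dir" ]; then
        echo "no such directory: $status_dir" >&2
        return 1
    fi
    count=0
    for f in "$status_dir"/*.ready; do
        # When nothing matches, the pattern stays literal and -e is false.
        [ -e "$f" ] && count=$((count + 1))
    done
    echo "$count"
}
```

With the data directory from this thread, it would be called as `pending_wal_count /EDB/edb-postgress/data`. A count that never drops suggests that archive_command keeps failing, and the server log should then show the failing command.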

 

Postgresql.conf :

archive_mode = on       # enables archiving; off, on, or always
                                # (change requires restart)
#archive_command = 'cp %p /EDB/edb-postgress/data/archive/%f'           # command to use to archive a logfile segment
archive_command = 'cp %p /EDB/edb-postgress/data/archive/%f && scp %p enterprisedb@172.31.2.48:/EDB/edb-postgress/data/archive/%f'
                                # placeholders: %p = path of file to archive
                                #               %f = file name only
                                # e.g. 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'

 

Please suggest.

Regards,

Manisha

EDB Team Member

Re: Master and Slave Not Syncing after Failover Test

Hi Manisha,

 

Please find the answers to your queries below:

 

1) Regarding the pg_basebackup message: it is not an error, just a warning that pg_stop_backup will complete only after all the WAL segments generated during the backup have been archived. Also, check the database logs to confirm that archiving is working correctly.

 

2) Regarding the timeline-related error: you need to rebuild streaming replication from the master. My recommendations for the archive_command and restore_command apply to the initial setup, before any failover has happened.

 

3) After setting up streaming, please verify that archiving is happening on both servers. The restore_command should point to the local archive location only.
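
Point 3 could look like the sketch below for the rebuilt standby. It writes a recovery.conf whose restore_command reads from the local archive directory and passes both %f and %p. The restore_command posted earlier in the thread has no destination operand, which is consistent with the repeated scp usage messages in the logs. The function name write_recovery_conf is hypothetical, the host, port and paths are taken from this thread, and the password is left out on the assumption that it is supplied through a .pgpass file.

```shell
#!/bin/sh
# write_recovery_conf (hypothetical helper): write a 9.6-style recovery.conf
# whose restore_command reads WAL from the LOCAL archive directory.

write_recovery_conf() {
    data_dir="$1"       # standby data directory
    primary_host="$2"   # address of the current master
    archive_dir="$3"    # local archive directory on this standby

    # %f and %p are left literal for the server to substitute at run time.
    cat > "$data_dir/recovery.conf" <<EOF
standby_mode = 'on'
primary_conninfo = 'user=edbrepuser host=$primary_host port=5442'
trigger_file = '$data_dir/failover_trigger'
restore_command = 'cp $archive_dir/%f %p'
recovery_target_timeline = 'latest'
EOF
}
```

For the old master in this thread, the call would be `write_recovery_conf /EDB/edb-postgress/data 172.31.4.185 /EDB/edb-postgress/data/archive`. Because the command reads only from the local directory, it does not need to change when the roles swap again.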

 

Regards,

Sudhir