cancel
Showing results for 
Search instead for 
Did you mean: 

EFM Failover issue when Master server restarts

Tom
Adventurer

EFM Failover issue when Master server restarts

I had setup the EFM 3.2 and have verified the manual promotion (switchover/switchback) successfully.
But when my master server gets restarted, failover is not happening. I can see this message in standby efm.log.

"Failover is disabled for the cluster until the master agent is restarted."

I have verified all the efm properties (autofailover and all other parameters), everything set up properly and looks fine. If i stop the master db manually, failover is happening. But failover is not happening when master server is restarted/unreachable. Can anyone help on this?

18 REPLIES
EDB Team Member

Re: EFM Failover issue when Master server restarts

Hi Tom,

 

When you restart the master server. EFM agent gets a clean shutdown. In this scenario, EFM doesn't consider this as master cluster failure and do not promote the standby.

Consider, for example, the case where a user wants to do some kind of maintenance on a database. The user stops the EFM agent, stops the database, does <whatever>, then starts the database and the agent. To the rest of the cluster, this is exactly the same as someone shutting down the node, whether on purpose or by accident.

 

This same has been updated in EFM document. 

Kindly refer below for the document link :

8.3 Master Agent Exits or Node Fails

 

Note: Kindly refer the last line of the Document.

 

Hope this helps!

 

Regards,

Dhananjay

Level 3 Adventurer

Re: EFM Failover issue when Master server restarts

Hi Dhananjay,

 

The reply below is a bit confusing: You have mentioned that its a clean shutdown and hence EFM does not consider this as a master failure and have then mentioned a link. However the link states the foll: 

Screen shot 2014-12-15 at 7

As per above, in case of master node restart, the answer to all 3 orange boxes will "NO", and hence standby will be promoted as master, isnt it?

 

Thanks 

Highlighted
EDB Team Member

Re: EFM Failover issue when Master server restarts

Hi, pcpg,

 

When we shut down the OS or gives manual agent shut down command for any planned activity. At that point, EFM gives a notification that Master agent has been shut down on purpose and it should wait till it gets restarted, for failover to re-enable. 

Kindly refer below for sample notification/logs :

 

 

4/5/18 1:45:02 PM com.enterprisedb.efm.nodes.EfmNode handleViewChange INFO: Master agent has shut down.
4/5/18 1:45:02 PM com.enterprisedb.efm.utils.Notifications sendMail INFO: Sending notification:
To: [dhananjay.jejurkar@enterprisedb.com]
Subject: [INFO] EFM Cluster efm notified that master node has left
Body:
EFM node:     192.168.56.103
Cluster name:  efm
Database name: edb
VIP support:   ENABLED 
 Failover is disabled for the cluster until the master agent is restarted.

 

Above scenario has not been covered in flow chart diagram in the documentation. There is only one liner update at the last of 8.3 section of the document. i.e: "Please note that stopping an agent does not signal the cluster that the agent has failed."

 

Hope this helps!

 

Regards,

Dhananjay.

Level 3 Adventurer

Re: EFM Failover issue when Master server restarts

Missed your note in the previous update, that its the last line of the section that applies. Makes sense.

 

Thanks

Tom
Adventurer

Re: EFM Failover issue when Master server restarts

Hi Dhananjay,

 

Thanks for the update. I have tested this scenario in 2 different environments. I have restarted my Master server (with init 6 / reboot). It is behaving differently.

 

RHEL - Automatic failover is happening,

SuSe Linux - Automatic failover is not happening.

 

Could you please check if this behaviour is something OS specific?

EDB Team Member

Re: EFM Failover issue when Master server restarts

Hi pcpg,

 

Thank you for your response. Can you please share EFM logs with us?

 

Thanks & Regards,

Ninad Shah

Tom
Adventurer

Re: EFM Failover issue when Master server restarts

I have tested it again and given the logs below. After master server manual restart, Automatic failover happened.

 

Cluster Setup:
Master - 192.168.137.101 (Linux1)
Standby - 192.168.137.201 (Linux2)
Witness - 192.168.137.30 (Linux3)

OS - RedHat Oracle Linux 6.9

 

1) Current Status:

 

[root@linux1 bin]]# date;./efm cluster-status efm;date;
Tue Aug 14 21:26:07 IST 2018
Cluster Status: efm

Agent Type Address Agent DB VIP
-----------------------------------------------------------------------
Master 192.168.137.101 UP UP 192.168.137.150*
Standby 192.168.137.201 UP UP 192.168.137.150
Witness 192.168.137.30 UP N/A 192.168.137.150

Allowed node host list:
192.168.137.201 192.168.137.101 192.168.137.30

Membership coordinator: 192.168.137.201

Standby priority host list:
192.168.137.201

Promote Status:

DB Type Address XLog Loc Info
--------------------------------------------------------------
Master 192.168.137.101 0/7D0001E0
Standby 192.168.137.201 0/7D0001E0

Standby database(s) in sync with master. It is safe to promote.
Tue Aug 14 21:26:09 IST 2018
[root@linux1 bin]#

 

2) Restarting the master node (Linux1)

 

[root@linux1 ~]# date;init 6;date;
Tue Aug 14 21:26:25 IST 2018
Tue Aug 14 21:26:25 IST 2018
[root@linux1 ~]#


3) Checking the efm cluster status on Linux2 (Failover happened and Linux2 became as master).

 

[root@linux2 bin]# date;./efm cluster-status efm;date;
Tue Aug 14 21:28:43 IST 2018
Cluster Status: efm

Agent Type Address Agent DB VIP
-----------------------------------------------------------------------
Master 192.168.137.201 UP UP 192.168.137.150*
Witness 192.168.137.30 UP N/A 192.168.137.150

Allowed node host list:
192.168.137.201 192.168.137.101 192.168.137.30

Membership coordinator: 192.168.137.201

Standby priority host list:
(List is empty.)

Promote Status:

DB Type Address XLog Loc Info
--------------------------------------------------------------
Master 192.168.137.201 0/7D000328

No standby databases were found.
Tue Aug 14 21:28:46 IST 2018
[root@linux2 bin]#


4) efm.log from Linux2 ( i.e., Old Standby & Current Master).

 

8/14/18 9:25:31 PM com.enterprisedb.efm.exec.ExecUtil performExec INFO: ProcessResult{exitValue=0, errorOut='', stdOut='PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.

--- 8.8.8.8 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2068ms
rtt min/avg/max/mdev = 65.959/91.051/133.599/30.247 ms'}

8/14/18 9:25:31 PM com.enterprisedb.efm.nodes.EfmAgent handleStatusCall INFO: responding to db status request with status: Standby%%192.168.137.201%%true%%true%%192.168.137.150%%false

8/14/18 9:25:31 PM com.enterprisedb.efm.DBMonitor checkOnce INFO: DB monitor testing jdbcSmiley Tongueostgresql://192.168.137.101:5445/edb?ApplicationName='efm-3.1'

8/14/18 9:26:08 PM com.enterprisedb.efm.nodes.EfmAgent handleStatusCall INFO: responding to db status request with status: Standby%%192.168.137.201%%true%%true%%192.168.137.150%%false

8/14/18 9:26:47 PM org.jgroups.protocols.TP sendToMembers SEVERE: JGRP000034: linux2-16624: failure sending message to linux1-62553: java.net.SocketTimeoutException: connect timed out

8/14/18 9:26:55 PM org.jgroups.protocols.TP sendToMembers SEVERE: JGRP000034: linux2-16624: failure sending message to linux1-62553: java.net.SocketTimeoutException: connect timed out

8/14/18 9:27:03 PM org.jgroups.protocols.TP sendToMembers SEVERE: JGRP000034: linux2-16624: failure sending message to linux1-62553: java.net.SocketTimeoutException: connect timed out

8/14/18 9:27:08 PM org.jgroups.protocols.TP sendToMembers SEVERE: JGRP000034: linux2-16624: failure sending message to linux1-62553: java.net.SocketTimeoutException: connect timed out

8/14/18 9:27:09 PM com.enterprisedb.efm.utils.ClusterStateVerifier$TimerThread run INFO: verifier thread exiting

8/14/18 9:27:09 PM com.enterprisedb.efm.nodes.EfmAgent handleStatusCall INFO: responding to db status request with status: Standby%%192.168.137.201%%true%%true%%192.168.137.150%%false

8/14/18 9:27:11 PM org.jgroups.protocols.TP sendToMembers SEVERE: JGRP000034: linux2-16624: failure sending message to linux1-62553: java.net.SocketTimeoutException: connect timed out

8/14/18 9:27:19 PM com.enterprisedb.efm.utils.ClusterUtils sendStatusRequestToNodes WARNING: Could not reach node(s): 192.168.137.101

8/14/18 9:27:19 PM com.enterprisedb.efm.nodes.EfmNode verifyClusterState WARNING: There was an error while obtaining status from some nodes. Will wait and attempt again

8/14/18 9:27:19 PM com.enterprisedb.efm.utils.ClusterStateVerifier$TimerThread run INFO: Cluster state verifier thread started.

8/14/18 9:27:19 PM org.jgroups.protocols.TP sendToMembers SEVERE: JGRP000034: linux2-16624: failure sending message to linux1-62553: java.net.SocketTimeoutException: connect timed out

8/14/18 9:27:27 PM org.jgroups.protocols.TP sendToMembers SEVERE: JGRP000034: linux2-16624: failure sending message to linux1-62553: java.net.SocketTimeoutException: connect timed out

8/14/18 9:27:30 PM com.enterprisedb.efm.nodes.EfmNode suspect WARNING: Address suspect: linux1-62553 at 192.168.137.101

8/14/18 9:27:30 PM com.enterprisedb.efm.nodes.EfmNode viewAccepted INFO: View changed. I am linux2-16624 Total jgroups nodes: 2

8/14/18 9:27:30 PM com.enterprisedb.efm.utils.ClusterStateVerifier$TimerThread run INFO: verifier thread exiting

8/14/18 9:27:30 PM com.enterprisedb.efm.nodes.EfmNode updateStableClusterSize INFO: Cluster size now 2, was 3.

8/14/18 9:27:30 PM com.enterprisedb.efm.nodes.EfmNode nodeStillUp INFO: Checking to see if node is responsive. Check type: DB_PING

8/14/18 9:27:30 PM com.enterprisedb.efm.utils.ClusterStateVerifier$TimerThread run INFO: Cluster state verifier thread started.

8/14/18 9:27:30 PM com.enterprisedb.efm.nodes.EfmNode$30 run INFO: Telling node linux3-43867 to ping DB_PING

8/14/18 9:27:30 PM com.enterprisedb.efm.utils.ClusterUtils getObjectFromNode INFO: Sending request DB_PING\|192.168.137.101 to node linux3-43867

8/14/18 9:27:30 PM com.enterprisedb.efm.DBMonitor checkOnce INFO: DB monitor testing jdbcSmiley Tongueostgresql://192.168.137.101:5445/edb?ApplicationName='efm-3.1'

8/14/18 9:27:30 PM com.enterprisedb.efm.DBMonitor checkOnce WARNING: Could not connect to database at 192.168.137.101: org.postgresql.util.PSQLException: The connection attempt failed., state: 08001, time: 178

8/14/18 9:27:35 PM com.enterprisedb.efm.nodes.EfmNode checkClusterStatus INFO: Sending status message to all nodes: node_status

8/14/18 9:27:35 PM com.enterprisedb.efm.nodes.EfmAgent handleStatusCall INFO: responding to db status request with status: Standby%%192.168.137.201%%true%%true%%192.168.137.150%%false

8/14/18 9:27:40 PM com.enterprisedb.efm.nodes.EfmNode nodeStillUp INFO: Cleaning thread db_ping_local in case still running.

8/14/18 9:27:40 PM com.enterprisedb.efm.nodes.EfmNode nodeStillUp INFO: Cleaning thread rem_ping_0 in case still running.

8/14/18 9:27:40 PM com.enterprisedb.efm.nodes.EfmNode nodeStillUp INFO: Cleaning thread ping_monitor in case still running.

8/14/18 9:27:40 PM com.enterprisedb.efm.nodes.EfmNode updateStableClusterSize INFO: At least one node was suspect before leaving. Starting timer to update known cluster size.

8/14/18 9:27:40 PM com.enterprisedb.efm.exec.ExecUtil performExec INFO: [/bin/ping -q -c3 -w5 8.8.8.8]

8/14/18 9:27:40 PM com.enterprisedb.efm.utils.ClusterSizeUpdater$TimerThread run INFO: updater thread started

8/14/18 9:27:41 PM com.enterprisedb.efm.admin.AdminServerThread processRequest INFO: admin service responding: Standby%%192.168.137.201%%true%%true%%192.168.137.150%%false||Witness%%192.168.137.30%%true%%false%%192.168.137.150%%false||192.168.137.201 192.168.137.101 192.168.137.30%%192.168.137.201%%192.168.137.201%%(List is empty.)

8/14/18 9:27:42 PM com.enterprisedb.efm.exec.ExecUtil performExec INFO: ProcessResult{exitValue=0, errorOut='', stdOut='PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.

--- 8.8.8.8 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2070ms
rtt min/avg/max/mdev = 69.248/74.933/81.956/5.282 ms'}

8/14/18 9:27:42 PM com.enterprisedb.efm.nodes.EfmNode handleViewChange SEVERE: Master node has disappeared.

8/14/18 9:27:42 PM com.enterprisedb.efm.nodes.EfmNode nodeStillUp INFO: Checking to see if node is responsive. Check type: DB_PING

8/14/18 9:27:42 PM com.enterprisedb.efm.nodes.EfmNode$30 run INFO: Telling node linux3-43867 to ping DB_PING

8/14/18 9:27:42 PM com.enterprisedb.efm.utils.ClusterUtils getObjectFromNode INFO: Sending request DB_PING\|192.168.137.101 to node linux3-43867

8/14/18 9:27:42 PM com.enterprisedb.efm.DBMonitor checkOnce INFO: DB monitor testing jdbcSmiley Tongueostgresql://192.168.137.101:5445/edb?ApplicationName='efm-3.1'

8/14/18 9:27:46 PM com.enterprisedb.efm.DBMonitor checkOnce WARNING: Could not connect to database at 192.168.137.101: org.postgresql.util.PSQLException: Connection to 192.168.137.101:5445 refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections., state: 08001, time: 4,001

8/14/18 9:27:50 PM com.enterprisedb.efm.nodes.EfmNode nodeStillUp INFO: Cleaning thread db_ping_local in case still running.

8/14/18 9:27:50 PM com.enterprisedb.efm.nodes.EfmNode nodeStillUp INFO: Cleaning thread rem_ping_0 in case still running.

8/14/18 9:27:50 PM com.enterprisedb.efm.nodes.EfmNode nodeStillUp INFO: Cleaning thread ping_monitor in case still running.

8/14/18 9:27:50 PM com.enterprisedb.efm.nodes.EfmNode nodeStillUp INFO: Checking to see if node is responsive. Check type: VIP_PING

8/14/18 9:27:50 PM com.enterprisedb.efm.nodes.EfmNode$30 run INFO: Telling node linux3-43867 to ping VIP_PING

8/14/18 9:27:50 PM com.enterprisedb.efm.utils.ClusterUtils getObjectFromNode INFO: Sending request VIP_PING\|192.168.137.150 to node linux3-43867

8/14/18 9:27:50 PM com.enterprisedb.efm.utils.ClusterUtils$1 run INFO: Starting vip ping.

8/14/18 9:27:50 PM com.enterprisedb.efm.exec.ExecUtil performExec INFO: [/bin/ping -q -c3 -w5 192.168.137.150]

8/14/18 9:27:50 PM com.enterprisedb.efm.utils.ClusterUtils$2 run INFO: Starting vip db ping.

8/14/18 9:27:55 PM com.enterprisedb.efm.exec.ExecUtil performExec WARNING: ProcessResult{exitValue=1, errorOut='', stdOut='PING 192.168.137.150 (192.168.137.150) 56(84) bytes of data.

--- 192.168.137.150 ping statistics ---
6 packets transmitted, 0 received, 100% packet loss, time 5000ms'}

8/14/18 9:27:55 PM com.enterprisedb.efm.utils.ClusterUtils pingHost WARNING: Unable to ping host: 192.168.137.150

8/14/18 9:27:57 PM com.enterprisedb.efm.nodes.EfmNode checkClusterStatus INFO: Sending status message to all nodes: node_status

8/14/18 9:27:57 PM com.enterprisedb.efm.nodes.EfmAgent handleStatusCall INFO: responding to db status request with status: Standby%%192.168.137.201%%true%%true%%192.168.137.150%%false

8/14/18 9:27:57 PM com.enterprisedb.efm.admin.AdminServerThread processRequest INFO: admin service responding: Standby%%192.168.137.201%%true%%true%%192.168.137.150%%false||Witness%%192.168.137.30%%true%%false%%192.168.137.150%%false||192.168.137.201 192.168.137.101 192.168.137.30%%192.168.137.201%%192.168.137.201%%(List is empty.)

8/14/18 9:28:00 PM com.enterprisedb.efm.nodes.EfmNode nodeStillUp INFO: Cleaning thread vip_ping_local in case still running.

8/14/18 9:28:00 PM com.enterprisedb.efm.nodes.EfmNode nodeStillUp INFO: Cleaning thread rem_ping_0 in case still running.

8/14/18 9:28:00 PM com.enterprisedb.efm.nodes.EfmNode nodeStillUp INFO: Cleaning thread ping_monitor in case still running.

8/14/18 9:28:00 PM com.enterprisedb.efm.utils.ClusterUtils isVipReachable INFO: Interrupted waiting on vip check latch.

8/14/18 9:28:00 PM com.enterprisedb.efm.utils.ClusterUtils$3 run INFO: wait_vip_checks interrupted

8/14/18 9:28:00 PM com.enterprisedb.efm.utils.ClusterUtils$3 run INFO: VIP checks completed. Releasing done latch to continue.

8/14/18 9:28:00 PM com.enterprisedb.efm.exec.ExecUtil performExec INFO: [/bin/ping -q -c3 -w5 8.8.8.8]

8/14/18 9:28:00 PM com.enterprisedb.efm.DBMonitor checkOnce WARNING: Database check timed out after 10 seconds. Thread stack to follow

8/14/18 9:28:00 PM com.enterprisedb.efm.DBMonitor checkOnce WARNING: timed_db_check in state RUNNABLE
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at org.postgresql.core.PGStream.<init>(PGStream.java:68)
at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:144)
at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:49)
at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:194)
at org.postgresql.Driver.makeConnection(Driver.java:450)
at org.postgresql.Driver.connect(Driver.java:252)
at java.sql.DriverManager.getConnection(DriverManager.java:664)
at java.sql.DriverManager.getConnection(DriverManager.java:208)
at com.enterprisedb.efm.DBMonitor.checkOnce(DBMonitor.java:467)
at com.enterprisedb.efm.DBMonitor$5.call(DBMonitor.java:409)
at com.enterprisedb.efm.DBMonitor$5.call(DBMonitor.java:406)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.lang.Thread.run(Thread.java:748)

8/14/18 9:28:00 PM com.enterprisedb.efm.DBMonitor checkOnce WARNING: Could not connect to database at 192.168.137.150: org.postgresql.util.PSQLException: The connection attempt failed., state: 08001, time: 10,013

8/14/18 9:28:02 PM com.enterprisedb.efm.exec.ExecUtil performExec INFO: ProcessResult{exitValue=0, errorOut='', stdOut='PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.

--- 8.8.8.8 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2059ms
rtt min/avg/max/mdev = 56.147/97.873/156.675/42.780 ms'}

8/14/18 9:28:02 PM com.enterprisedb.efm.nodes.EfmStandby masterFoundInCluster INFO: Checking to see if master has rejoined before promoting this standby.

8/14/18 9:28:02 PM com.enterprisedb.efm.nodes.EfmAgent handleStatusCall INFO: responding to db status request with status: Standby%%192.168.137.201%%true%%true%%192.168.137.150%%false

8/14/18 9:28:02 PM com.enterprisedb.efm.nodes.EfmStandby masterFoundInCluster INFO: Ignoring own node status.

8/14/18 9:28:02 PM com.enterprisedb.efm.nodes.EfmStandby masterFoundInCluster INFO: Found node of type Witness

8/14/18 9:28:02 PM com.enterprisedb.efm.nodes.EfmNode setLastPromoted INFO: Last promoted address set to 192.168.137.201

8/14/18 9:28:02 PM com.enterprisedb.efm.nodes.EfmAgent setAgentDelegate INFO: setAgentDelegate: I was Standby, will now be Promoting.

8/14/18 9:28:02 PM com.enterprisedb.efm.nodes.EfmAgent setAgentDelegate INFO: Setting agent to type Promoting

8/14/18 9:28:02 PM com.enterprisedb.efm.ClusterState removeAddress INFO: Removing standby address linux2-16624

8/14/18 9:28:02 PM com.enterprisedb.efm.ClusterState removeAddress INFO: Removing standby info 192.168.137.201 from failover priority list.

8/14/18 9:28:02 PM com.enterprisedb.efm.utils.VipMonitor stopMonitoring INFO: Stopping vip monitor

8/14/18 9:28:02 PM com.enterprisedb.efm.nodes.EfmAgent notifyNodesToRunPrePromotionScript INFO: Notifying all nodes to execute pre-promotion script

8/14/18 9:28:02 PM com.enterprisedb.efm.utils.RecoveryConfUtils writeTriggerFile INFO: Writing trigger file as specified in recovery.conf.

8/14/18 9:28:02 PM com.enterprisedb.efm.exec.ExecUtil performExec INFO: [sudo -u enterprisedb /usr/edb/efm-3.1/bin/efm_db_functions writetriggerfile efm]

8/14/18 9:28:02 PM com.enterprisedb.efm.nodes.EfmPromoting$1 run INFO: Will monitor database until it is no longer in recovery. At that time, will switch to master node.

8/14/18 9:28:02 PM com.enterprisedb.efm.exec.ExecUtil performExec INFO: ProcessResult{exitValue=0, errorOut='', stdOut=''}

8/14/18 9:28:02 PM com.enterprisedb.efm.utils.ClusterUtils acquireVip INFO: Attempting to assign 192.168.137.150 to this node on interface eth3.

8/14/18 9:28:02 PM com.enterprisedb.efm.exec.ExecUtil performExec INFO: [sudo /usr/edb/efm-3.1/bin/efm_address add4 eth3 192.168.137.150/255.255.255.0]

8/14/18 9:28:04 PM com.enterprisedb.efm.exec.ExecUtil performExec INFO: ProcessResult{exitValue=0, errorOut='', stdOut=''}

8/14/18 9:28:04 PM com.enterprisedb.efm.utils.Notifications sendMail INFO: Sending notification:
To: [test@test.com]
Subject: [INFO] EFM Assigning VIP to node 192.168.137.201
Body:
EFM node: 192.168.137.201
Cluster name: efm
Database name: edb
VIP: 192.168.137.150 (Active)

Assigning VIP 192.168.137.150 to node 192.168.137.201

Results:
exit status: 0

8/14/18 9:28:04 PM com.enterprisedb.efm.utils.Notifications sendMail INFO: Sending notification:
To: [test@test.com]
Subject: [WARNING] EFM Promotion has started on cluster efm
Body:
EFM node: 192.168.137.201
Cluster name: efm
Database name: edb
VIP: 192.168.137.150 (Active)

Promotion of standby has started on cluster efm.

8/14/18 9:28:06 PM com.enterprisedb.efm.utils.Notifications sendMail INFO: Sending notification:
To: [test@test.com]
Subject: [WARNING] EFM Failover has completed on cluster efm
Body:
EFM node: 192.168.137.201
Cluster name: efm
Database name: edb
VIP: 192.168.137.150 (Active)

Failover has completed on cluster efm.

8/14/18 9:28:06 PM com.enterprisedb.efm.nodes.EfmAgent notifyNodesOfNewMaster INFO: Informing all nodes of new master address so that they can reconfigure.

8/14/18 9:28:06 PM com.enterprisedb.efm.nodes.EfmNode handle INFO: Message to reconfigure to use new master ignored by non-standby/idle agent.

8/14/18 9:28:06 PM com.enterprisedb.efm.nodes.EfmAgent setAgentDelegate INFO: setAgentDelegate: I was Promoting, will now be Master.

8/14/18 9:28:06 PM com.enterprisedb.efm.utils.RecoveryConfFileMonitor startMonitoring INFO: Starting recovery.conf monitor for: /u01/10_pg_data/recovery.conf

8/14/18 9:28:06 PM com.enterprisedb.efm.nodes.EfmAgent setAgentDelegate INFO: Setting agent to type Master

8/14/18 9:28:06 PM com.enterprisedb.efm.nodes.EfmNode addToClusterStatus INFO: Replacing promoting node linux2-16624 with master linux2-16624.

8/14/18 9:28:06 PM com.enterprisedb.efm.nodes.EfmNode addToClusterStatus INFO: A new database node has joined the cluster. All nodes will verify that they can connect to the database.

8/14/18 9:28:06 PM com.enterprisedb.efm.utils.VipMonitor startMonitoring INFO: Starting master vip monitor (to ensure this node is assigned the VIP)

8/14/18 9:28:06 PM com.enterprisedb.efm.nodes.EfmAgent notifyNodesToRunPostPromotionScript INFO: Notifying all nodes to execute post-promotion script

8/14/18 9:28:06 PM com.enterprisedb.efm.DBMonitor checkOnce INFO: DB monitor testing jdbcSmiley Tongueostgresql://192.168.137.201:5445/edb?ApplicationName='efm-3.1'

8/14/18 9:28:44 PM com.enterprisedb.efm.nodes.EfmNode checkClusterStatus INFO: Sending status message to all nodes: node_status

8/14/18 9:28:44 PM com.enterprisedb.efm.nodes.EfmAgent handleStatusCall INFO: responding to db status request with status: Master%%192.168.137.201%%true%%true%%192.168.137.150%%true

8/14/18 9:28:44 PM com.enterprisedb.efm.admin.AdminServerThread processRequest INFO: admin service responding: Master%%192.168.137.201%%true%%true%%192.168.137.150%%true||Witness%%192.168.137.30%%true%%false%%192.168.137.150%%false||192.168.137.201 192.168.137.101 192.168.137.30%%192.168.137.201%%(List is empty.)%%(List is empty.)

8/14/18 9:29:10 PM com.enterprisedb.efm.utils.ClusterStateVerifier$TimerThread run INFO: verifier thread exiting

8/14/18 9:29:10 PM com.enterprisedb.efm.nodes.EfmAgent handleStatusCall INFO: responding to db status request with status: Master%%192.168.137.201%%true%%true%%192.168.137.150%%true

8/14/18 9:30:06 PM com.enterprisedb.efm.exec.ExecUtil performExec INFO: [sudo -u enterprisedb /usr/edb/efm-3.1/bin/efm_db_functions recoveryconfexists efm]

8/14/18 9:30:06 PM com.enterprisedb.efm.exec.ExecUtil performExec WARNING: ProcessResult{exitValue=1, errorOut='', stdOut=''}

 


5) efm.log from Witness (Linux3)


8/14/18 9:25:31 PM com.enterprisedb.efm.nodes.EfmWitness handleStatusCall INFO: responding to db status request with status: Witness%%192.168.137.30%%true%%false%%192.168.137.150%%false

8/14/18 9:25:31 PM com.enterprisedb.efm.DBMonitor checkOnce INFO: DB monitor testing jdbcSmiley Tongueostgresql://192.168.137.101:5445/edb?ApplicationName='efm-3.1'

8/14/18 9:25:31 PM com.enterprisedb.efm.nodes.EfmNode$8 run INFO: Reloading shared state per coordinator command

8/14/18 9:25:31 PM com.enterprisedb.efm.nodes.EfmNode setState INFO: setState called: [ master: linux1-62553 promoting: null standbys (1): linux2-16624 witnesses (1): linux3-43867 idle nodes (0): ]

8/14/18 9:26:08 PM com.enterprisedb.efm.nodes.EfmWitness handleStatusCall INFO: responding to db status request with status: Witness%%192.168.137.30%%true%%false%%192.168.137.150%%false

8/14/18 9:26:49 PM org.jgroups.protocols.TP sendToMembers SEVERE: JGRP000034: linux3-43867: failure sending message to linux1-62553: java.net.SocketTimeoutException: connect timed out

8/14/18 9:26:57 PM org.jgroups.protocols.TP sendToMembers SEVERE: JGRP000034: linux3-43867: failure sending message to linux1-62553: java.net.SocketTimeoutException: connect timed out

8/14/18 9:27:05 PM org.jgroups.protocols.TP sendToMembers SEVERE: JGRP000034: linux3-43867: failure sending message to linux1-62553: java.net.SocketTimeoutException: connect timed out

8/14/18 9:27:09 PM com.enterprisedb.efm.nodes.EfmWitness handleStatusCall INFO: responding to db status request with status: Witness%%192.168.137.30%%true%%false%%192.168.137.150%%false

8/14/18 9:27:13 PM org.jgroups.protocols.TP sendToMembers SEVERE: JGRP000034: linux3-43867: failure sending message to linux1-62553: java.net.SocketTimeoutException: connect timed out

8/14/18 9:27:21 PM org.jgroups.protocols.TP sendToMembers SEVERE: JGRP000034: linux3-43867: failure sending message to linux1-62553: java.net.SocketTimeoutException: connect timed out

8/14/18 9:27:29 PM org.jgroups.protocols.TP sendToMembers SEVERE: JGRP000034: linux3-43867: failure sending message to linux1-62553: java.net.SocketTimeoutException: connect timed out

8/14/18 9:27:30 PM com.enterprisedb.efm.nodes.EfmNode viewAccepted INFO: View changed. I am linux3-43867 Total jgroups nodes: 2

8/14/18 9:27:30 PM com.enterprisedb.efm.nodes.EfmNode updateStableClusterSize INFO: Cluster size now 2, was 3.

8/14/18 9:27:30 PM com.enterprisedb.efm.nodes.EfmNode updateStableClusterSize INFO: Decreasing stable cluster size to 2

8/14/18 9:27:30 PM com.enterprisedb.efm.exec.ExecUtil performExec INFO: [/bin/ping -q -c3 -w5 8.8.8.8]

8/14/18 9:27:32 PM com.enterprisedb.efm.exec.ExecUtil performExec INFO: ProcessResult{exitValue=0, errorOut='', stdOut='PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.

--- 8.8.8.8 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2057ms
rtt min/avg/max/mdev = 54.728/62.328/72.521/7.494 ms'}

8/14/18 9:27:40 PM com.enterprisedb.efm.DBMonitor checkOnce WARNING: Database check timed out after 10 seconds. Thread stack to follow

8/14/18 9:27:40 PM com.enterprisedb.efm.DBMonitor checkOnce WARNING: timed_db_check in state RUNNABLE
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at org.postgresql.core.PGStream.<init>(PGStream.java:68)
at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:144)
at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:49)
at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:194)
at org.postgresql.Driver.makeConnection(Driver.java:450)
at org.postgresql.Driver.connect(Driver.java:252)
at java.sql.DriverManager.getConnection(DriverManager.java:664)
at java.sql.DriverManager.getConnection(DriverManager.java:208)
at com.enterprisedb.efm.DBMonitor.checkOnce(DBMonitor.java:467)
at com.enterprisedb.efm.DBMonitor$5.call(DBMonitor.java:409)
at com.enterprisedb.efm.DBMonitor$5.call(DBMonitor.java:406)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.lang.Thread.run(Thread.java:748)

8/14/18 9:27:40 PM com.enterprisedb.efm.nodes.EfmNode handle INFO: Handle called for DB check. Returning false.

8/14/18 9:27:40 PM com.enterprisedb.efm.DBMonitor checkOnce WARNING: Could not connect to database at 192.168.137.101: org.postgresql.util.PSQLException: The connection attempt failed., state: 08001, time: 10,016

8/14/18 9:27:40 PM com.enterprisedb.efm.nodes.EfmWitness handleStatusCall INFO: responding to db status request with status: Witness%%192.168.137.30%%true%%false%%192.168.137.150%%false

8/14/18 9:27:43 PM com.enterprisedb.efm.nodes.EfmNode$8 run INFO: Reloading shared state per coordinator command

8/14/18 9:27:50 PM com.enterprisedb.efm.DBMonitor checkOnce WARNING: Could not connect to database at 192.168.137.101: org.postgresql.util.PSQLException: Connection to 192.168.137.101:5445 refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections., state: 08001, time: 7,022

8/14/18 9:27:50 PM com.enterprisedb.efm.nodes.EfmNode handle INFO: Handle called for DB check. Returning false.

8/14/18 9:27:50 PM com.enterprisedb.efm.nodes.EfmNode setState INFO: setState called: [ master: null promoting: null standbys (1): linux2-16624 witnesses (1): linux3-43867 idle nodes (0): ]

8/14/18 9:27:50 PM com.enterprisedb.efm.utils.ClusterUtils$1 run INFO: Starting vip ping.

8/14/18 9:27:50 PM com.enterprisedb.efm.exec.ExecUtil performExec INFO: [/bin/ping -q -c3 -w5 192.168.137.150]

8/14/18 9:27:50 PM com.enterprisedb.efm.utils.ClusterUtils$2 run INFO: Starting vip db ping.

8/14/18 9:27:53 PM com.enterprisedb.efm.DBMonitor checkOnce WARNING: Could not connect to database at 192.168.137.150: org.postgresql.util.PSQLException: The connection attempt failed., state: 08001, time: 2,990

8/14/18 9:27:53 PM com.enterprisedb.efm.exec.ExecUtil performExec WARNING: ProcessResult{exitValue=1, errorOut='', stdOut='PING 192.168.137.150 (192.168.137.150) 56(84) bytes of data.

--- 192.168.137.150 ping statistics ---
4 packets transmitted, 0 received, +4 errors, 100% packet loss, time 3008ms
pipe 4'}

8/14/18 9:27:53 PM com.enterprisedb.efm.utils.ClusterUtils pingHost WARNING: Unable to ping host: 192.168.137.150

8/14/18 9:27:53 PM com.enterprisedb.efm.utils.ClusterUtils$3 run INFO: VIP checks completed. Releasing done latch to continue.

8/14/18 9:27:53 PM com.enterprisedb.efm.nodes.EfmNode handle INFO: Handle called for VIP check for 192.168.137.150. Returning false.

8/14/18 9:27:57 PM com.enterprisedb.efm.nodes.EfmWitness handleStatusCall INFO: responding to db status request with status: Witness%%192.168.137.30%%true%%false%%192.168.137.150%%false

8/14/18 9:28:02 PM com.enterprisedb.efm.nodes.EfmWitness handleStatusCall INFO: responding to db status request with status: Witness%%192.168.137.30%%true%%false%%192.168.137.150%%false

8/14/18 9:28:02 PM com.enterprisedb.efm.nodes.EfmNode setLastPromoted INFO: Last promoted address set to 192.168.137.201

8/14/18 9:28:02 PM com.enterprisedb.efm.nodes.EfmNode handle INFO: Last promoted address set to 192.168.137.201

8/14/18 9:28:02 PM com.enterprisedb.efm.nodes.EfmNode$8 run INFO: Reloading shared state per coordinator command

8/14/18 9:28:02 PM com.enterprisedb.efm.nodes.EfmNode setState INFO: setState called: [ master: null promoting: linux2-16624 standbys (0): witnesses (1): linux3-43867 idle nodes (0): ]

8/14/18 9:28:06 PM com.enterprisedb.efm.nodes.EfmNode handle INFO: Message to reconfigure to use new master ignored by non-standby/idle agent.

8/14/18 9:28:06 PM com.enterprisedb.efm.nodes.EfmNode$8 run INFO: Reloading shared state per coordinator command

8/14/18 9:28:06 PM com.enterprisedb.efm.nodes.EfmNode setState INFO: setState called: [ master: linux2-16624 promoting: null standbys (0): witnesses (1): linux3-43867 idle nodes (0): ]

8/14/18 9:28:06 PM com.enterprisedb.efm.DBMonitor checkOnce INFO: DB monitor testing jdbcSmiley Tongueostgresql://192.168.137.201:5445/edb?ApplicationName='efm-3.1'

8/14/18 9:28:45 PM com.enterprisedb.efm.nodes.EfmWitness handleStatusCall INFO: responding to db status request with status: Witness%%192.168.137.30%%true%%false%%192.168.137.150%%false

8/14/18 9:29:10 PM com.enterprisedb.efm.nodes.EfmWitness handleStatusCall INFO: responding to db status request with status: Witness%%192.168.137.30%%true%%false%%192.168.137.150%%false

 

 

EDB Team Member

Re: EFM Failover issue when Master server restarts

Hi Tom,

 

Thank you for your response. After reviewing provided detail, We confirmed that this is a behavior with CentOS/RedHat 6.x. However, This is not happening with CentOS/RedHat 7.x.

Hope this may help you. Request you to let us know If provide solution satisfies you query.

 

Thanks & Regards,

Ninad Shah

Tom
Adventurer

Re: EFM Failover issue when Master server restarts

Ninad Shah,

 

Thanks for the update. As i mentioned previously, in another environment (SuSe Linux 12) automatic failover is not happening after master server manual restart (reboot/init 6). Is this expected behaviour in SuSe Linux 12?