Good day, guys,
I'd like to ask for your help as I've been working at this for hours to no avail.
I have a Master, a Standby, and a Witness node, and running the cluster status on each node indicates that they are syncing fine.
But when I stop the database cluster on the Master (using pg_ctl stop), auto-failover does not happen. Nothing happens either when I try a manual switchover (efm promote efm -switchover).
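For clarity, the steps I'm running look roughly like this (cluster name is "efm"; the binary and data directory paths below assume the default install locations):

    # check sync status on every node
    /usr/edb/efm-3.4/bin/efm cluster-status efm
    # stop the database cluster on the Master
    pg_ctl stop -D /var/lib/pgsql/11/data -m fast
    # or attempt a manual switchover
    /usr/edb/efm-3.4/bin/efm promote efm -switchover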
I've attached the efm.properties file and the efm.log file from each node for your reference.
The underlying DB engine is Postgres CE 11.
db1 (Master) files: https://drive.google.com/open?id=1WPNncnC38J0Niy0uFJlcb4muQ7D3-UKn
db2 (Standby) files: https://drive.google.com/open?id=1hQrMYPFOHYDy2opxS69XToXRyGmGYcYL
witness files: https://drive.google.com/open?id=1AEy59Y3ZglQ2ZoCXJM5w94jj6X8mylq9
We are not able to download the files from the links you provided. If they are not too big, could you please attach them to this thread instead?
Also, how do we determine if it's a JGroups issue?
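For instance, would confirming that the nodes can reach each other on the JGroups port from bind.address be a reasonable first check? Something like the following (7800 is just the common default port, not necessarily what we have configured):

    # on each node, check that the EFM Java process is listening on the JGroups port
    netstat -tlnp | grep 7800
    # from the other nodes, check that the port is actually reachable
    nc -zv <master-ip> 7800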
I tested auto-failover in the DR cluster, and it worked there. I copied the config files from the DR nodes over to PROD and modified the pertinent IPs for PROD, but auto-failover is still not happening in PROD.
I'm also encountering an issue on the PROD Master: when I start the EFM service (systemctl start efm-3.4.service), the start-up fails, yet the VIP is successfully added to the network interface. The 7809 Admin process is also spawned, and running a cluster-status on the Witness node shows the Master there successfully.
I then have to kill the 7809 process and the other EFM processes and start the service again; this second attempt succeeds.
If I remove the VIP by restarting the network service (systemctl restart network), the EFM service fails to start again on the next attempt.
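For reference, the workaround between attempts looks roughly like this (the interface name and VIP below are placeholders, not our real values):

    # find and kill the leftover Admin (7809) and other EFM processes
    ps -ef | grep [e]fm
    kill <pid>
    # check whether the VIP is still bound to the interface
    ip addr show eth0 | grep 10.0.0.50
    # start EFM again - this second attempt succeeds
    systemctl start efm-3.4.service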
Could you please look into the concerns below:
1. What error are you seeing when you run the "efm promote efm -switchover" command?
2. Could you please share the efm.properties file from all three nodes? (A quick way to pull the failover-related settings is shown below.)
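While you gather those, a quick way to extract just the failover-related settings for comparison is something like this (the path and file name assume a default EFM 3.4 install; adjust to wherever your cluster properties file actually lives):

    grep -E 'auto.failover|auto.reconfigure|bind.address|is.witness|virtual.ip|ping.server.ip' /etc/edb/efm-3.4/efm.properties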
Agreed, and glad to hear that the issue is solved. From the logs, it looks like communication was not happening between the nodes, which the reboot likely resolved.