Auto-Failover Not Happening (EFM-3.4)

Adventurer

Good day, guys,

I'd like to ask for your help as I've been working at this for hours to no avail.

I have a Master, a Standby, and a Witness node, and the cluster status on every node indicates that they are syncing fine.

clusterstatussync.png


But when I stop the database cluster on the Master (using pg_ctl stop), auto-failover does not happen. Nothing happens either when I try a manual switchover (efm promote efm -switchover).

I've attached the efm.properties file and the efm.log log file on each node for your reference.

Regards,
Warren

PS.
The underlying DB engine is Postgres CE 11

db1 (Master) files: https://drive.google.com/open?id=1WPNncnC38J0Niy0uFJlcb4muQ7D3-UKn
db2 (Standby) files: https://drive.google.com/open?id=1hQrMYPFOHYDy2opxS69XToXRyGmGYcYL
witness files: https://drive.google.com/open?id=1AEy59Y3ZglQ2ZoCXJM5w94jj6X8mylq9

 

15 REPLIES
EDB Team Member

Re: Auto-Failover Not Happening (EFM-3.4)

Hi @wcruz,

 

We are not able to download the files from the links you provided. If the files are not large, could you please attach them directly to this thread?

 

Regards,

Sudhir

Adventurer

Re: Auto-Failover Not Happening (EFM-3.4)

Hi, Sudhir,

Is there an attach file option in the forum? All I can see is for photos.

Thanks.

 

Regards,

Warren

Adventurer

Re: Auto-Failover Not Happening (EFM-3.4)

Also, how do we determine if it's a JGroups issue?

I tested auto-failover in the DR cluster, and it worked there. I copied the config files from the DR nodes to PROD and changed the relevant IPs for PROD, but auto-failover still does not happen in PROD.
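
Since the PROD files started as copies of the DR files, one way to confirm that only the intended values changed is to compare the two files key by key. The following is a minimal sketch, assuming efm.properties uses plain key=value lines with # or ! comments; the function names and the sample keys and values in the usage note are illustrative and not taken from the actual files.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # or ! comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that contain no '=' at all
            props[key.strip()] = value.strip()
    return props


def diff_properties(left, right):
    """Return {key: (left_value, right_value)} for every key whose value differs.

    A key missing on one side is reported with None for that side.
    """
    keys = set(left) | set(right)
    return {
        key: (left.get(key), right.get(key))
        for key in sorted(keys)
        if left.get(key) != right.get(key)
    }
```

For example, calling diff_properties(parse_properties(dr_text), parse_properties(prod_text)) on the contents of the two files should list only the address-related keys. Any other key that shows up is worth a second look.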

Adventurer

Re: Auto-Failover Not Happening (EFM-3.4)

I'm also encountering this issue on the PROD Master: when I start the EFM service (systemctl start efm-3.4.service), the start-up fails, but the VIP is still added to the network interface. The admin process on port 7809 is also spawned, and running cluster-status on the Witness node shows the Master as present.

I would then have to kill the 7809 process and other EFM processes and start the service again. This time it will start successfully.

If I remove the VIP by restarting the network service (systemctl restart network), the EFM service fails to start again on the next attempt.
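
A quick way to see whether a leftover admin process is still holding port 7809 before starting the service again is to try connecting to that port locally. This is only a sketch: it assumes the admin process accepts TCP connections on the loopback address, which may not match the actual setup, and port_in_use is a hypothetical helper name.

```python
import socket


def port_in_use(port, host="127.0.0.1", timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise
        return sock.connect_ex((host, port)) == 0
```

If port_in_use(7809) returns True while the service is reported as failed, a stale process from the previous start attempt is probably still running and needs to be stopped first.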


EDB Team Member

Re: Auto-Failover Not Happening (EFM-3.4)

Hi wcruz,

 

Could you please look into the points below?

 

1. What error do you see when you run the "efm promote efm -switchover" command?

2. Could you please share the efm.properties file from all three nodes?

 

Regards

Siva.

 

Adventurer

Re: Auto-Failover Not Happening (EFM-3.4)

Hi, Siva,

That's the weird thing: I don't see any errors when I run the command, not even in the log file:

switchovercommand.png

These are the efm.properties files:
db1: https://ctrlv.it/txt/216628/3397291979
db2: https://ctrlv.it/txt/216629/1812239174
witness: https://ctrlv.it/txt/216630/1082399809

Thanks!

Regards,
Warren

Adventurer

Re: Auto-Failover Not Happening (EFM-3.4)

I restarted the 3 nodes in the EFM cluster and the problem went away! HAHAHAHA!

When all else fails.....REBOOT!

EDB Team Member

Re: Auto-Failover Not Happening (EFM-3.4)

Hi @wcruz,

 

Agreed, and glad to hear that the issue is solved. From the logs, it looks like communication was not happening between the nodes, which may have been resolved by the reboot.

 

Regards,

Sudhir

Adventurer

Re: Auto-Failover Not Happening (EFM-3.4)

My hunch is that the problem was in JGroups. Whenever the EFM processes did not shut down cleanly, I made sure to kill all of them. I didn't touch anything in JGroups, however.

How can I check whether JGroups is still OK?
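
One rough first check is whether each node can open a TCP connection to the cluster port of every other node. This is only a sketch under assumptions: the host names and port in the usage note are placeholders, the real address and port are whatever each node's efm.properties specifies, and a successful connect shows only that the port is reachable, not that JGroups membership is healthy.

```python
import socket


def check_peers(peers, timeout=3.0):
    """Map each (host, port) to None if a TCP connect succeeds, else the error text."""
    results = {}
    for host, port in peers:
        try:
            # create_connection raises OSError (including timeouts) on failure
            with socket.create_connection((host, port), timeout=timeout):
                results[(host, port)] = None
        except OSError as exc:
            results[(host, port)] = str(exc)
    return results
```

Running check_peers([("db2", 7800), ("witness", 7800)]) from db1, and the equivalent from the other two nodes, would show whether any direction is blocked, for example by a firewall rule that exists in PROD but not in DR.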

Adventurer

Re: Auto-Failover Not Happening (EFM-3.4)


@slonkar wrote:

Hi @wcruz,

 

Agreed, and glad to hear that the issue is solved. From the logs, it looks like communication was not happening between the nodes, which may have been resolved by the reboot.

 

Regards,

Sudhir


Hi, Sudhir,

If communication was not happening, how come "efm cluster-status efm" always reported the correct values?

Do you mean communication JGroups-wise?

 

Adventurer

Re: Auto-Failover Not Happening (EFM-3.4)

By the way, the OS on the DB nodes is RHEL 7, and on the Witness node it is CentOS 7.

EDB Team Member

Re: Auto-Failover Not Happening (EFM-3.4)

Hi @wcruz,

 

We recommend having all the servers on the same distribution of Linux.

 

To get a clearer picture of the possible root cause, could you please share the efm.properties files from all nodes and the DB logs of the Master and Standby from the time you were performing the test?

 

Regards,

Sudhir

 

EDB Team Member

Re: Auto-Failover Not Happening (EFM-3.4)

Hi wcruz,

 

To avoid delays in responses and to get a proper RCA for your issue, you can raise a case by writing to support@enterprisedb.com as the subscribed user.