Auto-Failover Not Happening (EFM-3.4)
Good day, guys,
I'd like to ask for your help as I've been working at this for hours to no avail.
I have a Master, a Standby, and a Witness node, and the cluster status, run on every node, indicates that they are in sync.
But when I stop the database cluster on the Master (using pg_ctl stop), auto-failover does not happen. Nothing happens either when I attempt a manual switchover (efm promote efm -switchover).
I've attached the efm.properties file and the efm.log log file on each node for your reference.
Regards,
Warren
PS.
The underlying DB engine is Postgres CE 11
db1 (Master) files: https://drive.google.com/open?id=1WPNncnC38J0Niy0uFJlcb4muQ7D3-UKn
db2 (Standby) files: https://drive.google.com/open?id=1hQrMYPFOHYDy2opxS69XToXRyGmGYcYL
witness files: https://drive.google.com/open?id=1AEy59Y3ZglQ2ZoCXJM5w94jj6X8mylq9
Re: Auto-Failover Not Happening (EFM-3.4)
Hi @wcruz,
We are unable to download the files from the links you provided. If the files are not large, could you please attach them directly to this thread?
Regards,
Sudhir
Re: Auto-Failover Not Happening (EFM-3.4)
Hi, Sudhir,
Is there an option to attach files in the forum? The only one I can see is for photos.
Thanks.
Regards,
Warren
Re: Auto-Failover Not Happening (EFM-3.4)
efm.log on Master: https://ctrlv.it/txt/216591/1548393556
efm.log on Standby: https://ctrlv.it/txt/216592/428387424
efm.log on Witness: https://ctrlv.it/txt/216593/1412515110
Re: Auto-Failover Not Happening (EFM-3.4)
Also, how do we determine if it's a JGroups issue?
I tested auto-FO in the DR cluster, and it worked there. I copied the config files from the DR nodes over to PROD and changed the relevant IPs for PROD, but auto-FO still does not happen in PROD.
Re: Auto-Failover Not Happening (EFM-3.4)
I'm also encountering an issue on the PROD Master: when I start the EFM service (systemctl start efm-3.4.service), the start-up fails, yet the VIP is successfully added to the network interface. The 7809 Admin process is also spawned, and running cluster-status on the Witness node shows the Master as having joined successfully.
I then have to kill the 7809 process and the other EFM processes and start the service again. This time it starts successfully.
If I remove the VIP by restarting the network service (systemctl restart network), the service fails to start again the next time I try to start EFM.
Re: Auto-Failover Not Happening (EFM-3.4)
Hi wcruz,
Could you please look into the points below?
1. What error do you see when you run the "efm promote efm -switchover" command?
2. Could you please share the efm.properties file from all three nodes?
Regards
Siva.
Re: Auto-Failover Not Happening (EFM-3.4)
Hi, Siva,
That's the weird thing: I don't see any errors when I run the command, nor in the log file.
These are the efm.properties files:
db1: https://ctrlv.it/txt/216628/3397291979
db2: https://ctrlv.it/txt/216629/1812239174
witness: https://ctrlv.it/txt/216630/1082399809
Thanks!
Regards,
Warren
Re: Auto-Failover Not Happening (EFM-3.4)
I restarted the 3 nodes in the EFM cluster and the problem went away! HAHAHAHA!
When all else fails.....REBOOT!
Re: Auto-Failover Not Happening (EFM-3.4)
Hi @wcruz,
Agreed, and glad to hear that the issue is resolved. From the logs, it looks like the nodes were not communicating with each other, which the reboot may have resolved.
Regards,
Sudhir
Re: Auto-Failover Not Happening (EFM-3.4)
My hunch is that the problem was in JGroups. Whenever the EFM processes were not shut down cleanly, I made sure to kill all of them. I didn't do anything with JGroups itself, however.
How can I check whether JGroups is still OK?
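One rough way to rule out basic network problems between the nodes is to attempt a plain TCP connection from each node to the port the EFM agent on every other node listens on. This is only a sketch of that idea and not a JGroups diagnostic: it shows whether a port accepts connections, not whether cluster membership is healthy. The function names are hypothetical, and the hosts and ports to test are assumptions that would have to be taken from the bind.address setting in each node's efm.properties.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_nodes(nodes: dict) -> dict:
    """Map each node name to the result of a TCP connection attempt.

    `nodes` maps a label to a (host, port) tuple, for example the
    address and port from bind.address on each cluster member.
    """
    return {name: can_connect(host, port) for name, (host, port) in nodes.items()}
```

Calling check_nodes with a dictionary such as {"db2": (host, port), "witness": (host, port)} from db1, and repeating from the other two nodes, gives a quick view of which directions are blocked. A False result points to a firewall, routing, or stale-process problem worth investigating before looking at JGroups itself.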
Re: Auto-Failover Not Happening (EFM-3.4)
@slonkar wrote:
Hi @wcruz,
Agreed and glad to hear that the issue solved. From the logs, it looks like the communication was not happening between the nodes that could have been resolved after the reboot.
Regards,
Sudhir
Hi, Sudhir,
If communication was not happening, how come "efm cluster-status efm" always gave out the correct values?
Do you mean communication JGroups-wise?
Re: Auto-Failover Not Happening (EFM-3.4)
By the way, the OS on the DB nodes is RHEL 7, and on the Witness node it is CentOS 7.
Re: Auto-Failover Not Happening (EFM-3.4)
Hi @wcruz,
We recommend having all the servers on the same distribution of Linux.
To get a clearer picture of the possible root cause, could you please share the efm.properties files from all nodes, along with the DB logs from the Master and Standby covering the time you were performing the test?
Regards,
Sudhir
Re: Auto-Failover Not Happening (EFM-3.4)
Hi, Sudhir,
These are the efm.properties files:
db1: https://ctrlv.it/txt/216628/3397291979
db2: https://ctrlv.it/txt/216629/1812239174
witness: https://ctrlv.it/txt/216630/1082399809
Master Postgres log: https://ctrlv.it/txt/216644/662311518
Standby Postgres log: https://ctrlv.it/txt/216645/3978953598
Re: Auto-Failover Not Happening (EFM-3.4)
Hi wcruz,
To avoid delays in responses and to get a proper RCA for your issue, you can raise a case by emailing "support@enterprisedb.com" as a subscribed user.
Re: Auto-Failover Not Happening (EFM-3.4)
Hi,
Please check that the entries below are present in your properties file.
FYI,
stop.isolated.master=true
stop.failed.master=true
auto.allow.hosts=true
stable.nodes.file=true
auto.failover=true
NOTE: With auto.failover=true, the standby should be promoted when the master goes down.
Thanks!!!
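The check suggested above can be scripted so it is easy to repeat on all three nodes. The following is a minimal sketch, assuming efm.properties uses simple key=value lines with # comments; the function names are hypothetical, and the expected values are just the ones listed in this post, which may not suit every environment.

```python
# Values taken from the list in this post; adjust for your own cluster.
EXPECTED = {
    "stop.isolated.master": "true",
    "stop.failed.master": "true",
    "auto.allow.hosts": "true",
    "stable.nodes.file": "true",
    "auto.failover": "true",
}


def parse_properties(text: str) -> dict:
    """Parse key=value lines, skipping blank lines and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def find_mismatches(props: dict, expected: dict = EXPECTED) -> dict:
    """Return {key: actual_value} for entries that differ from expected.

    A value of None means the key is missing from the file entirely.
    """
    return {
        key: props.get(key)
        for key, want in expected.items()
        if props.get(key) != want
    }
```

To use it, read the contents of efm.properties on each node, pass the text to parse_properties, and pass the result to find_mismatches. An empty dictionary means every listed entry is present with the value shown above.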
Re: Auto-Failover Not Happening (EFM-3.4)
Hi,
Could you please let me know whether this was a one-off issue or whether it is consistently reproducible in your environment? I saw in one of the responses that the issue went away after a restart.
--Ankit Shukla