cancel
Showing results for 
Search instead for 
Did you mean: 

Root cause of Master down.

Level 3 Adventurer

Root cause of Master down.

Hi DBA's

 

1.Here Master is down..slave promote as Master with the responsible of EFM

2.Start the new slave and get into sync.

 

The proble is What is the Root Cause?

 

Database logs:-

at 2019-02-14 16:54:00 SGT - from Application [[unknown]] on Server[10.138.98.41(56186)] using User [[unknown]] for dbname [[unknown]] process id [28507] returned SQLSTATE [00000]: LOG: connection received: host=10.138.98.41 port=56186
at 2019-02-14 16:54:00 SGT - from Application [] on Server[] using User [] for dbname [] process id [19317] returned SQLSTATE [58P01]: LOG: could not open file "postmaster.pid": No such file or directory
at 2019-02-14 16:54:00 SGT - from Application [] on Server[] using User [] for dbname [] process id [19317] returned SQLSTATE [00000]: LOG: performing immediate shutdown because data directory lock file is invalid
at 2019-02-14 16:54:00 SGT - from Application [] on Server[] using User [] for dbname [] process id [19317] returned SQLSTATE [00000]: LOG: received immediate shutdown request
at 2019-02-14 16:54:00 SGT - from Application [[unknown]] on Server[10.138.98.41(56186)] using User [postgres] for dbname [XXXXX] process id [28507] returned SQLSTATE [57P02]: WARNING: terminating connection because of crash of another server process

Os logs:-
Feb 14 16:51:52 prod-master01 xinetd[8376]: START: nrpe pid=28123 from=::ffff:10.138.90.205
Feb 14 16:51:52 prod-master01 xinetd[8376]: EXIT: nrpe status=0 pid=28123 duration=0(sec)
Feb 14 16:52:08 prod-master01 xinetd[8376]: START: nrpe pid=28172 from=::ffff:10.138.90.205
Feb 14 16:52:09 prod-master01 xinetd[8376]: EXIT: nrpe status=0 pid=28172 duration=1(sec)
Feb 14 16:53:18 prod-master01 xinetd[8376]: START: nrpe pid=28342 from=::ffff:10.138.90.205
Feb 14 16:53:18 prod-master01 xinetd[8376]: EXIT: nrpe status=0 pid=28342 duration=1(sec)
Feb 14 16:53:40 prod-master01 systemd: Starting EDB Postgres Advanced Server 9.6...
Feb 14 16:53:40 prod-master01 bash: touch: cannot touch ‘/DATA/9.6/data/pg_log/startup.log’: No such file or directory
Feb 14 16:53:40 prod-master01 bash: /usr/lib/systemd/system/edb-as-9.6.sh: line 6: /DATA/9.6/data/pg_log/startup.log: No such file or directory
Feb 14 16:53:40 prod-master01 bash: WARNING --> PERL_INSTALL_PATH is not set in /POSTGRES/9.6/as9.6/etc/sysconfig/plLanguages.config file
Feb 14 16:53:40 prod-master01 bash: WARNING --> PYTHON_INSTALL_PATH is not set in /POSTGRES/9.6/as9.6/etc/sysconfig/plLanguages.config file
Feb 14 16:53:40 prod-master01 bash: WARNING --> TCL_INSTALL_PATH is not set in /POSTGRES/9.6/as9.6/etc/sysconfig/plLanguages.config file
Feb 14 16:53:40 prod-master01 bash: pg_ctl: another server might be running; trying to start server anyway
Feb 14 16:53:41 prod-master01 bash: waiting for server to start..../bin/sh: /DATA/9.6/data/pg_log/startup.log: No such file or directory
Feb 14 16:53:42 prod-master01 bash: stopped waiting
Feb 14 16:53:42 prod-master01 bash: pg_ctl: could not start server
Feb 14 16:53:42 prod-master01 bash: Examine the log output.
Feb 14 16:53:42 prod-master01 systemd: edb-as-9.6.service: control process exited, code=exited status=1
Feb 14 16:53:42 prod-master01 systemd: Failed to start EDB Postgres Advanced Server 9.6.
Feb 14 16:53:42 prod-master01 systemd: Unit edb-as-9.6.service entered failed state.
Feb 14 16:53:42 prod-master01 systemd: edb-as-9.6.service failed.

 

EFM logs
2/14/19 4:51:53 PM org.jgroups.logging.JDKLogImpl trace FINER: prod-master01-18103: sending 1 msgs (35 bytes (0.05 of max_bundle_size) to 1 dests(s): [efm]
2/14/19 4:51:53 PM org.jgroups.logging.JDKLogImpl trace FINER: dest=10.138.98.43:9006 (38 bytes)
2/14/19 4:51:53 PM org.jgroups.logging.JDKLogImpl trace FINER: dest=10.138.98.42:9006 (38 bytes)
2/14/19 4:51:55 PM org.jgroups.logging.JDKLogImpl trace FINER: prod-master01-18103: received [dst: <null>, src: prod-master03-43102 (2 headers), size=0 bytes, flags=INTERNAL], headers are FD_ALL: heartbeat, TCP: [cluster_name=efm]
2/14/19 4:52:00 PM com.enterprisedb.efm.DBMonitor checkOnce FINEST: DBMonitor testing jdbc:postgresql://10.138.98.41:5006/efm_check_db?ApplicationName='efm-2.1'
2/14/19 4:52:00 PM com.enterprisedb.efm.DBMonitor checkOnce FINEST: Acquired connection after 90 ms
2/14/19 4:52:00 PM com.enterprisedb.efm.DBMonitor checkOnce FINEST: Prepared statement after 0 ms
2/14/19 4:52:00 PM com.enterprisedb.efm.DBMonitor checkDB FINEST: Query took 9 ms.
2/14/19 4:52:00 PM com.enterprisedb.efm.DBMonitor$4$1 run FINEST: Updating timestamp.
2/14/19 4:52:01 PM org.jgroups.logging.JDKLogImpl trace FINER: prod-master01-18103: received [dst: <null>, src: prod-master02-55269 (2 headers), size=49 bytes, flags=OOB|DONT_BUNDLE|INTERNAL], headers are TCPPING: [type=GET_MBRS_REQ, cluster=efm], TCP: [cluster_name=efm]
2/14/19 4:52:01 PM org.jgroups.logging.JDKLogImpl trace FINER: prod-master01-18103: received GET_MBRS_REQ from prod-master02-55269, sending response prod-master01-18103, name=prod-master01-18103, addr=10.138.98.41:9006, server
2/14/19 4:52:01 PM org.jgroups.logging.JDKLogImpl trace FINER: prod-master01-18103: sending msg to prod-master02-55269, src=prod-master01-18103, headers are TCPPING: [type=GET_MBRS_RSP], TCP: [cluster_name=efm]
2/14/19 4:52:01 PM org.jgroups.logging.JDKLogImpl trace FINER: prod-master01-18103: sending 1 msgs (107 bytes (0.17 of max_bundle_size) to 1 dests(s): [efm:prod-master02-55269]
2/14/19 4:52:01 PM org.jgroups.logging.JDKLogImpl trace FINER: dest=10.138.98.42:9006 (110 bytes)
2/14/19 4:52:01 PM org.jgroups.logging.JDKLogImpl trace FINER: prod-master01-18103: received [dst: <null>, src: prod-master02-55269 (2 headers), size=0 bytes, flags=INTERNAL], headers are FD_ALL: heartbeat, TCP: [cluster_name=efm]
2/14/19 4:52:01 PM org.jgroups.logging.JDKLogImpl trace FINER: prod-master01-18103: sending msg to null, src=prod-master01-18103, headers are FD_ALL: heartbeat, TCP: [cluster_name=efm]
2/14/19 4:52:01 PM org.jgroups.logging.JDKLogImpl trace FINER: prod-master01-18103: looping back message [dst: <null>, src: prod-master01-18103 (2 headers), size=0 bytes, flags=INTERNAL]
2/14/19 4:52:01 PM org.jgroups.logging.JDKLogImpl trace FINER: prod-master01-18103: sending 1 msgs (35 bytes (0.05 of max_bundle_size) to 1 dests(s): [efm]

 

 

NOTE:-

 Please suggested me soon..

Tags (1)
7 REPLIES 7
EDB Team Member

Re: Root cause of Master down.

Hi Din2pg,

 

As per your database logs, the postmaster.pid file which is the main process file was removed. The database gets the signal to immediately shut down if the pid file is not present and that is reason why the datbase went down.  This is expected behaviour.

 

Level 3 Adventurer

Re: Root cause of Master down.

Hi @chaitalirs 

 

               it's ok.i understand.but why the postmaster.pid file removed..this server more that 8 month running fine..but yesterday only went down..

Can u mention..What are the scenarios postmaster.pid file removed..if any other reason..any  others???

 

Awaiting your Replay

 

Thanks

EDB Team Member

Re: Root cause of Master down.

The postmanter.pid files is removed when we have a safe shut-down. There is nothing else which automatically removes the postmaster.pid.

 

It might be accidently removed by someone at your end, this is something which you can investigate at your end.

Level 3 Adventurer

Re: Root cause of Master down.

Hi @chaitalirs 

              i want to know something from your side..

Without human activity..is it any posible to removed the file automatic or any crash happend like that.

Can u explain clearly.

 

Thanks

EDB Team Member

Re: Root cause of Master down.

As far as I know, apart from human activity there is nothing which can cause removal of postmaster.pid file.

 

There might be a reason where the postmaster.pid is not readable i.e the permissions on data directory changed.

Level 3 Adventurer

Re: Root cause of Master down.

Hi @chaitalirs 

             Thanks for your Prompt Reply..

Here 

1.No permission changed

2.We cant find who deleted the file in Database log,efmlogs.pem logs,os logs.

NOTE:-

           FYI

                       that shutdown time only one applicatio user is running and pem agent monitering also..once down the server vip enabled all other connection trasfreed to 

slave---newly convert master..but the suspect ip till now not connect to master server..Any idea of this post..

 

 

Example

at 2019-02-14 16:53:59 SGT - from Application [[unknown]] on Server[10.138.181.29(55248)] using User [[unknown]] for dbname [[unknown]] process id [28500] returned SQLSTATE [00000]: LOG: connection received: host=10.138.181.29 port=55248
at 2019-02-14 16:53:59 SGT - from Application [[unknown]] on Server[10.138.181.29(55248)] using User [ap_user] for dbname [ap_prd] process id [28500] returned SQLSTATE [00000]: LOG: connection authorized: user=ap_user database=ap_prd
at 2019-02-14 16:53:59 SGT - from Application [PostgreSQL JDBC Driver] on Server[10.138.181.29(55246)] using User [a_ser] for dbname [ap_prd] process id [28497] returned SQLSTATE [00000]: LOG: disconnection: session time: 0:00:03.049 user=ap_user database=ap_prd host=10.138.181.29 port=55246
at 2019-02-14 16:53:59 SGT - from Application [[unknown]] on Server[10.138.181.29(55250)] using User [[unknown]] for dbname [[unknown]] process id [28501] returned SQLSTATE [00000]: LOG: connection received: host=10.138.181.29 port=55250
at 2019-02-14 16:53:59 SGT - from Application [[unknown]] on Server[10.138.181.29(55250)] using User [ap_user] for dbname [ap_prd] process id [28501] returned SQLSTATE [00000]: LOG: connection authorized: user=ap_user database=ap_prd
at 2019-02-14 16:54:00 SGT - from Application [[unknown]] on Server[10.138.181.41(56186)] using User [[unknown]] for dbname [[unknown]] process id [28507] returned SQLSTATE [00000]: LOG: connection received: host=10.138.181.41 port=56186
at 2019-02-14 16:54:00 SGT - from Application [] on Server[] using User [] for dbname [] process id [19317] returned SQLSTATE [58P01]: LOG: could not open file "postmaster.pid": No such file or directory
at 2019-02-14 16:54:00 SGT - from Application [] on Server[] using User [] for dbname [] process id [19317] returned SQLSTATE [00000]: LOG: performing immediate shutdown because data directory lock file is invalid
at 2019-02-14 16:54:00 SGT - from Application [] on Server[] using User [] for dbname [] process id [19317] returned SQLSTATE [00000]: LOG: received immediate shutdown request

Highlighted
EDB Team Member

Re: Root cause of Master down.


@Din2pg wrote:

EFM logs
2/14/19 4:51:53 PM org.jgroups.logging.JDKLogImpl trace FINER: prod-master01-18103: sending 1 msgs (35 bytes (0.05 of max_bundle_size) to 1 dests(s): [efm]
2/14/19 4:51:53 PM org.jgroups.logging.JDKLogImpl trace FINER: dest=10.138.98.43:9006 (38 bytes)
2/14/19 4:51:53 PM org.jgroups.logging.JDKLogImpl trace FINER: dest=10.138.98.42:9006 (38 bytes)

 

 

This isn't related to your real problem, but I'd like to point out that you shouldn't run EFM with the logging turned up like that. Having the jgroups logging at that level is not a supported configuration. As the property description says, you should only turn up logging if debugging a specific problem.

 

Bobby