A good DBA is someone who knows how to monitor databases effectively, and when it comes to a production or business-critical database, the job becomes even more crucial.
Failing to monitor your databases and systems effectively may lead to a severe catastrophe: application slowness, database crashes, or downtime, all of which eventually lead to business loss and unhappy customers. I am sure we want to avoid this kind of situation in every possible way.
To address this, some of the basic questions we need to answer are:
Should it be different for your production/critical and non-critical databases?
Yes, it should be. Monitoring of a production or mission-critical database should be more detailed and more frequent than that of a non-critical (development or testing) database. For example, high load on production might ring an alarm bell, whereas in a non-production environment it may not be critical.
As database experts, we divide our monitoring into different aspects:
o Database down
o Streaming replication
o Postgres backend connections [max_connections]
o Bloat report
o Transaction wraparound
o Long-running queries
o Waiting queries
o Number of WAL files pending [.ready files]
o Disk space alerts
o Memory usage
As part of our best practice, we further divide our monitoring in terms of threshold values (when an alert should trigger) and frequency of execution.
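One way to organize checks by threshold and frequency is a simple per-check table. This is only an illustrative sketch, not PEM's configuration format; all check names, threshold values, and frequencies below are hypothetical examples:

```python
# Illustrative sketch: each monitored aspect gets warning/critical thresholds
# and its own polling frequency. All names and numbers here are hypothetical.
CHECKS = {
    "backend_connections": {"warning": 800, "critical": 900, "frequency_min": 1},
    "disk_usage_pct":      {"warning": 80,  "critical": 95,  "frequency_min": 5},
    "wal_ready_files":     {"warning": 50,  "critical": 200, "frequency_min": 5},
}

def evaluate(check: str, observed: float) -> str:
    """Return the alert level for an observed metric value."""
    t = CHECKS[check]
    if observed >= t["critical"]:
        return "CRITICAL"
    if observed >= t["warning"]:
        return "WARNING"
    return "OK"

print(evaluate("backend_connections", 850))  # WARNING
print(evaluate("disk_usage_pct", 50))        # OK
```

A production database would typically get lower thresholds and shorter frequencies than a development one, which is how the prod/non-prod distinction above can be expressed in one table.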
Should it be proactive or reactive monitoring?
The answer is... both!
Certain aspects of monitoring combine a reactive response with a proactive approach. It might sound a little confusing, but the scenario below should help us understand it better:
N = monitoring frequency of 10 minutes (Reactive Approach)
12:00 AM – First execution of monitoring: Connections to DB = 700
12:01 AM – Connections to DB = 700
12:02 AM – Connections to DB = 700
12:03 AM – Connections to DB = 750
12:04 AM – Connections to DB = 750
12:05 AM – Connections to DB = 800
12:06 AM – Connections to DB = 850
12:07 AM – Connections to DB = 900
12:08 AM – Connections to DB = 950
12:09 AM – Connections to DB = 970
12:10 AM – Second execution of monitoring: Connections to DB = 980 (Alert triggered, as connections >= threshold)
12:11 AM – Connections to DB = 990
12:12 AM – Connections to DB = 1000
12:13 AM – Database out-of-memory error (system crashed)
Result: The DBA didn't get enough opportunity to react to the alert notification.
Risk factor: Critical
N = monitoring frequency of 1 minute (Proactive Approach)
12:00 AM – First execution: Connections to DB = 700
12:01 AM – Second execution: Connections to DB = 700
12:02 AM – Third execution: Connections to DB = 700
12:03 AM – Fourth execution: Connections to DB = 750
12:04 AM – Fifth execution: Connections to DB = 750
12:05 AM – Sixth execution: Connections to DB = 800 (Warning alert raised, as connections >= warning threshold)
DBA: Logs into the system and starts analyzing the issue.
12:06 AM – Seventh execution: Connections to DB = 850 (Warning alert raised, as connections >= warning threshold)
12:07 AM – Eighth execution: Connections to DB = 900 (Alert raised: Critical)
12:08 AM – Ninth execution: Connections to DB = 950 (Alert raised: Critical)
12:09 AM – Tenth execution: Connections to DB = 970 (Alert raised: Critical)
DBA: Fixes the problem.
12:10 AM – Eleventh execution: Connections to DB = 800 (Warning alert raised, as connections >= warning threshold)
12:11 AM – Twelfth execution: Connections to DB = 700
12:12 AM – Thirteenth execution: Connections to DB = 750
12:13 AM – Database out-of-memory error: never happened.
Result: The DBA got enough opportunity to react to the alert notifications.
Risk factor: Medium
We have reduced the risk to the overall health of the database and to the SLA by doing smart, proactive monitoring before bad things start happening to the database.
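The two timelines above can be simulated directly. Using the same connection ramp and the thresholds from the scenario (warning at 800), a 1-minute poll raises its first alert at minute 5, while a 10-minute poll sees nothing until minute 10, by which point the crash is only minutes away:

```python
# Connection counts at minutes 0..10, taken from the scenario above.
ramp = [700, 700, 700, 750, 750, 800, 850, 900, 950, 970, 980]
WARNING = 800  # warning threshold from the scenario

def first_alert_minute(interval_min):
    """Minute of the first poll whose reading meets the warning threshold,
    when polling every interval_min minutes; None if no poll triggers."""
    for minute in range(0, len(ramp), interval_min):
        if ramp[minute] >= WARNING:
            return minute
    return None

print(first_alert_minute(10))  # reactive: first alert at minute 10
print(first_alert_minute(1))   # proactive: first alert at minute 5
```

The five extra minutes of lead time are exactly what let the DBA log in, diagnose, and fix the issue before the out-of-memory error occurred.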
A classic example of reactive monitoring would be a database-down issue. Many times we cannot predict issues with the database (such as a hardware failure), even with all the proactive monitoring in place; however, getting a notification at the right time saves a lot of time in reacting to the situation and fixing the problem.
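A minimal database-down check can be as simple as probing whether the Postgres port still accepts TCP connections. This is only a sketch; the host and port below are assumptions, and a real check would also attempt an actual database login rather than just a socket connect:

```python
import socket

def port_reachable(host, port, timeout=0.5):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False

# Hypothetical usage: 192.0.2.1 is a reserved test address, so this probe
# fails and the reactive alert fires.
if not port_reachable("192.0.2.1", 5432):
    print("CRITICAL: database unreachable")
```

Run frequently, a probe like this cannot predict a hardware failure, but it turns an outage into an immediate notification instead of a customer report.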
Should it be aligned with your business goals or SLA?
Understanding business goals/SLAs and aligning them with monitoring is very important. In operations, we deal with many customers, each with unique business goals and infrastructure. For one customer it may be important to track all the long-running queries that might be impacting application performance, while for another it may not be so critical.
Does your monitoring system give you the power of decision-making based upon your SLA?
Don’t read further if:
To solve all of the above challenges for the operations DBA, Postgres Enterprise Manager (PEM) provides a unique solution to...
From a DBA's and operations perspective, it brings value in several ways:
EDB Postgres Enterprise Manager (PEM) collects performance and status data from the databases it monitors and from the operating system, along with data resulting from jobs it executes on the database host. The data is displayed in dashboards and analyzed for alert conditions. Alerts can be relayed to operators or to other enterprise-level management systems. PEM also allows DBAs and developers to work with the databases directly: execute DDL and DML commands, tune queries, manage indexes, run backups, and deploy software updates.
You will be hearing from us soon on "Are you running your 24x7 database operations smoothly?"
Also, please drop me a message if you have any further questions or want to add something on the topic. I will be glad to hear from you.