cancel
Showing results for 
Search instead for 
Did you mean: 

Determining Single-Master XDB Replication Lag

Moderator

I recently had a few customers ask me about using a query for monitoring SMR XDB lag, so they can keep tabs on replication progress and set up notifications.

 

Unfortunately, because of the way WAL works, it's a bit tedious to try to quantify lag in terms of rows (You'd have to take each WAL file, scan, and filter for unrelated WAL entries, based on sync_ids in the XDB control database).  Additionally, tracking WAL-based SMR replication lag cannot be done with an earlier solution.  Fortunately, it is possible to calculate the time lag with a query:

 

 

WITH 
src_db (current_xid_commit_timestamp) AS
  (SELECT timestamp FROM pg_last_committed_xact()),
-- last replicated timestamp for target databases
target_db (target_db_id, last_repl_xid_timestamp) AS
  (SELECT sub_db_id AS target_db_id, MAX(last_repl_xid_timestamp) AS last_repl_xid_timestamp
     FROM _edb_replicator_pub.rrep_txset rt JOIN _edb_replicator_sub.xdb_subscriptions xs ON xs.sub_id=rt.sub_id WHERE status = 'C'
     GROUP BY sub_db_id),
-- identity of target databases
target_db_identity (target_db_id, target_db_identity) AS
  (SELECT sub_db_id, db_host || ':' || db_port || ':' || db_name
     FROM _edb_replicator_sub.xdb_sub_database)
-- replication lag for each of the target databases 
SELECT c.target_db_id, target_db_identity, AGE(current_xid_commit_timestamp, last_repl_xid_timestamp) repl_time_lag
  FROM src_db a, target_db b, target_db_identity c WHERE b.target_db_id = c.target_db_id 
  ORDER BY target_db_id;

By examining the replication timestamps, we are able to get a good handle on the time delay between the associated servers.  Additionally, with some more tweaking, we can use a similar query to calculate time lag for Multi-Master Replication (MMR) clusters:

 

 

 

WITH 
src_db (current_xid_commit_timestamp) AS
  (SELECT timestamp FROM pg_last_committed_xact()),
-- last replicated timestamp for target databases
target_db (target_db_id, last_repl_xid_timestamp) AS
  (SELECT db_id AS target_db_id, MAX(last_repl_xid_timestamp) AS last_repl_xid_timestamp
     FROM _edb_replicator_pub.rrep_mmr_txset WHERE status = 'C'
     GROUP BY db_id),
-- identity of target databases
target_db_identity (target_db_id, target_db_identity) AS
  (SELECT pub_db_id, db_host || ':' || db_port || ':' || db_name
     FROM _edb_replicator_pub.xdb_pub_database)
-- replication lag for each of the target databases 
SELECT c.target_db_id, target_db_identity, AGE(current_xid_commit_timestamp, last_repl_xid_timestamp) repl_time_lag
  FROM src_db a, target_db b, target_db_identity c WHERE b.target_db_id = c.target_db_id 
  ORDER BY target_db_id;

 

 

With these queries, DBAs and sysadmins can easily set up notification and monitoring systems for XDB performance.

6 Comments
Adventurer

Hi @richyen

 

The provided query has been a lot helpful .

 

We had  MMR setup for test and prod databases .

 

But on the test database when there was no activity(inserts/updates) on the MDN  master ,the monitoring query lag increases even though synchronization happens with 0 TX count .

 

Is there any other way , to monitor exact lag ?

ed4.PNGed2.PNGed3.png

 

Thanks

Ashok

Moderator

Hi @ashokkoti, thanks for the comment and inqury.  As far as I understand your question, you seem to be concerned that while there is no database traffic generated by your application, you are wondering why the replication lag time is increasing.  I believe this is due to the database traffic generated by XDB, which logs things like when sync events happen, etc.  This traffic will update pg_last_committed_xact(), which is a part of the lag query.  So while pg_last_committed_xact() increases and no new traffic is generated in _edb_replicator_pub.rrep_mmr_txset, you will see a growing difference in the timestamps, making it look like there is some increasing lag.  Under normal application load, I believe the query will give you more "accurate" results, with respect to the application/DBA perspective.

 

Hope that helps,

--Richard

Adventurer

Hi Richyen,

 

Thanks for your blog. I have one internal test MMR setup where I am performing table insert and then stopping the xDB service to monitor the lag. I used your last query. The values we are receiving are in negative (eg. -00:29:57.166467)!

 

On MDN node, when there is a long running data insertion, the lag is visible in negative (like above value) only on MDN node, while on non-MDN node we see no value (i.e. the value is NULL). When I do long running data insert on non-MDN node then on lag on non-MDN node is again negative (value around -00:29:57.166467).

 

If I stop the xdb service, the value increases (i.e. -00:29 to -00:27 to -00:24 to -00:22 and so on...). This happens on both the nodes. When i resume the xDB service, the value decreases (i.e. -00:22 to -00:24 to -00:27 until it reaches -00:29:59). That is it. The value does not go below this!

 

Would you have any idea why this query would be giving such weird output ?

Moderator

Hi Pratmeht,

 

Regarding your question about negative lag representation, I think you need to keep in mind that in X - Y = lag, X is the last committed transaction timestamp.  So in a long-running insert, the transaction doesn't get committed until all the rows are inserted and your script/query has issued a COMMIT.  Therefore, while the insert is happening, X could remain static, while Y increases (since replication is continuing, and the timestamps associated with Y are not necessarily related to the WAL XID (which is where X's timestamp comes from).

 

I don't quite know what your second question pertains to (are you stopping the xdb service while the long-running insert is continuing?), but I suspect that the negative values are probably related to the above explanation.

 

Hope that helps!

--Richard

Adventurer

Hi 

 

 

Moderator

Hi @pratmeht, could you share with me what you are unable to interpret/understand?  Also, are you using WAL-based or Trigger-based replication?