
Cause of error [SQL STATE: XX000]PANIC: stuck spinlock (0x7fa36e621944) detected at bufmgr.c:1242

Adventurer


Hi

I found the following error on our PostgreSQL server:

[2019-06-05 04:08:18 CEST]->->[][PID 90543][SQL STATE: XX000]PANIC: stuck spinlock (0x7fa36e621944) detected at bufmgr.c:1242

Shortly afterwards the database stopped:

[2019-06-05 04:11:02 CEST]->->[][PID 24131][SQL STATE: 00000]DEBUG: server process (PID 9192) exited with exit code 1
[2019-06-05 04:11:19 CEST]->->[][PID 24131][SQL STATE: 00000]LOG: checkpointer process (PID 90543) was terminated by signal 6: Aborted
[2019-06-05 04:11:19 CEST]->->[][PID 24131][SQL STATE: 00000]LOG: terminating any other active server processes
[2019-06-05 04:11:19 CEST]->->[][PID 24131][SQL STATE: 00000]DEBUG: sending SIGQUIT to process 90547

 

Does anybody know what the error "stuck spinlock (0x7fa36e621944) detected at bufmgr.c:1242" means?

Thanks

5 REPLIES
EDB Team Member

Re: Cause of error [SQL STATE: XX000]PANIC: stuck spinlock (0x7fa36e621944) detected at bufmgr.c:124

Hi jsoler,

 

Can you share details of your environment: the PostgreSQL version, your postgresql.conf, the exact steps that led to this issue, and your OS version?

 

These details will help us investigate the issue in more detail.

 

Regards,

Dhananjay

Adventurer

Re: Cause of error [SQL STATE: XX000]PANIC: stuck spinlock (0x7fa36e621944) detected at bufmgr.c:124

The production cluster consists of two database servers controlled by Heartbeat, using shared storage managed by DRBD and a virtual IP that is also managed by Heartbeat.

PostgreSQL 9.2.9 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-52), 64-bit

CentOS release 6.5 (Final)

This error showed up twice, around 04:00, when there was almost no activity on the database server: the main application was running, but with low activity. Every night cron runs several backup jobs: 00:30 filesystem copy of the database, 01:30 copy of the database logs, 02:30 logical backup of the main database, 03:45 copy of archived logs, 03:55 purge of old backup copies, 04:00 copy of backups to a remote site via NFS.

 

Sar info:

Linux 2.6.32-431.el6.x86_64 (dninbbd1.policia.es) 05/06/19 _x86_64_ (8 CPU)

00:00:01 CPU %user %nice %system %iowait %steal %idle
00:10:01 all 12,31 0,00 2,31 1,88 0,00 83,50
00:20:01 all 13,72 0,00 2,23 1,42 0,00 82,64
00:30:01 all 13,73 0,00 2,23 0,77 0,00 83,27
00:40:01 all 13,78 0,00 2,28 1,82 0,00 82,12
00:50:01 all 13,60 0,00 2,35 2,14 0,00 81,91
01:00:01 all 13,96 0,00 2,19 1,53 0,00 82,32
01:10:01 all 12,24 0,00 2,23 1,77 0,00 83,76
01:20:01 all 13,92 0,00 2,20 1,71 0,00 82,17
01:30:01 all 13,78 0,00 2,31 1,95 0,00 81,96
01:40:01 all 13,79 0,00 3,64 4,31 0,00 78,26
01:50:01 all 14,06 0,00 2,24 1,44 0,00 82,26
02:00:01 all 14,12 0,00 2,17 1,48 0,00 82,23
02:10:01 all 12,50 0,00 2,31 2,33 0,00 82,87
02:20:01 all 13,99 0,00 2,43 2,70 0,00 80,88
02:30:01 all 14,00 0,00 2,23 2,36 0,00 81,41
02:40:01 all 17,18 0,00 4,45 10,55 0,00 67,82
02:50:01 all 25,86 0,00 2,82 2,20 0,00 69,12
03:00:01 all 26,06 0,00 2,66 2,49 0,00 68,79
03:10:01 all 25,39 0,00 3,43 3,82 0,00 67,37
03:20:01 all 14,16 0,00 2,45 3,22 0,00 80,17
03:30:01 all 13,96 0,00 2,43 3,21 0,00 80,40
03:40:01 all 14,15 0,01 2,46 2,24 0,00 81,14
03:50:01 all 13,99 0,00 2,36 2,30 0,00 81,35
04:00:01 all 13,99 0,00 2,42 2,34 0,00 81,25
04:10:43 all 0,99 0,00 1,46 1,07 0,00 96,48
04:20:01 all 0,43 0,00 1,60 15,64 0,00 82,33
04:30:01 all 0,01 0,00 0,04 0,00 0,00 99,95
04:40:01 all 0,01 0,00 0,03 0,00 0,00 99,96
04:50:01 all 0,01 0,00 0,04 0,00 0,00 99,95
05:00:01 all 0,01 0,00 0,04 0,00 0,00 99,95
05:10:01 all 0,01 0,00 0,04 0,00 0,00 99,95
05:20:01 all 0,01 0,00 0,04 0,00 0,00 99,95
05:30:01 all 0,01 0,00 0,04 0,00 0,00 99,96
05:40:01 all 0,01 0,00 0,04 0,00 0,00 99,95
05:50:02 all 0,01 0,00 0,03 0,00 0,00 99,96
06:00:01 all 0,01 0,00 0,04 0,00 0,00 99,95

 

At 04:20 I/O activity increased because the backups were being transferred to NFS, but I don't see anything else. Do you know what [SQL STATE: XX000]PANIC: stuck spinlock (0x7fa36e621944) detected at bufmgr.c:1242 means?
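
One quick way to scan sar figures like these for I/O spikes is to parse them. The Python sketch below assumes the eight-column layout and decimal commas shown above. The function names and the 10% threshold are arbitrary choices for illustration, and only a few rows from the output above are included.

```python
def parse_sar_cpu(lines):
    """Parse `sar` CPU rows that use a decimal comma into (time, iowait) pairs."""
    rows = []
    for line in lines:
        fields = line.split()
        # Expected layout: time, CPU, %user, %nice, %system, %iowait, %steal, %idle
        if len(fields) != 8 or fields[1] != "all":
            continue  # skips the header row and anything unexpected
        iowait = float(fields[5].replace(",", "."))
        rows.append((fields[0], iowait))
    return rows


def high_iowait(rows, threshold=10.0):
    """Return the sample times whose %iowait is at or above `threshold`."""
    return [t for t, iowait in rows if iowait >= threshold]


# A few rows copied from the sar output above.
SAMPLE = """\
00:00:01 CPU %user %nice %system %iowait %steal %idle
02:30:01 all 14,00 0,00 2,23 2,36 0,00 81,41
02:40:01 all 17,18 0,00 4,45 10,55 0,00 67,82
04:00:01 all 13,99 0,00 2,42 2,34 0,00 81,25
04:10:43 all 0,99 0,00 1,46 1,07 0,00 96,48
04:20:01 all 0,43 0,00 1,60 15,64 0,00 82,33
""".splitlines()

print(high_iowait(parse_sar_cpu(SAMPLE)))
```

With these rows, only the 02:40 and 04:20 samples cross the threshold. Those times follow the 02:30 logical backup and the 04:00 copy to NFS.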

 

EDB Team Member

Re: Cause of error [SQL STATE: XX000]PANIC: stuck spinlock (0x7fa36e621944) detected at bufmgr.c:124

Hi @jsoler,

 

Please note that the "PANIC: stuck spinlock" error was reported as a bug and was resolved in PostgreSQL 9.3.3.

 

Please see the PostgreSQL 9.3.3 release notes for details:
https://www.postgresql.org/docs/9.3/release-9-3-3.html

 

As you are using NFS, we have some evidence that these crashes are storage-related.

 

This is especially likely if there were network hiccups. NFS is notorious for leaving processes stuck in the kernel for hours when problems occur.

 

If someone stopped a process with kill -STOP or a debugger just as it was in the middle of acquiring the LWLock, that could also explain the stuck spinlock.
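
To illustrate the general idea behind the message: a spinlock waiter retries for a bounded time and gives up if the holder never releases the lock. The Python sketch below is only a toy model, not PostgreSQL's actual implementation. The function names, retry count and delay are assumptions chosen for illustration.

```python
import threading
import time


class StuckSpinlockError(RuntimeError):
    """Raised when the lock could not be obtained within the retry budget."""


def spin_acquire(lock, max_attempts=1000, delay_s=0.001):
    """Try to take `lock` without blocking, sleeping briefly between attempts.

    Gives up with StuckSpinlockError once `max_attempts` is exhausted,
    which is the toy equivalent of the PANIC in the log.
    """
    for _ in range(max_attempts):
        if lock.acquire(blocking=False):
            return
        time.sleep(delay_s)
    raise StuckSpinlockError(f"stuck spinlock after {max_attempts} attempts")


def demo():
    lock = threading.Lock()
    lock.acquire()  # simulate a holder that was frozen and never releases
    try:
        spin_acquire(lock, max_attempts=20, delay_s=0.001)
    except StuckSpinlockError as exc:
        return str(exc)
    return "acquired"
```

In this model, a holder that is frozen (for example by kill -STOP) while holding the lock looks to every waiter exactly like a lock that is never released, so the waiter eventually gives up.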

 

The Postgres version you are using is old. It looks like support for Postgres 9.2 ended in September 2017 [URL].

 

It is recommended that all users run the latest available minor release for whatever major version is in use.

Adventurer

Re: Cause of error [SQL STATE: XX000]PANIC: stuck spinlock (0x7fa36e621944) detected at bufmgr.c:124

Thanks. We know that this is a pretty old Postgres version and we are willing to upgrade it, but there is some bureaucracy to get through first.

I think you may have understood that our Postgres is located on an NFS filesystem, but it isn't. PG runs on top of an ext4 filesystem on DRBD storage (low-level replicated storage). NFS is just the target directory where backup files are saved, so I am not sure why you think this error is related to NFS. I have checked, and no processes were stopped by kill or a debugger.

Do you know exactly what this means?

"Prevent timeout interrupts from taking control away from mainline code unless ImmediateInterruptOK is set (Andres Freund, Tom Lane)

This is a serious issue for any application making use of statement timeouts, as it could cause all manner of strange failures after a timeout occurred. We have seen reports of "stuck" spinlocks, ERRORs being unexpectedly promoted to PANICs, unkillable backends, and other misbehaviors."

 

Regards

 

EDB Team Member

Re: Cause of error [SQL STATE: XX000]PANIC: stuck spinlock (0x7fa36e621944) detected at bufmgr.c:124

Hi jsoler,

 

Please find an explanation of the statement below.

 

If a transaction is aborted by a timeout interrupt at an arbitrary point, the transaction cleanup may not occur properly, because the mainline code is prevented from completing what it was doing.

So the fix protects against this scenario: the interrupt is not allowed to take control in cases where execution ought to return to the mainline code, but it is still allowed where an interruptible operation is in progress (that is, when ImmediateInterruptOK is set).

A statement timeout (statement_timeout) can cause such scenarios. They might also be caused by lock_timeout.
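
The idea of deferring a timeout until the code reaches a safe point can be sketched as follows. This is a toy Python model and not PostgreSQL code. The class, its methods and the messages are hypothetical names for illustration, and the timer firing is simulated by a direct method call rather than a real signal.

```python
class InterruptGate:
    """Toy model of deferring a timeout until the code is at a safe point."""

    def __init__(self):
        self.immediate_interrupt_ok = False  # True only during an interruptible wait
        self.pending = False                 # a timeout fired but was deferred

    def on_timeout(self):
        """Called when the timer fires (stands in for a signal handler)."""
        if self.immediate_interrupt_ok:
            raise TimeoutError("cancelled while waiting")
        self.pending = True  # remember it; do not disturb the mainline code

    def check_for_interrupts(self):
        """Called by mainline code at points where aborting is safe."""
        if self.pending:
            self.pending = False
            raise TimeoutError("cancelled at a safe point")


def run_critical_section(gate, log):
    log.append("lock taken")
    gate.on_timeout()            # the timer fires in the middle of the section
    log.append("lock released")  # still runs, so nothing is left locked
    gate.check_for_interrupts()  # the deferred timeout is honoured here
```

In this model the timeout still cancels the work, but only after the critical section has released what it was holding. Without the deferral, the timeout could fire between "lock taken" and "lock released" and leave the lock held, which is the kind of situation in which other processes would see a stuck lock.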

 

Hope this clarifies the raised concerns.