可以看到的确有会话(1497)正在以共享方式持有idn=7fff00005336的mutex。往前查看,可以看到会话信息:
(session) sid: 1497 ser: 29349 trans: (nil), creator: 0x45cccc3f0
flags: (0x51) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
flags2: (0x408) -/-
DID: , short-term DID:
txn branch: (nil)
oct: 0, prv: 0, sql: (nil), psql: (nil), user: 0/SYS
ksuxds FALSE at location: 0
service name: SYS$BACKGROUND
Current Wait Stack:
Not in wait; last wait ended 14005 min 23 sec ago
Wait State:
auto_close=0 flags=0x21 boundary=(nil)/-1
Session Wait History:
0: waited for 'db file sequential read'
file#=2, block#=1952c, blocks=1
wait_id=1171 seq_num=1174 snap_id=1
wait times: snap=0.018420 sec, exc=0.018420 sec, total=0.018420 sec
wait times: max=infinite
wait counts: calls=0 os=0
occurred after 0.000034 sec of elapsed time
flags: (0x51) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
flags2: (0x408) -/-
DID: , short-term DID:
txn branch: (nil)
oct: 0, prv: 0, sql: (nil), psql: (nil), user: 0/SYS
ksuxds FALSE at location: 0
service name: SYS$BACKGROUND
Current Wait Stack:
Not in wait; last wait ended 14005 min 23 sec ago
Wait State:
auto_close=0 flags=0x21 boundary=(nil)/-1
Session Wait History:
0: waited for 'db file sequential read'
file#=2, block#=1952c, blocks=1
wait_id=1171 seq_num=1174 snap_id=1
wait times: snap=0.018420 sec, exc=0.018420 sec, total=0.018420 sec
wait times: max=infinite
wait counts: calls=0 os=0
occurred after 0.000034 sec of elapsed time
再往前可以找到进程信息:
PROCESS 38:
----------------------------------------
SO: 0x45cccc3f0, type: 2, owner: (nil), flag: INIT/-/-/0x00 if: 0x3 c: 0x3
proc=0x45cccc3f0, name=process, file=ksu.h LINE:10706, pg=0
(process) Oracle pid:38, ser:76, calls cur/top: 0x440d0ae48/0x43c5e5f88
flags : (0x2) SYSTEM
flags2: (0x30), flags3: (0x0)
int error: 0, call error: 0, sess error: 0, txn error 0
ksudlp FALSE at location: 0
(post info) last post received: 0 0 42
last post received-location: ksv2.h LINE:1603 ID:ksvpst: run
last process to post me: 450d4b4f0 1 2
last post sent: 0 0 151
last post sent-location: kcrf.h LINE:3095 ID:kcrfw_redo_gen: wake LGWR after redo copy
last process posted by me: 45ccbe3f0 1 6
(latch info) wait_event=0 bits=0
Process Group: DEFAULT, pseudo proc: 0x4510353e0
O/S info: user: oracle, term: UNKNOWN, ospid: 13549
OSD pid info: Unix process pid: 13549, image: oracle@xxxxx (m001)
----------------------------------------
SO: 0x45cccc3f0, type: 2, owner: (nil), flag: INIT/-/-/0x00 if: 0x3 c: 0x3
proc=0x45cccc3f0, name=process, file=ksu.h LINE:10706, pg=0
(process) Oracle pid:38, ser:76, calls cur/top: 0x440d0ae48/0x43c5e5f88
flags : (0x2) SYSTEM
flags2: (0x30), flags3: (0x0)
int error: 0, call error: 0, sess error: 0, txn error 0
ksudlp FALSE at location: 0
(post info) last post received: 0 0 42
last post received-location: ksv2.h LINE:1603 ID:ksvpst: run
last process to post me: 450d4b4f0 1 2
last post sent: 0 0 151
last post sent-location: kcrf.h LINE:3095 ID:kcrfw_redo_gen: wake LGWR after redo copy
last process posted by me: 45ccbe3f0 1 6
(latch info) wait_event=0 bits=0
Process Group: DEFAULT, pseudo proc: 0x4510353e0
O/S info: user: oracle, term: UNKNOWN, ospid: 13549
OSD pid info: Unix process pid: 13549, image: oracle@xxxxx (m001)
可以看到m001这个进程持有了mutex。但是会话没有在执行任何sql语句,而从会话的等待来看,last wait ended 14005 min 23 sec ago ,这个是不正常的,m001进程应该是出现了异常。
m001进程是什么,这是mmon的slave进程,其中一个作用就是执行awr snapshot。很显然这个进程要访问v$sqlstat,是需要持有Cursor Stat类型的mutex的。
至此就找到了所有等待的root holder。不过把这个结果告诉维护这个库的朋友时,告知服务器之前“死”了,已经重启了。重启后检查数据库,发现在故障期间很长一段时间内的确是没有产生awr snapshot。至于m001进程出了什么异常,为什么会出现异常,本文就不再详细讨论。实际上服务器重启之后也没有继续跟踪这一问题。
mutex从设计上来说,与latch相比有着很明显的优点,只是作为新生事务(相对于目前仍然在广泛使用的10g来说), 存在BUG的可能性比较大。深入了解mutex,可以帮助我们更好地诊断mutex引起的相关问题。