Saturday 5 December 2009

Measurement Error Trap In Trace File (event 10046)


Some time ago I had an interesting case which I can use to clearly describe how one can be caught in a measurement error trap.

But let us start at the beginning with this response time analysis:


Response Time Component Time % Elap AvgEla
---------------------------------------- ----------- ------- ---------
CPU service 3934.97s 48.39% 0.000716
un-accounted for time 1363.01s 16.76%
db file sequential read 1122.00s 13.80% 0.032253
gc buffer busy 451.73s 5.56% 0.011746
log buffer space 451.64s 5.55% 0.123974
buffer busy waits 176.79s 2.17% 0.029579
gc cr block 2-way 156.49s 1.92% 0.003287
gc cr grant 2-way 100.20s 1.23% 0.006541
latch: cache buffers chains 98.92s 1.22% 0.005708
gc current grant 2-way 69.68s 0.86% 0.006728
latch: library cache 30.10s 0.37% 0.010030
row cache lock 28.95s 0.36% 0.018727
gc current block 2-way 26.72s 0.33% 0.003828
gc cr block busy 19.35s 0.24% 0.006802
gc current grant busy 15.30s 0.19% 0.004999
latch: row cache objects 14.28s 0.18% 0.006165
gc cr block 3-way 11.73s 0.14% 0.002952
gc current block 3-way 11.34s 0.14% 0.003440
log file sync 10.71s 0.13% 0.315066
enq: SQ - contention 9.14s 0.11% 0.060911
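
The wait portion of a profile like this is produced by aggregating the WAIT lines of the raw 10046 trace files (the CPU service figure comes from the c= fields of the PARSE/EXEC/FETCH lines). As a minimal sketch of the idea in Python, assuming the 10g trace format where ela= is reported in microseconds and a hypothetical trace file name:

import re
from collections import defaultdict

# In a 10g+ raw trace, wait lines look like:
#   WAIT #3: nam='db file sequential read' ela= 32253 file#=6 block#=... tim=...
# where ela= is the elapsed wait in microseconds.
wait_line = re.compile(r"WAIT #\d+: nam='([^']+)' ela= (\d+)")

totals = defaultdict(lambda: [0.0, 0])        # event -> [total seconds, count]

with open("orcl1_ora_12345.trc") as trace:    # hypothetical trace file name
    for line in trace:
        m = wait_line.match(line)
        if m:
            event, ela_us = m.group(1), int(m.group(2))
            totals[event][0] += ela_us / 1e6
            totals[event][1] += 1

for event, (secs, count) in sorted(totals.items(), key=lambda kv: -kv[1][0]):
    print(f"{event:40s} {secs:10.2f}s  avg {secs / count:.6f}s")

The AvgEla column in the profile above is simply the total time divided by the number of waits for each event, in seconds.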

My first thought was that there was an I/O problem, as the average single block I/O took 32 milliseconds.
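
That figure follows directly from the profile above: 1122.00 s of db file sequential read time at an average of 0.032253 s per wait corresponds to roughly 1122.00 / 0.032253 ≈ 34,800 single block reads, each apparently taking 32 ms.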

After digging for a while, I produced the following graphs, which represent the I/O timing.


[Graphs: single block read times over time for Instance 1 and for the other two instances]

Let's first look at some facts:
- 3-node RAC
- Same storage shared by all instances
- The single block read time for Instance 1 was substantially different from the read times of the other instances during off hours
- 5 batch jobs were running during off hours
- The timings for Instance 1 were obviously not correct

Since I was performing the analysis on a 3-node RAC, I could check the timings on the other two nodes, and as you can see from the second graph, they were quite different.

Now for the explanation.

The output from strace showed:


gettimeofday({1159440978, 931945}, NULL) = 0
pread(14, "\6\242\0\0\375\23\0\2+\254.\0\0\0\1\6\0054\0\0\1\0\5\0"..., 8192, 455057408) = 8192
gettimeofday({1159440978, 944159}, NULL) = 0
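
Such output can be captured by attaching strace to the Oracle shadow process, e.g. strace -p <spid>; the -T option, which appends the time actually spent inside each system call, makes a handy cross-check, because that figure does not include any run-queue delay after the call returns. In the sample above, the two gettimeofday() calls bracket the pread() by 944159 - 931945 = 12214 microseconds, so this particular read was reported as a wait of about 12.2 ms.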



Oracle records the time just before performing a system call (here pread). When the system call completes, Oracle records the current time again and reports the difference as the wait time. Unfortunately, due to the high CPU load the process had to wait quite a while at the operating system level to get back onto a CPU before it could read the current time, and therefore the reported elapsed time was heavily exaggerated.
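
To make the mechanism concrete, here is a minimal Python sketch of the same gettimeofday/pread/gettimeofday pattern (the datafile path is hypothetical; the offset is taken from the strace output above):

import os
import time

BLOCK_SIZE = 8192
OFFSET = 455057408                      # offset from the strace output above

fd = os.open("/u01/oradata/users01.dbf", os.O_RDONLY)  # hypothetical datafile

t0 = time.monotonic()                   # Oracle: gettimeofday() before the call
buf = os.pread(fd, BLOCK_SIZE, OFFSET)  # Oracle: pread() of one 8 KB block
t1 = time.monotonic()                   # Oracle: gettimeofday() after the call
os.close(fd)

# t1 - t0 is what gets reported as the wait time, but it also includes any
# time the process spent in the run queue waiting to get back onto a CPU
# after the read itself had already completed.
print(f"reported wait: {(t1 - t0) * 1000:.3f} ms for {len(buf)} bytes")

On an idle machine the two timestamps bracket the read tightly; on a CPU-saturated machine the second timestamp may be taken many milliseconds after the read has actually completed, and all of that scheduling delay is charged to the I/O wait event.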


Conclusion: Waiting in the run queue for CPU inflates all wait times recorded by the process.

It is always good practice to confirm findings with a different method. In this case I could also use operating system tools (for example iostat or sar) to measure the I/O timings. The same situation can occur with any other kind of wait event, but there is not always a possibility to check it independently.

Finally, there was not really an I/O problem; the system was simply quite CPU bound. Of course, an average single block I/O time above 5-10 ms indicates that we were probably also experiencing an I/O bottleneck. The customer later replaced the disk storage with faster storage, together with the hardware used for the RAC.

2 comments:

  1. "wainting in runque for CPU" is also called CPU starving.

    ReplyDelete
  2. Bernard,
    Thanks for the nice suggestion. Right now I'm trying to lose an additional 10 kg, so I really know what starving means :-)

    ReplyDelete