A customer's environment kept slowing down every afternoon for a few weeks in a row. Most of the time the slowdown would only last 10 or 20 minutes and then return to normal. However on certain days, performance would degrade and keep decaying until they were forced to shutdown and restart the database. Then the problem would go away until the next day. The claim was that nothing has changed recently to trigger this behavior.
So it was time to glance at some Statspack reports. During acceptable periods, top Timed Events were CPU time and some IO related events. First statspack when the problem starts shows "latch free" jumping to the lead. If the problem corrects itself, CPU returns to the top and everything pretty much looks like the BEFORE problem report again. But on the days where the problem did not correct itself, IO events jumped to the top followed by "latch free" and "buffer busy waits". Interesting.
Jumping to the latch stats section the first symptom was consistently high library cache latch. Then those waits would disappear and on the really bad days "cache buffers chains" jumped to the top. Okay, first symptom is usually a sign that there's some churn in the shared pool with old SQL getting aged out and new ones coming. And when IO becomes a problem (with high buffer cache activity, etc.), it implies some new IO heavy SQL got introduced.
However, the customer says the application transaction mix should not be changing. Top SQL shows all the same SQL with relatively same ratios of execution (all lower of course). But physical reads and buffer gets for one SQL has increased markedly -- execution plan must have changed! But customer swears that they are not loading massive amounts of data or gathering stats in the middle of some days, plus it seems like the old execution plan comes back after bouncing the database.
So what's the simplest story that would fit the symptoms?
Answering that required knowing or finding out a couple of bits of information about how Oracle works. "Are you guys are creating new user accounts every day at noon now and granting them privileges to the application tables?" Turns out the answer is positive and they had changed from running those scripts overnight to middle of the day. First part of mystery explained. Not many people know that grants on tables will invalidate dependent SQL in the shared pool. Didn't think the change was significant enough to mention.
But what about the second part? Ever heard of bind peeking? CBO will calculate some costs for inequality predicates (or equality predicates against columns with histograms) based on the bind value of the first person who loads the SQL into the shared pool -- or if it is getting reloaded after a grant has invalidated the old version. Customer just had some bad luck the last couple of weeks that the first hard parse may have put in values that were not representative of typical usage.
Final solution, only issue grants during off hours again. To be safe about the bind peeking, they associated an outline of the good plan to the dangerous SQL. Adding a HINT wasn't an option as it was a third party app. Disabling bind peeking could have broken other SQL plans that needed that feature.
Below is a simplified case to illustrate what was happening. Easier to see if it is done from two SQL*Plus sessions sitting side by side:
Session 1:
REM T1: Set up test case
var x number;
var y number;
create table bigtab as select * from all_objects;
create index bt_ix on bigtab (object_id);
execute dbms_stats.gather_table_stats (ownname=>'SCOTT', -
tabname=>'BIGTAB', CASCADE => TRUE, -
method_opt => 'FOR ALL COLUMNS SIZE 1');
REM Start with narrow range of values
begin :x := 1000; :y := 1001; end;
/
SELECT COUNT(*) FROM BIGTAB WHERE OBJECT_ID BETWEEN :x and :y;
REM Goto session 2 - exec plan should be index range scan
REM T3: now lets re-execute with a wide range
begin :x := 0; :y := 50000; end;
/
SELECT COUNT(*) FROM BIGTAB WHERE OBJECT_ID BETWEEN :x and :y;
REM Goto session 2 again - exec plan should still be the same
REM T5: So lets invalidate this puppy
GRANT SELECT ON BIGTAB TO ORDSYS;
REM now query against v$sql_plan and v$sql in session 2 shows
REM "no rows selected"
REM T7: so now lets reload the cursor
SELECT COUNT(*) FROM BIGTAB WHERE OBJECT_ID BETWEEN :x and :y;
REM now session 2 shows a new plan with FAST FULL SCAN and it will
REM be used from now on no matter what the bind values are.
REM T9: so let's turn off bind peeking off
alter session set "_optim_peek_user_binds" = false;
SELECT COUNT(*) FROM BIGTAB WHERE OBJECT_ID BETWEEN :x and :y;
REM now v$sql_area will show version count of 2 (cause some other
REM session may still have bind peeking enabled)
Session 2:
column operation format a20
column options format a20
column object_name format a20
REM T2: After initial load with narrow range binds
select sql_id, sql_text from v$sql where sql_text like
'SELECT COUNT(*) FROM BIGTAB%';
SQL_ID
-------------
SQL_TEXT
-----------------------------------------------------------------
2aa40mj45939v
SELECT COUNT(*) FROM BIGTAB WHERE OBJECT_ID BETWEEN :x AND :y
select operation, options, object_name from v$sql_plan
where sql_id = '2aa40mj45939v';
OPERATION OPTIONS OBJECT_NAME
-------------------- -------------------- --------------------
SELECT STATEMENT
SORT AGGREGATE
FILTER
INDEX RANGE SCAN BT_IX
REM T4: Plan for second SQL still the same
select operation, options, object_name from v$sql_plan
where sql_id = '2aa40mj45939v';
OPERATION OPTIONS OBJECT_NAME
-------------------- -------------------- --------------------
SELECT STATEMENT
SORT AGGREGATE
FILTER
INDEX RANGE SCAN BT_IX
REM T6: after grant is issued
select operation, options, object_name from v$sql_plan
where sql_id = '2aa40mj45939v';
no rows selected
select sql_id from v$sql where sql_id = '2aa40mj45939v';
no rows selected
REM T8: Now cursor is reloaded
select operation, options, object_name from v$sql_plan
where sql_id = '2aa40mj45939v';
OPERATION OPTIONS OBJECT_NAME
-------------------- -------------------- --------------------
SELECT STATEMENT
SORT AGGREGATE
FILTER
INDEX FAST FULL SCAN BT_IX
select loads, invalidations, executions, version_count
from v$sqlarea
where sql_id = '2aa40mj45939v';
LOADS INVALIDATIONS EXECUTIONS VERSION_COUNT
---------- ------------- ---------- -------------
2 1 1 1
REM T10: After session altered "_optim_peek_user_binds" = FALSE
REM and query re-executed.
select loads, invalidations, executions, version_count
from v$sqlarea
where sql_id = '2aa40mj45939v';
LOADS INVALIDATIONS EXECUTIONS VERSION_COUNT
---------- ------------- ---------- -------------
3 1 2 2
select operation, options, object_name from v$sql_plan
where sql_id = '2aa40mj45939v' order by plan_hash_value;
OPERATION OPTIONS OBJECT_NAME
-------------------- -------------------- --------------------
FILTER
SORT AGGREGATE
SELECT STATEMENT
INDEX FAST FULL SCAN BT_IX
FILTER
SORT AGGREGATE
SELECT STATEMENT
INDEX RANGE SCAN BT_IX
7 comments:
Nice one.
It would be lot easier to read it if there was more width. Currently it looks a bit crampy and some of the sql text/comments are going over the darker area :)
Sudhi, thanks for the heads up. What browser are you viewing with? Firefox 1.5 and Netscape 7.1 seems to show this okay (though with very tiny text). IE messes up the screen such that the fonts are larger but the rightmost "gutter" column is pushed way down. I might be able to fix that by shrinking the font. None of my browsers show an overflow.
I might try to tweak the blog template over the next few weeks. Ideally, I wanted to display the listings side by side, but not sure there's enough room for that even if I try to use the full screen width of a standard monitor.
Actually you are right. My FireFox 1.0.6 also shows everything without spilling it out, but the text is so tiny that I just couldn't finish a paragraph with it. Thats when I did a font increase to get a similar effect of IE.
Are we going to get more of "ORCL" prefixed entries??
What release did you try this? At least in 10.2.0.1 bind_peeking is truned off...isin't it? I had to turn it on explicitly to get the change in plan.
I did my test on a fairly empty 10gR2 RAC system on Solaris. "_optim_peek_user_binds" is set to true if optimizer_features_enable = 9.0.0 or higher. Probably should have mentioned that somewhere. Thanks.
Hey Roderick! Always fun reading with these kinds of cases. Reminds me of some issues I saw a long time ago on an apps db OM schema where the DBA was granting privs to developers on the prod system (seeing lots of activity on the table) hanging out for hours.
Good to see you outside the firewall. . .
Nice one.
But when I try to try this on Oracle Database 10g Enterprise Edition Release 10.2.0.2.0
I could not able to reproduce 2 different plan hash value.
it was able to do mulitple version count but not multiple plan.
Please advise.
thanks
Post a Comment