Thursday, December 15, 2005

ORCL: Mysterious Performance Drop

This is based on true stories that I've run into about 3 times over the last few years (though coincidentally two of these cases appeared in the same week):

A customer's environment kept slowing down every afternoon for a few weeks in a row. Most of the time the slowdown would only last 10 or 20 minutes and then return to normal. However on certain days, performance would degrade and keep decaying until they were forced to shutdown and restart the database. Then the problem would go away until the next day. The claim was that nothing has changed recently to trigger this behavior.

So it was time to glance at some Statspack reports. During acceptable periods, top Timed Events were CPU time and some IO related events. First statspack when the problem starts shows "latch free" jumping to the lead. If the problem corrects itself, CPU returns to the top and everything pretty much looks like the BEFORE problem report again. But on the days where the problem did not correct itself, IO events jumped to the top followed by "latch free" and "buffer busy waits". Interesting.

Jumping to the latch stats section the first symptom was consistently high library cache latch. Then those waits would disappear and on the really bad days "cache buffers chains" jumped to the top. Okay, first symptom is usually a sign that there's some churn in the shared pool with old SQL getting aged out and new ones coming. And when IO becomes a problem (with high buffer cache activity, etc.), it implies some new IO heavy SQL got introduced.

However, the customer says the application transaction mix should not be changing. Top SQL shows all the same SQL with relatively same ratios of execution (all lower of course). But physical reads and buffer gets for one SQL has increased markedly -- execution plan must have changed! But customer swears that they are not loading massive amounts of data or gathering stats in the middle of some days, plus it seems like the old execution plan comes back after bouncing the database.

So what's the simplest story that would fit the symptoms?

Answering that required knowing or finding out a couple of bits of information about how Oracle works. "Are you guys are creating new user accounts every day at noon now and granting them privileges to the application tables?" Turns out the answer is positive and they had changed from running those scripts overnight to middle of the day. First part of mystery explained. Not many people know that grants on tables will invalidate dependent SQL in the shared pool. Didn't think the change was significant enough to mention.

But what about the second part? Ever heard of bind peeking? CBO will calculate some costs for inequality predicates (or equality predicates against columns with histograms) based on the bind value of the first person who loads the SQL into the shared pool -- or if it is getting reloaded after a grant has invalidated the old version. Customer just had some bad luck the last couple of weeks that the first hard parse may have put in values that were not representative of typical usage.

Final solution, only issue grants during off hours again. To be safe about the bind peeking, they associated an outline of the good plan to the dangerous SQL. Adding a HINT wasn't an option as it was a third party app. Disabling bind peeking could have broken other SQL plans that needed that feature.

Below is a simplified case to illustrate what was happening. Easier to see if it is done from two SQL*Plus sessions sitting side by side:

Session 1:
REM T1: Set up test case
var x number;
var y number;

create table bigtab as select * from all_objects;
create index bt_ix on bigtab (object_id);

execute dbms_stats.gather_table_stats (ownname=>'SCOTT', -
tabname=>'BIGTAB', CASCADE => TRUE, -
method_opt => 'FOR ALL COLUMNS SIZE 1');

REM Start with narrow range of values
begin :x := 1000; :y := 1001; end;
/

SELECT COUNT(*) FROM BIGTAB WHERE OBJECT_ID BETWEEN :x and :y;

REM Goto session 2 - exec plan should be index range scan
REM T3: now lets re-execute with a wide range

begin :x := 0; :y := 50000; end;
/

SELECT COUNT(*) FROM BIGTAB WHERE OBJECT_ID BETWEEN :x and :y;

REM Goto session 2 again - exec plan should still be the same

REM T5: So lets invalidate this puppy

GRANT SELECT ON BIGTAB TO ORDSYS;

REM now query against v$sql_plan and v$sql in session 2 shows
REM "no rows selected"
REM T7: so now lets reload the cursor

SELECT COUNT(*) FROM BIGTAB WHERE OBJECT_ID BETWEEN :x and :y;

REM now session 2 shows a new plan with FAST FULL SCAN and it will
REM be used from now on no matter what the bind values are.

REM T9: so let's turn off bind peeking off

alter session set "_optim_peek_user_binds" = false;

SELECT COUNT(*) FROM BIGTAB WHERE OBJECT_ID BETWEEN :x and :y;

REM now v$sql_area will show version count of 2 (cause some other
REM session may still have bind peeking enabled)


Session 2:
column operation format a20
column options format a20
column object_name format a20

REM T2: After initial load with narrow range binds

select sql_id, sql_text from v$sql where sql_text like
'SELECT COUNT(*) FROM BIGTAB%';

SQL_ID
-------------
SQL_TEXT
-----------------------------------------------------------------
2aa40mj45939v
SELECT COUNT(*) FROM BIGTAB WHERE OBJECT_ID BETWEEN :x AND :y



select operation, options, object_name from v$sql_plan
where sql_id = '2aa40mj45939v';

OPERATION OPTIONS OBJECT_NAME
-------------------- -------------------- --------------------
SELECT STATEMENT
SORT AGGREGATE
FILTER
INDEX RANGE SCAN BT_IX

REM T4: Plan for second SQL still the same
select operation, options, object_name from v$sql_plan
where sql_id = '2aa40mj45939v';

OPERATION OPTIONS OBJECT_NAME
-------------------- -------------------- --------------------
SELECT STATEMENT
SORT AGGREGATE
FILTER
INDEX RANGE SCAN BT_IX

REM T6: after grant is issued

select operation, options, object_name from v$sql_plan
where sql_id = '2aa40mj45939v';

no rows selected

select sql_id from v$sql where sql_id = '2aa40mj45939v';

no rows selected

REM T8: Now cursor is reloaded

select operation, options, object_name from v$sql_plan
where sql_id = '2aa40mj45939v';

OPERATION OPTIONS OBJECT_NAME
-------------------- -------------------- --------------------
SELECT STATEMENT
SORT AGGREGATE
FILTER
INDEX FAST FULL SCAN BT_IX

select loads, invalidations, executions, version_count
from v$sqlarea
where sql_id = '2aa40mj45939v';

LOADS INVALIDATIONS EXECUTIONS VERSION_COUNT
---------- ------------- ---------- -------------
2 1 1 1

REM T10: After session altered "_optim_peek_user_binds" = FALSE
REM and query re-executed.

select loads, invalidations, executions, version_count
from v$sqlarea
where sql_id = '2aa40mj45939v';

LOADS INVALIDATIONS EXECUTIONS VERSION_COUNT
---------- ------------- ---------- -------------
3 1 2 2

select operation, options, object_name from v$sql_plan
where sql_id = '2aa40mj45939v' order by plan_hash_value;

OPERATION OPTIONS OBJECT_NAME
-------------------- -------------------- --------------------
FILTER
SORT AGGREGATE
SELECT STATEMENT
INDEX FAST FULL SCAN BT_IX
FILTER
SORT AGGREGATE
SELECT STATEMENT
INDEX RANGE SCAN BT_IX

7 comments:

Sudhi said...

Nice one.

It would be lot easier to read it if there was more width. Currently it looks a bit crampy and some of the sql text/comments are going over the darker area :)

Roderick said...

Sudhi, thanks for the heads up. What browser are you viewing with? Firefox 1.5 and Netscape 7.1 seems to show this okay (though with very tiny text). IE messes up the screen such that the fonts are larger but the rightmost "gutter" column is pushed way down. I might be able to fix that by shrinking the font. None of my browsers show an overflow.

I might try to tweak the blog template over the next few weeks. Ideally, I wanted to display the listings side by side, but not sure there's enough room for that even if I try to use the full screen width of a standard monitor.

Sudhi said...

Actually you are right. My FireFox 1.0.6 also shows everything without spilling it out, but the text is so tiny that I just couldn't finish a paragraph with it. Thats when I did a font increase to get a similar effect of IE.

Are we going to get more of "ORCL" prefixed entries??

Sudhi said...

What release did you try this? At least in 10.2.0.1 bind_peeking is truned off...isin't it? I had to turn it on explicitly to get the change in plan.

Roderick said...

I did my test on a fairly empty 10gR2 RAC system on Solaris. "_optim_peek_user_binds" is set to true if optimizer_features_enable = 9.0.0 or higher. Probably should have mentioned that somewhere. Thanks.

Anonymous said...

Hey Roderick! Always fun reading with these kinds of cases. Reminds me of some issues I saw a long time ago on an apps db OM schema where the DBA was granting privs to developers on the prod system (seeing lots of activity on the table) hanging out for hours.

Good to see you outside the firewall. . .

Unknown said...

Nice one.

But when I try to try this on Oracle Database 10g Enterprise Edition Release 10.2.0.2.0

I could not able to reproduce 2 different plan hash value.

it was able to do mulitple version count but not multiple plan.

Please advise.
thanks