I recently faced an issue when one of my production database which was running very slow.
A simple select statement on dictionary tables were taking too much time.
Size of one of my tablespace was reaching 100% and the "Alter Tablespace add datafile" command juts to add 1 GB of datafile with autoextend on also went into hang mode.
My initial investigation started with checking CPU and memory of the server. CPU was 80% free and also there was enough memory available on the server.
Next step was to check the active wait events on the database using below query:
select event,count(*) from gv$session where event not like 'Streams%' and wait_class <> 'Idle' group by event order by 2/
Results of the above query displayed that almost all of my database sessions were waiting on wait event "log file switch (checkpoint incomplete)". Even my session from where I executed the command to add the datafile was waiting on this event. Along with this there was another wait event "db file async I/O submit". These wait events directed me that there is something that is related to redo generation or redo log files. Size of the redo log files was 2.5 GB each which was good enough. Also the size would not have been problem as the database was working fine since a long time and suddenly faced this slowness problem.
Looking into the messages in database alert log file shows that there are frequent log switches, i.e. every 40 seconds which was very high. Oracle recommends to have one log switch in every 20-30 minutes for optimal performance.
Executed the below query to identify which session is generating the high redo.
set linesize 200
col event for a21
col action for a10
col module for a30
col username for a10
col name for a15
select /*+ ordered use_nl(s n) */
substr(n.name, 1, 31) event
,s.value
,n.statistic#,s.sid,a.action,a.module,a.username
from v$statname n,v$sesstat s,v$session a
where n.statistic# = s.statistic#(+)
and a.username is not null and a.sid=s.sid
and n.name in ('redo entries','redo size')
order by 2;
After looking into the results it was clearly visible that the value for redo size and entries for one of the session was very high.
The next step was to identify the session sid provided by the above query.
Executing a query against v$session for the sid provided in the above query results displayed that it is a database job and the next query to v$dba_jobs_running displayed that the job is currently running. We contacted the development team and found that the job was recently created during the last change that went into production.
In order to bring the performance back to normal the job session was killed and the job was broken in order to avoid its future runs.
Immediately after the above action performance came back to normal. The command to add the datafile was also completed.
A simple select statement on dictionary tables were taking too much time.
Size of one of my tablespace was reaching 100% and the "Alter Tablespace add datafile" command juts to add 1 GB of datafile with autoextend on also went into hang mode.
My initial investigation started with checking CPU and memory of the server. CPU was 80% free and also there was enough memory available on the server.
Next step was to check the active wait events on the database using below query:
select event,count(*) from gv$session where event not like 'Streams%' and wait_class <> 'Idle' group by event order by 2/
Results of the above query displayed that almost all of my database sessions were waiting on wait event "log file switch (checkpoint incomplete)". Even my session from where I executed the command to add the datafile was waiting on this event. Along with this there was another wait event "db file async I/O submit". These wait events directed me that there is something that is related to redo generation or redo log files. Size of the redo log files was 2.5 GB each which was good enough. Also the size would not have been problem as the database was working fine since a long time and suddenly faced this slowness problem.
Looking into the messages in database alert log file shows that there are frequent log switches, i.e. every 40 seconds which was very high. Oracle recommends to have one log switch in every 20-30 minutes for optimal performance.
Executed the below query to identify which session is generating the high redo.
set linesize 200
col event for a21
col action for a10
col module for a30
col username for a10
col name for a15
select /*+ ordered use_nl(s n) */
substr(n.name, 1, 31) event
,s.value
,n.statistic#,s.sid,a.action,a.module,a.username
from v$statname n,v$sesstat s,v$session a
where n.statistic# = s.statistic#(+)
and a.username is not null and a.sid=s.sid
and n.name in ('redo entries','redo size')
order by 2;
After looking into the results it was clearly visible that the value for redo size and entries for one of the session was very high.
The next step was to identify the session sid provided by the above query.
Executing a query against v$session for the sid provided in the above query results displayed that it is a database job and the next query to v$dba_jobs_running displayed that the job is currently running. We contacted the development team and found that the job was recently created during the last change that went into production.
In order to bring the performance back to normal the job session was killed and the job was broken in order to avoid its future runs.
Immediately after the above action performance came back to normal. The command to add the datafile was also completed.
Comments
Post a Comment