Performance diagnostics without a methodology is archaeology — you dig until you find something interesting, not necessarily something important. Oracle’s AWR and ASH provide an extraordinarily detailed record of database activity, but interpreting them requires a structured approach. This article presents a top-down methodology that moves from macro to micro: DB Time → Wait Profile → Top SQL → Root Cause.

The Foundation: DB Time

DB Time is the total time all sessions spent working in the database (on CPU or waiting for database resources) during a snapshot period. It is the single most important number in the AWR report.

-- DB Time from AWR snapshots
SELECT s.snap_id,
       s.begin_interval_time,
       s.end_interval_time,
       ROUND(m.value / 1000000 / 60, 2) AS db_time_minutes
FROM dba_hist_sys_time_model m
JOIN dba_hist_snapshot s ON m.snap_id = s.snap_id AND m.dbid = s.dbid
WHERE m.stat_name = 'DB time'
  AND s.begin_interval_time >= SYSDATE - 1
ORDER BY s.snap_id DESC;

DB Time as a Capacity Metric

DB Time / elapsed wall-clock time = average active sessions (AAS).

AAS = DB Time (seconds) / Elapsed Time (seconds)

If AAS > number of CPUs, the database is overloaded (CPU is the bottleneck or sessions are queued). If AAS ≪ number of CPUs, the database has headroom. This single ratio puts all performance numbers in context.


Step 1: Wait Event Profile

Decompose DB Time into CPU time and wait time categories. This tells you whether time is being spent usefully (on CPU) or blocking (waiting).

-- Wait event profile from AWR (one snapshot interval)
SELECT event,
       waits,
       ROUND(time_waited_micro / 1000000, 2) AS total_wait_secs,
       ROUND(time_waited_micro / 1000000 / waits * 1000, 2) AS avg_wait_ms,
       ROUND(time_waited_micro / (SELECT SUM(time_waited_micro)
                                   FROM dba_hist_system_event e2
                                   WHERE e2.snap_id = e.snap_id
                                     AND e2.dbid = e.dbid) * 100, 1) AS pct_total
FROM dba_hist_system_event e
WHERE snap_id = (SELECT MAX(snap_id) - 1 FROM dba_hist_snapshot)
  AND wait_class NOT IN ('Idle')
ORDER BY time_waited_micro DESC
FETCH FIRST 10 ROWS ONLY;

Interpreting Wait Class Distribution

Wait Class Common Events What It Signals
User I/O db file sequential read, db file scattered read Index scans, full table scans
Commit log file sync High commit rate; redo bottleneck
Concurrency buffer busy waits, enq: TX - row lock contention Hot blocks; lock contention
System I/O log file parallel write, control file I/O Redo write performance
Network SQL*Net more data to client Result set transfer overhead
Other gc cr blocks received, gc current blocks received RAC Cache Fusion (only in RAC)
Scheduler resmgr: cpu quantum CPU contention; resource manager queuing

If On CPU (not a wait event but reported in the Top 5 Timed Events) is dominant, focus on reducing logical I/Os (unnecessary full scans, missing indexes, suboptimal SQL).


Step 2: ASH for In-flight Diagnostics

AWR snapshots are 30 minutes apart by default. For problems that occur within a snapshot interval (or are happening right now), use Active Session History (ASH) — a rolling in-memory buffer sampled every second.

Real-time ASH: What Is the Database Doing Now?

-- What are active sessions doing right now?
SELECT event,
       COUNT(*) AS active_sessions,
       ROUND(COUNT(*) * 100 / SUM(COUNT(*)) OVER (), 1) AS pct
FROM v$active_session_history
WHERE sample_time >= SYSDATE - 1/1440   -- last 1 minute
  AND session_state = 'WAITING'
GROUP BY event
ORDER BY active_sessions DESC;
-- Which SQL IDs are consuming the most DB Time right now?
SELECT sql_id,
       COUNT(*) AS ash_samples,
       ROUND(COUNT(*) * 10, 0) AS approx_db_time_secs  -- ~1 sample/sec
FROM v$active_session_history
WHERE sample_time >= SYSDATE - 5/1440   -- last 5 minutes
GROUP BY sql_id
ORDER BY ash_samples DESC
FETCH FIRST 10 ROWS ONLY;

Historical ASH from DBA_HIST_ACTIVE_SESS_HISTORY

-- Top wait events for a specific time window (historical)
SELECT ash.event,
       COUNT(*) AS samples,
       ROUND(COUNT(*) / SUM(COUNT(*)) OVER () * 100, 1) AS pct_db_time
FROM dba_hist_active_sess_history ash
WHERE ash.sample_time BETWEEN TO_TIMESTAMP('2026-03-01 09:00', 'YYYY-MM-DD HH24:MI')
                          AND TO_TIMESTAMP('2026-03-01 09:30', 'YYYY-MM-DD HH24:MI')
  AND ash.session_state = 'WAITING'
  AND ash.wait_class != 'Idle'
GROUP BY ash.event
ORDER BY samples DESC
FETCH FIRST 15 ROWS ONLY;

ASH to Identify the “Offending” Time Window

-- Activity timeline in 5-minute buckets (find when the spike occurred)
SELECT TRUNC(sample_time, 'MI') - MOD(EXTRACT(MINUTE FROM sample_time), 5) * INTERVAL '1' MINUTE AS bucket,
       COUNT(*) AS active_sessions
FROM dba_hist_active_sess_history
WHERE sample_time >= SYSDATE - 2   -- last 2 days
GROUP BY TRUNC(sample_time, 'MI') - MOD(EXTRACT(MINUTE FROM sample_time), 5) * INTERVAL '1' MINUTE
ORDER BY bucket;

Step 3: Top SQL Identification

Once you know the dominant wait event, find which SQL statements are responsible.

-- Top SQL by DB Time from AWR (for a specific snapshot range)
SELECT sql.sql_id,
       ROUND(SUM(elapsed_time_delta) / 1000000, 2)          AS elapsed_secs,
       SUM(executions_delta)                                 AS executions,
       ROUND(SUM(elapsed_time_delta) / NULLIF(SUM(executions_delta), 0) / 1000, 2) AS avg_ms,
       ROUND(SUM(buffer_gets_delta) / NULLIF(SUM(executions_delta), 0)) AS avg_lios,
       ROUND(SUM(disk_reads_delta) / NULLIF(SUM(executions_delta), 0))  AS avg_pio
FROM dba_hist_sqlstat sql
JOIN dba_hist_snapshot s ON sql.snap_id = s.snap_id AND sql.dbid = s.dbid
WHERE s.begin_interval_time >= SYSDATE - 1
GROUP BY sql.sql_id
ORDER BY elapsed_secs DESC
FETCH FIRST 20 ROWS ONLY;

Retrieve the SQL Text

SELECT sql_id, sql_text
FROM dba_hist_sqltext
WHERE sql_id = '&sql_id';

Step 4: Execution Plan Analysis

For identified top SQL, retrieve the execution plan and runtime statistics:

-- Current plan with runtime stats (requires the query to be in shared pool)
SELECT *
FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR('&sql_id', NULL, 'ALLSTATS LAST +PEEKED_BINDS'));
-- Historical plan from AWR
SELECT *
FROM TABLE(DBMS_XPLAN.DISPLAY_AWR('&sql_id', NULL, NULL, 'ALL'));

Reading the Plan: Key Indicators of Problems

  • High E-rows vs A-rows discrepancy: The optimizer’s row estimate is far from actual. Points to stale statistics or bind variable peeking issues.
  • Full table scan on large table without Smart Scan: If not on Exadata, indicates a missing index or poor index usage.
  • Nested loop on large row sources: Driving a large table into a nested loop with many probe iterations.
  • Sort/Hash Join with large temp space: Workarea spill to disk (TEMP tablespace writes).

ADDM: Automated Diagnostic Output

ADDM analyses AWR snapshots automatically and surfaces findings with impact estimates:

-- View latest ADDM task findings
SELECT f.type,
       f.impact_type,
       ROUND(f.impact, 0) AS impact_secs,
       f.message
FROM dba_advisor_findings f
JOIN dba_advisor_tasks t ON f.task_id = t.task_id
WHERE t.advisor_name = 'ADDM'
  AND t.created >= SYSDATE - 1
ORDER BY f.impact DESC
FETCH FIRST 10 ROWS ONLY;

ADDM findings are a useful starting point but not a substitute for manual analysis. ADDM tends to surface SQL-level issues well, but misses application-tier problems, connection management issues, and complex multi-component bottlenecks.


Practical Diagnostic Workflow Summary

  1. Establish baseline: What is the expected AAS for this workload at this time of day?
  2. Identify anomalous period: Use ASH timeline query to find when performance degraded.
  3. Decompose by wait class: What wait class dominates during the anomalous period?
  4. Find top SQL: Which SQL IDs account for the most samples in that wait class/event?
  5. Analyse the plan: Is the execution plan for those SQL IDs correct? Compare against known-good plans if available.
  6. Act: Fix statistics, add indexes, pin SQL baselines, fix application, or tune resource management — depending on what step 5 reveals.

The methodology is the same whether the problem happened an hour ago (ASH) or two weeks ago (AWR). The data is there — the key is following a consistent path from macro to micro.