AI-SDLC Metrics and Observability Specification
AI-SDLC Metrics and Observability Specification
Document type: Normative Status: Draft Spec version: v1alpha1
Table of Contents
- Introduction
- Metric Categories
- OpenTelemetry Integration
- Provenance Tracking
- Audit Trail Requirements
1. Introduction
Beyond DORA's four keys, AI-augmented development requires purpose-built measurements. This document defines the AI-SDLC metrics framework, OpenTelemetry integration conventions, provenance tracking requirements, and audit trail specifications.
Metrics serve two purposes in the AI-SDLC Framework:
- Governance — Metrics drive promotion and demotion decisions in the autonomy system
- Observability — Metrics provide visibility into the health and performance of AI-augmented development workflows
2. Metric Categories
The framework defines five metric categories. Implementations MUST support collecting and reporting metrics from at least the first three categories (Task Effectiveness, Human-in-Loop, Code Quality).
2.1 Task Effectiveness
Metrics measuring how effectively agents complete assigned tasks.
| Metric | Name | Type | Unit | Description |
|---|---|---|---|---|
ai_sdlc.task.success_rate | Agent success rate | Gauge | Ratio (0-1) | Tasks completed successfully / tasks assigned |
ai_sdlc.task.completion_time | Task completion time | Histogram | Seconds | Time from task assignment to completion |
ai_sdlc.task.time_vs_baseline | Time vs. human baseline | Gauge | Ratio | Agent completion time / human baseline for equivalent tasks |
ai_sdlc.task.resolution_time | Time-to-resolution | Histogram | Seconds | Time from task creation to resolution, by complexity tier |
Dimensions:
agent— Agent name (AgentRole reference)complexity_tier— low, medium, high, criticaltask_type— implementation, review, testing, deployment
2.2 Human-in-Loop Indicators
Metrics measuring the frequency and nature of human involvement.
| Metric | Name | Type | Unit | Description |
|---|---|---|---|---|
ai_sdlc.human.intervention_rate | Human intervention rate | Gauge | Ratio (0-1) | Tasks requiring human intervention / total tasks |
ai_sdlc.human.escalation_count | Escalation frequency | Counter | Count | Number of escalations from agent to human |
ai_sdlc.human.override_rate | Override rate | Gauge | Ratio (0-1) | Quality gate overrides / total evaluations |
Dimensions:
agent— Agent nameautonomy_level— 0, 1, 2, 3reason— Reason for intervention/escalation
2.3 Code Quality
Metrics measuring the quality of AI-generated code.
| Metric | Name | Type | Unit | Description |
|---|---|---|---|---|
ai_sdlc.code.acceptance_rate | Acceptance rate | Gauge | Ratio (0-1) | PRs accepted without modification / total PRs. Baseline: 0.27-0.30 |
ai_sdlc.code.defect_density | Defect density | Gauge | Defects/KLOC | Defects per thousand lines of code, by author type |
ai_sdlc.code.churn_rate | Churn rate | Gauge | Ratio (0-1) | Lines changed within 14 days of initial commit / total lines. AI baseline: ~0.41 higher than human |
ai_sdlc.code.security_pass_rate | Security scan pass rate | Gauge | Ratio (0-1) | PRs passing security scan / total PRs, by author type |
Dimensions:
author_type— ai-agent, humanagent— Agent name (when author_type is ai-agent)language— Programming languagerepository— Repository identifier
2.4 Economic Efficiency
Metrics measuring the cost-effectiveness of AI agent usage.
| Metric | Name | Type | Unit | Description |
|---|---|---|---|---|
ai_sdlc.cost.per_task | Cost per task | Histogram | USD | Total cost per task (tokens + compute + human review time) |
ai_sdlc.cost.model_usage_mix | Model usage mix | Gauge | Ratio (0-1) | Percentage of tasks using each model tier (cheap vs. expensive) |
ai_sdlc.cost.cache_hit_rate | Cache hit rate | Gauge | Ratio (0-1) | Cache hits / total requests to AI services |
ai_sdlc.cost.tco_per_feature | TCO per feature | Histogram | USD | Total cost of ownership per feature delivered |
ai_sdlc.cost.tokens.input | Input tokens | Counter | tokens | Total input tokens consumed |
ai_sdlc.cost.tokens.output | Output tokens | Counter | tokens | Total output tokens generated |
ai_sdlc.cost.tokens.cache_read | Cache read tokens | Counter | tokens | Total tokens served from cache |
ai_sdlc.cost.token_cost | Token cost | Histogram | USD | Cost attributed to token usage per execution |
ai_sdlc.cost.human_review_cost | Human review cost | Histogram | USD | Cost of human review time (configurable hourly rate per reviewer role) |
ai_sdlc.cost.compute_cost | Compute cost | Histogram | USD | Non-token compute cost per execution (e.g., sandbox, CI) |
ai_sdlc.cost.cache_savings | Cache savings | Counter | USD | Cumulative cost savings from cache hits (credited to benefiting agent) |
ai_sdlc.cost.budget_consumed | Budget consumed | Gauge | Ratio (0-1) | Fraction of budget consumed in current period |
ai_sdlc.cost.budget_remaining | Budget remaining | Gauge | USD | Remaining budget in current period |
ai_sdlc.cost.cost_per_line | Cost per line | Histogram | USD/line | Cost per line of code produced |
ai_sdlc.cost.cost_vs_estimate | Cost vs estimate | Histogram | Ratio | Actual cost / estimated cost |
ai_sdlc.cost.retry_waste | Retry waste | Counter | USD | Cumulative cost of failed attempts that were retried |
ai_sdlc.cost.circuit_breaker_saves | Circuit breaker saves | Counter | USD | Estimated cost saved by circuit breaker interruptions |
Dimensions:
model— Model identifieragent— Agent nameteam— Team/namespacestage— Pipeline stage namecomplexity— Complexity tier (low, medium, high, critical)repository— Repository identifieroutcome— Execution outcome (success, failure, aborted)
2.5 Autonomy Trajectory
Metrics tracking the progression of agent autonomy over time.
| Metric | Name | Type | Unit | Description |
|---|---|---|---|---|
ai_sdlc.autonomy.level | Autonomy level | Gauge | Level (0-3) | Current autonomy level per agent |
ai_sdlc.autonomy.complexity_handled | Complexity handled | Histogram | Score (1-10) | Distribution of task complexity scores handled at each level |
ai_sdlc.autonomy.intervention_trend | Intervention rate trend | Gauge | Ratio (0-1) | Rolling average of intervention rate (should decrease over time) |
ai_sdlc.autonomy.time_to_promotion | Time-to-promotion | Gauge | Seconds | Time spent at current level before promotion |
Dimensions:
agent— Agent namefrom_level— Previous level (for promotions)to_level— New level (for promotions)
3. OpenTelemetry Integration
The framework SHOULD define semantic conventions for AI-SDLC observability, extending OpenTelemetry's GenAI semantic conventions. All metric, trace, and log attribute names MUST use the ai_sdlc.* namespace prefix.
3.1 Traces
Implementations SHOULD generate spans for the following operations:
| Span Name | Description | Attributes |
|---|---|---|
ai_sdlc.pipeline.stage | One span per pipeline stage execution | pipeline, stage, agent |
ai_sdlc.agent.task | One span per agent task execution | agent, task_type, complexity |
ai_sdlc.gate.evaluation | One span per quality gate evaluation | gate, enforcement, result |
ai_sdlc.reconciliation.cycle | One span per reconciliation cycle | resource_kind, resource_name, result |
ai_sdlc.handoff | One span per agent handoff | source_agent, target_agent, contract_id |
Spans SHOULD be linked into traces following the pipeline execution flow:
pipeline.stage (implement)
└─→ agent.task (code-agent)
├─→ gate.evaluation (test-coverage)
└─→ gate.evaluation (security-scan)
└─→ handoff (code-agent → reviewer-agent)
pipeline.stage (review)
└─→ agent.task (reviewer-agent)
└─→ gate.evaluation (human-review)3.2 Metrics
Implementations MUST expose metrics using the names and types defined in Section 2. Metrics SHOULD be exportable via OpenTelemetry Protocol (OTLP).
Instrument types:
- Gauge — For values that can go up and down (e.g., autonomy level, rates)
- Counter — For values that only increase (e.g., escalation count)
- Histogram — For distributions (e.g., task completion time, cost per task)
3.3 Logs
Implementations MUST produce structured logs for the following events:
| Event | Required Fields | Description |
|---|---|---|
reconciliation.decision | resource, action, reason, result | Every reconciliation decision |
autonomy.promotion | agent, from_level, to_level, criteria_met | Agent promoted |
autonomy.demotion | agent, from_level, to_level, trigger | Agent demoted |
gate.override | gate, actor, role, justification | Quality gate overridden |
gate.failure | gate, enforcement, reason | Quality gate failed |
handoff.completed | source, target, contract, validation_result | Agent handoff completed |
Logs MUST be structured (JSON) and SHOULD be exportable via OTLP.
4. Provenance Tracking
Every AI-generated artifact MUST record provenance metadata. Provenance enables attribution, auditability, and regulatory compliance.
4.1 Required Fields
| Field | Type | Required | Description |
|---|---|---|---|
model | string | MUST | Model identifier (e.g., claude-sonnet-4-5-20250929). |
tool | string | MUST | Tool that generated the artifact (e.g., claude-code@1.2.0). |
promptHash | string | MUST | SHA-256 hash of the input prompt. |
timestamp | string (date-time) | MUST | ISO 8601 generation time. |
humanReviewer | string | MAY | Identity of the human who reviewed the artifact. |
reviewDecision | string | MAY | One of: approved, rejected, revised. |
cost | CostReceipt | MAY | Cost breakdown for this artifact. See RFC-0004. |
CostReceipt Fields
| Field | Type | Required | Description |
|---|---|---|---|
totalCost | number | MUST | Total cost of producing this artifact. |
currency | string | MUST | Currency code (e.g., USD). |
breakdown | CostBreakdown | MUST | Itemized cost breakdown. |
execution | ExecutionCostDetail | MAY | Detailed execution metrics. |
CostBreakdown Fields
| Field | Type | Required | Description |
|---|---|---|---|
tokenCost | number | MUST | Cost attributed to token usage. |
cacheSavings | number | MAY | Cost saved via cache hits. |
computeCost | number | MAY | Non-token compute cost. |
humanReviewCost | number | MAY | Cost of human review time. |
ExecutionCostDetail Fields
| Field | Type | Required | Description |
|---|---|---|---|
inputTokens | number | MUST | Total input tokens consumed. |
outputTokens | number | MUST | Total output tokens generated. |
cacheReadTokens | number | MAY | Tokens served from cache. |
modelCalls | number | MAY | Number of model API calls. |
wallClockSeconds | number | MAY | Wall clock execution time in seconds. |
retryCount | number | MAY | Number of retried attempts. |
4.2 Storage
Provenance metadata SHOULD be stored:
- In commit metadata (git trailers or notes)
- In PR descriptions or comments
- In a dedicated provenance store accessible via API
Implementations MUST ensure provenance records are immutable after creation.
4.3 Attribution
When QualityGate rules include requireAttribution: true, implementations MUST verify that provenance metadata is present and complete before admitting the artifact.
5. Audit Trail Requirements
Every action in the system MUST produce an immutable, tamper-evident audit log entry.
5.1 Required Audit Fields
| Field | Type | Required | Description |
|---|---|---|---|
id | string | MUST | Unique audit entry identifier. |
timestamp | string (date-time) | MUST | When the action occurred. |
actor | string | MUST | Identity of the actor (human or agent). |
actorType | string | MUST | One of: human, ai-agent, bot, service-account. |
action | string | MUST | Action performed (e.g., create, update, approve, override, promote, demote). |
resource | string | MUST | Resource affected (kind/namespace/name). |
policyEvaluated | string | MAY | Policy or gate that was evaluated. |
decision | string | MUST | Decision rendered (e.g., allowed, denied, overridden). |
details | object | MAY | Additional context (justification, metric values, etc.). |
5.2 Immutability
Audit log entries MUST NOT be modifiable after creation. Implementations SHOULD use append-only storage with integrity verification (e.g., hash chains, write-once storage).
5.3 Retention
Implementations SHOULD support configurable retention policies for audit logs. The default retention period SHOULD be at least 12 months to support regulatory compliance.
5.4 Access
Audit logs MUST be accessible to authorized users via API. Implementations SHOULD support filtering by actor, action, resource, and time range.