AI-SDLC Metrics and Observability Specification

Document type: Normative Status: Draft Spec version: v1alpha1

Introduction
Metric Categories
OpenTelemetry Integration
Provenance Tracking
Audit Trail Requirements

1. Introduction

Beyond DORA's four keys, AI-augmented development requires purpose-built measurements. This document defines the AI-SDLC metrics framework, OpenTelemetry integration conventions, provenance tracking requirements, and audit trail specifications.

Metrics serve two purposes in the AI-SDLC Framework:

Governance — Metrics drive promotion and demotion decisions in the autonomy system
Observability — Metrics provide visibility into the health and performance of AI-augmented development workflows

2. Metric Categories

The framework defines five metric categories. Implementations MUST support collecting and reporting metrics from at least the first three categories (Task Effectiveness, Human-in-Loop, Code Quality).

2.1 Task Effectiveness

Metrics measuring how effectively agents complete assigned tasks.

Metric	Name	Type	Unit	Description
`ai_sdlc.task.success_rate`	Agent success rate	Gauge	Ratio (0-1)	Tasks completed successfully / tasks assigned
`ai_sdlc.task.completion_time`	Task completion time	Histogram	Seconds	Time from task assignment to completion
`ai_sdlc.task.time_vs_baseline`	Time vs. human baseline	Gauge	Ratio	Agent completion time / human baseline for equivalent tasks
`ai_sdlc.task.resolution_time`	Time-to-resolution	Histogram	Seconds	Time from task creation to resolution, by complexity tier

Dimensions:

agent — Agent name (AgentRole reference)
complexity_tier — low, medium, high, critical
task_type — implementation, review, testing, deployment

2.2 Human-in-Loop Indicators

Metrics measuring the frequency and nature of human involvement.

Metric	Name	Type	Unit	Description
`ai_sdlc.human.intervention_rate`	Human intervention rate	Gauge	Ratio (0-1)	Tasks requiring human intervention / total tasks
`ai_sdlc.human.escalation_count`	Escalation frequency	Counter	Count	Number of escalations from agent to human
`ai_sdlc.human.override_rate`	Override rate	Gauge	Ratio (0-1)	Quality gate overrides / total evaluations

Dimensions:

agent — Agent name
autonomy_level — 0, 1, 2, 3
reason — Reason for intervention/escalation

2.3 Code Quality

Metrics measuring the quality of AI-generated code.

Metric	Name	Type	Unit	Description
`ai_sdlc.code.acceptance_rate`	Acceptance rate	Gauge	Ratio (0-1)	PRs accepted without modification / total PRs. Baseline: 0.27-0.30
`ai_sdlc.code.defect_density`	Defect density	Gauge	Defects/KLOC	Defects per thousand lines of code, by author type
`ai_sdlc.code.churn_rate`	Churn rate	Gauge	Ratio (0-1)	Lines changed within 14 days of initial commit / total lines. AI baseline: ~0.41 higher than human
`ai_sdlc.code.security_pass_rate`	Security scan pass rate	Gauge	Ratio (0-1)	PRs passing security scan / total PRs, by author type

Dimensions:

author_type — ai-agent, human
agent — Agent name (when author_type is ai-agent)
language — Programming language
repository — Repository identifier

2.4 Economic Efficiency

Metrics measuring the cost-effectiveness of AI agent usage.

Metric	Name	Type	Unit	Description
`ai_sdlc.cost.per_task`	Cost per task	Histogram	USD	Total cost per task (tokens + compute + human review time)
`ai_sdlc.cost.model_usage_mix`	Model usage mix	Gauge	Ratio (0-1)	Percentage of tasks using each model tier (cheap vs. expensive)
`ai_sdlc.cost.cache_hit_rate`	Cache hit rate	Gauge	Ratio (0-1)	Cache hits / total requests to AI services
`ai_sdlc.cost.tco_per_feature`	TCO per feature	Histogram	USD	Total cost of ownership per feature delivered
`ai_sdlc.cost.tokens.input`	Input tokens	Counter	tokens	Total input tokens consumed
`ai_sdlc.cost.tokens.output`	Output tokens	Counter	tokens	Total output tokens generated
`ai_sdlc.cost.tokens.cache_read`	Cache read tokens	Counter	tokens	Total tokens served from cache
`ai_sdlc.cost.token_cost`	Token cost	Histogram	USD	Cost attributed to token usage per execution
`ai_sdlc.cost.human_review_cost`	Human review cost	Histogram	USD	Cost of human review time (configurable hourly rate per reviewer role)
`ai_sdlc.cost.compute_cost`	Compute cost	Histogram	USD	Non-token compute cost per execution (e.g., sandbox, CI)
`ai_sdlc.cost.cache_savings`	Cache savings	Counter	USD	Cumulative cost savings from cache hits (credited to benefiting agent)
`ai_sdlc.cost.budget_consumed`	Budget consumed	Gauge	Ratio (0-1)	Fraction of budget consumed in current period
`ai_sdlc.cost.budget_remaining`	Budget remaining	Gauge	USD	Remaining budget in current period
`ai_sdlc.cost.cost_per_line`	Cost per line	Histogram	USD/line	Cost per line of code produced
`ai_sdlc.cost.cost_vs_estimate`	Cost vs estimate	Histogram	Ratio	Actual cost / estimated cost
`ai_sdlc.cost.retry_waste`	Retry waste	Counter	USD	Cumulative cost of failed attempts that were retried
`ai_sdlc.cost.circuit_breaker_saves`	Circuit breaker saves	Counter	USD	Estimated cost saved by circuit breaker interruptions

Dimensions:

model — Model identifier
agent — Agent name
team — Team/namespace
stage — Pipeline stage name
complexity — Complexity tier (low, medium, high, critical)
repository — Repository identifier
outcome — Execution outcome (success, failure, aborted)

2.5 Autonomy Trajectory

Metrics tracking the progression of agent autonomy over time.

Metric	Name	Type	Unit	Description
`ai_sdlc.autonomy.level`	Autonomy level	Gauge	Level (0-3)	Current autonomy level per agent
`ai_sdlc.autonomy.complexity_handled`	Complexity handled	Histogram	Score (1-10)	Distribution of task complexity scores handled at each level
`ai_sdlc.autonomy.intervention_trend`	Intervention rate trend	Gauge	Ratio (0-1)	Rolling average of intervention rate (should decrease over time)
`ai_sdlc.autonomy.time_to_promotion`	Time-to-promotion	Gauge	Seconds	Time spent at current level before promotion

Dimensions:

agent — Agent name
from_level — Previous level (for promotions)
to_level — New level (for promotions)

3. OpenTelemetry Integration

The framework SHOULD define semantic conventions for AI-SDLC observability, extending OpenTelemetry's GenAI semantic conventions. All metric, trace, and log attribute names MUST use the ai_sdlc.* namespace prefix.

3.1 Traces

Implementations SHOULD generate spans for the following operations:

Span Name	Description	Attributes
`ai_sdlc.pipeline.stage`	One span per pipeline stage execution	`pipeline`, `stage`, `agent`
`ai_sdlc.agent.task`	One span per agent task execution	`agent`, `task_type`, `complexity`
`ai_sdlc.gate.evaluation`	One span per quality gate evaluation	`gate`, `enforcement`, `result`
`ai_sdlc.reconciliation.cycle`	One span per reconciliation cycle	`resource_kind`, `resource_name`, `result`
`ai_sdlc.handoff`	One span per agent handoff	`source_agent`, `target_agent`, `contract_id`

Spans SHOULD be linked into traces following the pipeline execution flow:

pipeline.stage (implement)
  └─→ agent.task (code-agent)
       ├─→ gate.evaluation (test-coverage)
       └─→ gate.evaluation (security-scan)
  └─→ handoff (code-agent → reviewer-agent)
pipeline.stage (review)
  └─→ agent.task (reviewer-agent)
       └─→ gate.evaluation (human-review)

3.2 Metrics

Implementations MUST expose metrics using the names and types defined in Section 2. Metrics SHOULD be exportable via OpenTelemetry Protocol (OTLP).

Instrument types:

Gauge — For values that can go up and down (e.g., autonomy level, rates)
Counter — For values that only increase (e.g., escalation count)
Histogram — For distributions (e.g., task completion time, cost per task)

3.3 Logs

Implementations MUST produce structured logs for the following events:

Event	Required Fields	Description
`reconciliation.decision`	resource, action, reason, result	Every reconciliation decision
`autonomy.promotion`	agent, from_level, to_level, criteria_met	Agent promoted
`autonomy.demotion`	agent, from_level, to_level, trigger	Agent demoted
`gate.override`	gate, actor, role, justification	Quality gate overridden
`gate.failure`	gate, enforcement, reason	Quality gate failed
`handoff.completed`	source, target, contract, validation_result	Agent handoff completed

Logs MUST be structured (JSON) and SHOULD be exportable via OTLP.

4. Provenance Tracking

Every AI-generated artifact MUST record provenance metadata. Provenance enables attribution, auditability, and regulatory compliance.

4.1 Required Fields

Field	Type	Required	Description
`model`	string	MUST	Model identifier (e.g., `claude-sonnet-4-5-20250929`).
`tool`	string	MUST	Tool that generated the artifact (e.g., `claude-code@1.2.0`).
`promptHash`	string	MUST	SHA-256 hash of the input prompt.
`timestamp`	string (date-time)	MUST	ISO 8601 generation time.
`humanReviewer`	string	MAY	Identity of the human who reviewed the artifact.
`reviewDecision`	string	MAY	One of: `approved`, `rejected`, `revised`.
`cost`	CostReceipt	MAY	Cost breakdown for this artifact. See RFC-0004.

CostReceipt Fields

Field	Type	Required	Description
`totalCost`	number	MUST	Total cost of producing this artifact.
`currency`	string	MUST	Currency code (e.g., USD).
`breakdown`	CostBreakdown	MUST	Itemized cost breakdown.
`execution`	ExecutionCostDetail	MAY	Detailed execution metrics.

CostBreakdown Fields

Field	Type	Required	Description
`tokenCost`	number	MUST	Cost attributed to token usage.
`cacheSavings`	number	MAY	Cost saved via cache hits.
`computeCost`	number	MAY	Non-token compute cost.
`humanReviewCost`	number	MAY	Cost of human review time.

ExecutionCostDetail Fields

Field	Type	Required	Description
`inputTokens`	number	MUST	Total input tokens consumed.
`outputTokens`	number	MUST	Total output tokens generated.
`cacheReadTokens`	number	MAY	Tokens served from cache.
`modelCalls`	number	MAY	Number of model API calls.
`wallClockSeconds`	number	MAY	Wall clock execution time in seconds.
`retryCount`	number	MAY	Number of retried attempts.

4.2 Storage

Provenance metadata SHOULD be stored:

In commit metadata (git trailers or notes)
In PR descriptions or comments
In a dedicated provenance store accessible via API

Implementations MUST ensure provenance records are immutable after creation.

4.3 Attribution

When QualityGate rules include requireAttribution: true, implementations MUST verify that provenance metadata is present and complete before admitting the artifact.

5. Audit Trail Requirements

Every action in the system MUST produce an immutable, tamper-evident audit log entry.

5.1 Required Audit Fields

Field	Type	Required	Description
`id`	string	MUST	Unique audit entry identifier.
`timestamp`	string (date-time)	MUST	When the action occurred.
`actor`	string	MUST	Identity of the actor (human or agent).
`actorType`	string	MUST	One of: `human`, `ai-agent`, `bot`, `service-account`.
`action`	string	MUST	Action performed (e.g., `create`, `update`, `approve`, `override`, `promote`, `demote`).
`resource`	string	MUST	Resource affected (kind/namespace/name).
`policyEvaluated`	string	MAY	Policy or gate that was evaluated.
`decision`	string	MUST	Decision rendered (e.g., `allowed`, `denied`, `overridden`).
`details`	object	MAY	Additional context (justification, metric values, etc.).

AI-SDLC Metrics and Observability Specification

Table of Contents

1. Introduction

2. Metric Categories

2.1 Task Effectiveness

2.2 Human-in-Loop Indicators

2.3 Code Quality

2.4 Economic Efficiency

2.5 Autonomy Trajectory

3. OpenTelemetry Integration

3.1 Traces

3.2 Metrics

3.3 Logs

4. Provenance Tracking

4.1 Required Fields

CostReceipt Fields

CostBreakdown Fields

ExecutionCostDetail Fields

4.2 Storage

4.3 Attribution

5. Audit Trail Requirements

5.1 Required Audit Fields

5.2 Immutability

5.3 Retention

5.4 Access

On this page