AI-SDLC Metrics and Observability Specification

Document type: Normative | Status: Draft | Spec version: v1alpha1


Table of Contents

  1. Introduction
  2. Metric Categories
  3. OpenTelemetry Integration
  4. Provenance Tracking
  5. Audit Trail Requirements

1. Introduction

Beyond DORA's four keys, AI-augmented development requires purpose-built measurements. This document defines the AI-SDLC metrics framework, OpenTelemetry integration conventions, provenance tracking requirements, and audit trail specifications.

Metrics serve two purposes in the AI-SDLC Framework:

  1. Governance — Metrics drive promotion and demotion decisions in the autonomy system
  2. Observability — Metrics provide visibility into the health and performance of AI-augmented development workflows

2. Metric Categories

The framework defines five metric categories. Implementations MUST support collecting and reporting metrics from at least the first three categories (Task Effectiveness, Human-in-Loop, Code Quality).

2.1 Task Effectiveness

Metrics measuring how effectively agents complete assigned tasks.

| Metric | Name | Type | Unit | Description |
|---|---|---|---|---|
| ai_sdlc.task.success_rate | Agent success rate | Gauge | Ratio (0-1) | Tasks completed successfully / tasks assigned |
| ai_sdlc.task.completion_time | Task completion time | Histogram | Seconds | Time from task assignment to completion |
| ai_sdlc.task.time_vs_baseline | Time vs. human baseline | Gauge | Ratio | Agent completion time / human baseline for equivalent tasks |
| ai_sdlc.task.resolution_time | Time-to-resolution | Histogram | Seconds | Time from task creation to resolution, by complexity tier |

Dimensions:

  • agent — Agent name (AgentRole reference)
  • complexity_tier — low, medium, high, critical
  • task_type — implementation, review, testing, deployment
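
As a non-normative sketch, the two ratio metrics above can be derived from raw task records, keyed by the agent and complexity_tier dimensions. The record shape and sample values here are assumptions, not part of this spec; only the metric names are normative.

```python
from collections import defaultdict

# Assumed record shape; only the metric names come from this spec.
tasks = [
    {"agent": "code-agent", "complexity_tier": "low", "succeeded": True,
     "seconds": 120.0, "baseline_seconds": 300.0},
    {"agent": "code-agent", "complexity_tier": "low", "succeeded": False,
     "seconds": 90.0, "baseline_seconds": 300.0},
    {"agent": "code-agent", "complexity_tier": "high", "succeeded": True,
     "seconds": 900.0, "baseline_seconds": 600.0},
]

def task_effectiveness(records):
    """Aggregate per (agent, complexity_tier) dimension pair."""
    grouped = defaultdict(list)
    for r in records:
        grouped[(r["agent"], r["complexity_tier"])].append(r)
    return {
        dims: {
            "ai_sdlc.task.success_rate":
                sum(r["succeeded"] for r in rs) / len(rs),
            "ai_sdlc.task.time_vs_baseline":
                sum(r["seconds"] for r in rs)
                / sum(r["baseline_seconds"] for r in rs),
        }
        for dims, rs in grouped.items()
    }

m = task_effectiveness(tasks)
```

For the low tier above this yields a success rate of 0.5 and a time-vs-baseline ratio of 0.35, i.e. the agent finishes well under the human baseline but only half of its tasks succeed.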

2.2 Human-in-Loop Indicators

Metrics measuring the frequency and nature of human involvement.

| Metric | Name | Type | Unit | Description |
|---|---|---|---|---|
| ai_sdlc.human.intervention_rate | Human intervention rate | Gauge | Ratio (0-1) | Tasks requiring human intervention / total tasks |
| ai_sdlc.human.escalation_count | Escalation frequency | Counter | Count | Number of escalations from agent to human |
| ai_sdlc.human.override_rate | Override rate | Gauge | Ratio (0-1) | Quality gate overrides / total evaluations |

Dimensions:

  • agent — Agent name
  • autonomy_level — 0, 1, 2, 3
  • reason — Reason for intervention/escalation

2.3 Code Quality

Metrics measuring the quality of AI-generated code.

| Metric | Name | Type | Unit | Description |
|---|---|---|---|---|
| ai_sdlc.code.acceptance_rate | Acceptance rate | Gauge | Ratio (0-1) | PRs accepted without modification / total PRs. Baseline: 0.27-0.30 |
| ai_sdlc.code.defect_density | Defect density | Gauge | Defects/KLOC | Defects per thousand lines of code, by author type |
| ai_sdlc.code.churn_rate | Churn rate | Gauge | Ratio (0-1) | Lines changed within 14 days of initial commit / total lines. AI baseline: ~0.41 higher than human |
| ai_sdlc.code.security_pass_rate | Security scan pass rate | Gauge | Ratio (0-1) | PRs passing security scan / total PRs, by author type |

Dimensions:

  • author_type — ai-agent, human
  • agent — Agent name (when author_type is ai-agent)
  • language — Programming language
  • repository — Repository identifier
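
To make the units concrete, the two density/ratio formulas can be sketched as follows. This is non-normative; the function names and sample figures are illustrative only.

```python
def defect_density(defects: int, lines_of_code: int) -> float:
    """ai_sdlc.code.defect_density: defects per thousand lines (KLOC)."""
    return defects / (lines_of_code / 1000)

def churn_rate(lines_changed_within_14d: int, total_lines: int) -> float:
    """ai_sdlc.code.churn_rate: lines reworked within 14 days / total lines."""
    return lines_changed_within_14d / total_lines

density = defect_density(6, 12_000)   # 0.5 defects/KLOC
churn = churn_rate(410, 1_000)        # 0.41
```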

2.4 Economic Efficiency

Metrics measuring the cost-effectiveness of AI agent usage.

| Metric | Name | Type | Unit | Description |
|---|---|---|---|---|
| ai_sdlc.cost.per_task | Cost per task | Histogram | USD | Total cost per task (tokens + compute + human review time) |
| ai_sdlc.cost.model_usage_mix | Model usage mix | Gauge | Ratio (0-1) | Percentage of tasks using each model tier (cheap vs. expensive) |
| ai_sdlc.cost.cache_hit_rate | Cache hit rate | Gauge | Ratio (0-1) | Cache hits / total requests to AI services |
| ai_sdlc.cost.tco_per_feature | TCO per feature | Histogram | USD | Total cost of ownership per feature delivered |
| ai_sdlc.cost.tokens.input | Input tokens | Counter | tokens | Total input tokens consumed |
| ai_sdlc.cost.tokens.output | Output tokens | Counter | tokens | Total output tokens generated |
| ai_sdlc.cost.tokens.cache_read | Cache read tokens | Counter | tokens | Total tokens served from cache |
| ai_sdlc.cost.token_cost | Token cost | Histogram | USD | Cost attributed to token usage per execution |
| ai_sdlc.cost.human_review_cost | Human review cost | Histogram | USD | Cost of human review time (configurable hourly rate per reviewer role) |
| ai_sdlc.cost.compute_cost | Compute cost | Histogram | USD | Non-token compute cost per execution (e.g., sandbox, CI) |
| ai_sdlc.cost.cache_savings | Cache savings | Counter | USD | Cumulative cost savings from cache hits (credited to benefiting agent) |
| ai_sdlc.cost.budget_consumed | Budget consumed | Gauge | Ratio (0-1) | Fraction of budget consumed in current period |
| ai_sdlc.cost.budget_remaining | Budget remaining | Gauge | USD | Remaining budget in current period |
| ai_sdlc.cost.cost_per_line | Cost per line | Histogram | USD/line | Cost per line of code produced |
| ai_sdlc.cost.cost_vs_estimate | Cost vs. estimate | Histogram | Ratio | Actual cost / estimated cost |
| ai_sdlc.cost.retry_waste | Retry waste | Counter | USD | Cumulative cost of failed attempts that were retried |
| ai_sdlc.cost.circuit_breaker_saves | Circuit breaker saves | Counter | USD | Estimated cost saved by circuit breaker interruptions |

Dimensions:

  • model — Model identifier
  • agent — Agent name
  • team — Team/namespace
  • stage — Pipeline stage name
  • complexity — Complexity tier (low, medium, high, critical)
  • repository — Repository identifier
  • outcome — Execution outcome (success, failure, aborted)
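
The composition of ai_sdlc.cost.per_task can be sketched as below. The per-token prices and hourly rate are purely hypothetical inputs, not values this spec prescribes.

```python
# Hypothetical prices; real values are deployment- and model-specific.
PRICE_PER_INPUT_TOKEN = 3.00 / 1_000_000    # USD per token (assumption)
PRICE_PER_OUTPUT_TOKEN = 15.00 / 1_000_000  # USD per token (assumption)

def cost_per_task(input_tokens, output_tokens, compute_cost_usd,
                  review_minutes, reviewer_hourly_rate_usd):
    """ai_sdlc.cost.per_task = token cost + compute + human review time."""
    token_cost = (input_tokens * PRICE_PER_INPUT_TOKEN
                  + output_tokens * PRICE_PER_OUTPUT_TOKEN)
    review_cost = review_minutes / 60 * reviewer_hourly_rate_usd
    return token_cost + compute_cost_usd + review_cost

total = cost_per_task(input_tokens=200_000, output_tokens=40_000,
                      compute_cost_usd=0.25, review_minutes=12,
                      reviewer_hourly_rate_usd=120.0)
```

With these example inputs the token cost is only 1.20 USD against 24 USD of review time, which is why the framework tracks ai_sdlc.cost.human_review_cost as its own metric.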

2.5 Autonomy Trajectory

Metrics tracking the progression of agent autonomy over time.

| Metric | Name | Type | Unit | Description |
|---|---|---|---|---|
| ai_sdlc.autonomy.level | Autonomy level | Gauge | Level (0-3) | Current autonomy level per agent |
| ai_sdlc.autonomy.complexity_handled | Complexity handled | Histogram | Score (1-10) | Distribution of task complexity scores handled at each level |
| ai_sdlc.autonomy.intervention_trend | Intervention rate trend | Gauge | Ratio (0-1) | Rolling average of intervention rate (should decrease over time) |
| ai_sdlc.autonomy.time_to_promotion | Time-to-promotion | Gauge | Seconds | Time spent at current level before promotion |

Dimensions:

  • agent — Agent name
  • from_level — Previous level (for promotions)
  • to_level — New level (for promotions)
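
A rolling average such as ai_sdlc.autonomy.intervention_trend can be maintained over a fixed window of recent tasks; the class name and window size below are arbitrary illustrations, not part of the spec.

```python
from collections import deque

class RollingInterventionRate:
    """Rolling ai_sdlc.autonomy.intervention_trend over the last N tasks."""
    def __init__(self, window: int = 50):
        self.events = deque(maxlen=window)  # True = human intervened

    def observe(self, intervened: bool) -> None:
        self.events.append(intervened)

    def rate(self) -> float:
        return sum(self.events) / len(self.events) if self.events else 0.0
```

As older interventions fall out of the window, the reported rate declines, producing the downward trend the metric is meant to show.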

3. OpenTelemetry Integration

This framework defines semantic conventions for AI-SDLC observability that extend OpenTelemetry's GenAI semantic conventions. All metric, trace, and log attribute names MUST use the ai_sdlc.* namespace prefix.

3.1 Traces

Implementations SHOULD generate spans for the following operations:

| Span Name | Description | Attributes |
|---|---|---|
| ai_sdlc.pipeline.stage | One span per pipeline stage execution | pipeline, stage, agent |
| ai_sdlc.agent.task | One span per agent task execution | agent, task_type, complexity |
| ai_sdlc.gate.evaluation | One span per quality gate evaluation | gate, enforcement, result |
| ai_sdlc.reconciliation.cycle | One span per reconciliation cycle | resource_kind, resource_name, result |
| ai_sdlc.handoff | One span per agent handoff | source_agent, target_agent, contract_id |

Spans SHOULD be linked into traces following the pipeline execution flow:

pipeline.stage (implement)
  ├─→ agent.task (code-agent)
  │    ├─→ gate.evaluation (test-coverage)
  │    └─→ gate.evaluation (security-scan)
  └─→ handoff (code-agent → reviewer-agent)
pipeline.stage (review)
  └─→ agent.task (reviewer-agent)
       └─→ gate.evaluation (human-review)
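
The nesting above can be illustrated with a minimal toy tracer. This is not the OpenTelemetry SDK; the class and method names are assumptions used only to show how nested `with` blocks map to parent/child spans.

```python
import contextlib

class Span:
    def __init__(self, name, parent=None):
        self.name, self.parent, self.children = name, parent, []
        if parent is not None:
            parent.children.append(self)

class Tracer:
    """Tracks the current span so nested `with` blocks become child spans."""
    def __init__(self):
        self.root = Span("trace")
        self._current = self.root

    @contextlib.contextmanager
    def span(self, name):
        s = Span(name, parent=self._current)
        self._current = s
        try:
            yield s
        finally:
            self._current = s.parent

t = Tracer()
with t.span("ai_sdlc.pipeline.stage"):            # implement
    with t.span("ai_sdlc.agent.task"):            # code-agent
        with t.span("ai_sdlc.gate.evaluation"):   # test-coverage
            pass
        with t.span("ai_sdlc.gate.evaluation"):   # security-scan
            pass
    with t.span("ai_sdlc.handoff"):               # code-agent -> reviewer-agent
        pass
with t.span("ai_sdlc.pipeline.stage"):            # review
    with t.span("ai_sdlc.agent.task"):            # reviewer-agent
        with t.span("ai_sdlc.gate.evaluation"):   # human-review
            pass
```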

3.2 Metrics

Implementations MUST expose metrics using the names and types defined in Section 2. Metrics SHOULD be exportable via OpenTelemetry Protocol (OTLP).

Instrument types:

  • Gauge — For values that can go up and down (e.g., autonomy level, rates)
  • Counter — For values that only increase (e.g., escalation count)
  • Histogram — For distributions (e.g., task completion time, cost per task)
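
The three instrument semantics can be captured in a few lines. This is a toy model for illustration, not the OpenTelemetry API.

```python
class Counter:
    """Monotonic, e.g. ai_sdlc.human.escalation_count."""
    def __init__(self):
        self.value = 0
    def add(self, n):
        if n < 0:
            raise ValueError("counters are monotonic")
        self.value += n

class Gauge:
    """Last value wins and may go down, e.g. ai_sdlc.autonomy.level."""
    def __init__(self):
        self.value = None
    def set(self, v):
        self.value = v

class Histogram:
    """Keeps the full distribution, e.g. ai_sdlc.task.completion_time."""
    def __init__(self):
        self.samples = []
    def record(self, v):
        self.samples.append(v)

escalations = Counter(); escalations.add(2); escalations.add(3)
level = Gauge(); level.set(2); level.set(1)   # a demotion lowers the gauge
durations = Histogram()
for s in (120.0, 90.0, 900.0):
    durations.record(s)
```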

3.3 Logs

Implementations MUST produce structured logs for the following events:

| Event | Required Fields | Description |
|---|---|---|
| reconciliation.decision | resource, action, reason, result | Every reconciliation decision |
| autonomy.promotion | agent, from_level, to_level, criteria_met | Agent promoted |
| autonomy.demotion | agent, from_level, to_level, trigger | Agent demoted |
| gate.override | gate, actor, role, justification | Quality gate overridden |
| gate.failure | gate, enforcement, reason | Quality gate failed |
| handoff.completed | source, target, contract, validation_result | Agent handoff completed |

Logs MUST be structured (JSON) and SHOULD be exportable via OTLP.
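
A minimal emitter for these events, using only the standard library, might look like this. The helper name and handler wiring are assumptions; the required fields come from the table above.

```python
import json
import logging

REQUIRED_FIELDS = {
    "gate.override": ["gate", "actor", "role", "justification"],
    "gate.failure": ["gate", "enforcement", "reason"],
}

def emit(event: str, **fields) -> str:
    """Validate required fields, then log the event as one JSON line."""
    missing = [f for f in REQUIRED_FIELDS.get(event, []) if f not in fields]
    if missing:
        raise ValueError(f"{event} missing required fields: {missing}")
    line = json.dumps({"event": event, **fields}, sort_keys=True)
    logging.getLogger("ai_sdlc").info(line)
    return line

line = emit("gate.override", gate="security-scan", actor="alice",
            role="tech-lead", justification="known false positive")
```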


4. Provenance Tracking

Every AI-generated artifact MUST record provenance metadata. Provenance enables attribution, auditability, and regulatory compliance.

4.1 Required Fields

| Field | Type | Required | Description |
|---|---|---|---|
| model | string | MUST | Model identifier (e.g., claude-sonnet-4-5-20250929). |
| tool | string | MUST | Tool that generated the artifact (e.g., claude-code@1.2.0). |
| promptHash | string | MUST | SHA-256 hash of the input prompt. |
| timestamp | string (date-time) | MUST | ISO 8601 generation time. |
| humanReviewer | string | MAY | Identity of the human who reviewed the artifact. |
| reviewDecision | string | MAY | One of: approved, rejected, revised. |
| cost | CostReceipt | MAY | Cost breakdown for this artifact. See RFC-0004. |

CostReceipt Fields

| Field | Type | Required | Description |
|---|---|---|---|
| totalCost | number | MUST | Total cost of producing this artifact. |
| currency | string | MUST | Currency code (e.g., USD). |
| breakdown | CostBreakdown | MUST | Itemized cost breakdown. |
| execution | ExecutionCostDetail | MAY | Detailed execution metrics. |

CostBreakdown Fields

| Field | Type | Required | Description |
|---|---|---|---|
| tokenCost | number | MUST | Cost attributed to token usage. |
| cacheSavings | number | MAY | Cost saved via cache hits. |
| computeCost | number | MAY | Non-token compute cost. |
| humanReviewCost | number | MAY | Cost of human review time. |

ExecutionCostDetail Fields

| Field | Type | Required | Description |
|---|---|---|---|
| inputTokens | number | MUST | Total input tokens consumed. |
| outputTokens | number | MUST | Total output tokens generated. |
| cacheReadTokens | number | MAY | Tokens served from cache. |
| modelCalls | number | MAY | Number of model API calls. |
| wallClockSeconds | number | MAY | Wall clock execution time in seconds. |
| retryCount | number | MAY | Number of retried attempts. |
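
Putting Section 4.1 together, assembling a minimal provenance record might look like this. Only the field names are normative; the function and dict layout are illustrative.

```python
import hashlib
from datetime import datetime, timezone

def provenance_record(model: str, tool: str, prompt: str) -> dict:
    """Builds the four MUST fields from Section 4.1."""
    return {
        "model": model,
        "tool": tool,
        "promptHash": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "timestamp": datetime.now(timezone.utc).isoformat(),  # ISO 8601
    }

rec = provenance_record("claude-sonnet-4-5-20250929", "claude-code@1.2.0",
                        "Implement the login handler")
```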

4.2 Storage

Provenance metadata SHOULD be stored:

  • In commit metadata (git trailers or notes)
  • In PR descriptions or comments
  • In a dedicated provenance store accessible via API

Implementations MUST ensure provenance records are immutable after creation.

4.3 Attribution

When QualityGate rules include requireAttribution: true, implementations MUST verify that provenance metadata is present and complete before admitting the artifact.
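
One possible admission check for that rule, assuming the field names from Section 4.1 (the function itself is not part of the spec):

```python
MUST_FIELDS = ("model", "tool", "promptHash", "timestamp")

def admit(provenance):
    """Admit an artifact only if all MUST provenance fields are present."""
    if provenance is None:
        return False
    return all(provenance.get(field) for field in MUST_FIELDS)
```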


5. Audit Trail Requirements

Every action in the system MUST produce an immutable, tamper-evident audit log entry.

5.1 Required Audit Fields

| Field | Type | Required | Description |
|---|---|---|---|
| id | string | MUST | Unique audit entry identifier. |
| timestamp | string (date-time) | MUST | When the action occurred. |
| actor | string | MUST | Identity of the actor (human or agent). |
| actorType | string | MUST | One of: human, ai-agent, bot, service-account. |
| action | string | MUST | Action performed (e.g., create, update, approve, override, promote, demote). |
| resource | string | MUST | Resource affected (kind/namespace/name). |
| policyEvaluated | string | MAY | Policy or gate that was evaluated. |
| decision | string | MUST | Decision rendered (e.g., allowed, denied, overridden). |
| details | object | MAY | Additional context (justification, metric values, etc.). |

5.2 Immutability

Audit log entries MUST NOT be modifiable after creation. Implementations SHOULD use append-only storage with integrity verification (e.g., hash chains, write-once storage).
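
One common tamper-evidence scheme is a hash chain, sketched below. The chaining format is one illustration of the idea, not something this spec mandates.

```python
import hashlib
import json

GENESIS = "0" * 64

def _link_hash(entry: dict, prev: str) -> str:
    """Hash of the serialized entry concatenated with the previous hash."""
    return hashlib.sha256(
        (json.dumps(entry, sort_keys=True) + prev).encode()).hexdigest()

def build_chain(entries):
    """Append-only: each link's hash covers the previous link's hash."""
    chained, prev = [], GENESIS
    for entry in entries:
        h = _link_hash(entry, prev)
        chained.append({"entry": entry, "prev": prev, "hash": h})
        prev = h
    return chained

def verify_chain(chained) -> bool:
    """Recompute every hash; any edited entry breaks all later links."""
    prev = GENESIS
    for link in chained:
        if link["prev"] != prev or link["hash"] != _link_hash(link["entry"], prev):
            return False
        prev = link["hash"]
    return True
```

Because each hash covers its predecessor, modifying any stored entry invalidates every subsequent link, making after-the-fact edits detectable.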

5.3 Retention

Implementations SHOULD support configurable retention policies for audit logs. The default retention period SHOULD be at least 12 months to support regulatory compliance.

5.4 Access

Audit logs MUST be accessible to authorized users via API. Implementations SHOULD support filtering by actor, action, resource, and time range.