E12 Hardening and Launch Readiness

1. Goal

Stabilize Conman for production rollout. This epic covers load testing, fault injection, SLO definition, operational dashboards, runbooks, rate limiting, and a security checklist. Unlike previous epics, E12 produces mostly tests, configuration, documentation, and observability infrastructure rather than new domain features.

After this epic, the team has concrete evidence that the system handles production-scale traffic and degrades gracefully under failure, and operators have the runbooks and alerts needed to respond to incidents.

Issues:


2. Dependencies

| Dependency | What it provides |
| --- | --- |
| E08 Releases | Release assembly, composition, tagging, and publish flows to load test |
| E09 Deployments | Deploy, promote, skip-stage, rollback flows to fault-test |
| E10 Temp Environments | Temp env lifecycle to verify cleanup under failure |
| E11 Notifications & Audit | Audit completeness and notification delivery to verify under load |

All prior epics (E00-E11) must be functionally complete before E12 begins. E12 validates the entire system as an integrated whole.


3. Rust Types

3.1 Metrics Registry (conman-api/src/metrics.rs)

Integration with the metrics crate for lightweight, Prometheus-compatible instrumentation. The metrics crate provides a facade; the metrics-exporter-prometheus crate provides the Prometheus text format exporter.

use metrics::{counter, gauge, histogram};
use metrics_exporter_prometheus::{PrometheusBuilder, PrometheusHandle};

/// Initialize the global metrics recorder and return a handle for the
/// Prometheus scrape endpoint.
///
/// Called once during server startup. Subsequent calls to `counter!`,
/// `gauge!`, and `histogram!` anywhere in the codebase will record into
/// this registry.
pub fn init_metrics() -> PrometheusHandle {
    PrometheusBuilder::new()
        .install_recorder()
        .expect("failed to install Prometheus metrics recorder")
}

// ── Metric names (constants prevent typos across crates) ──

/// Total HTTP requests received, labeled by method, path pattern, and status.
pub const HTTP_REQUESTS_TOTAL: &str = "conman_http_requests_total";

/// HTTP request duration in seconds, labeled by method and path pattern.
pub const HTTP_REQUEST_DURATION_SECONDS: &str = "conman_http_request_duration_seconds";

/// Total jobs enqueued, labeled by job type.
pub const JOBS_ENQUEUED_TOTAL: &str = "conman_jobs_enqueued_total";

/// Total jobs completed, labeled by job type and outcome (succeeded/failed).
pub const JOBS_COMPLETED_TOTAL: &str = "conman_jobs_completed_total";

/// Job processing duration in seconds, labeled by job type.
pub const JOB_DURATION_SECONDS: &str = "conman_job_duration_seconds";

/// Current number of jobs in `queued` state, labeled by job type.
pub const JOB_QUEUE_DEPTH: &str = "conman_job_queue_depth";

/// Total deployments attempted, labeled by outcome (succeeded/failed/canceled).
pub const DEPLOYMENTS_TOTAL: &str = "conman_deployments_total";

/// Total gitaly gRPC calls, labeled by method and outcome.
pub const GITALY_CALLS_TOTAL: &str = "conman_gitaly_calls_total";

/// Gitaly gRPC call duration in seconds, labeled by method.
pub const GITALY_CALL_DURATION_SECONDS: &str = "conman_gitaly_call_duration_seconds";

/// Total authentication failures (bad password, expired token, etc.).
pub const AUTH_FAILURES_TOTAL: &str = "conman_auth_failures_total";

/// Total rate-limited requests.
pub const RATE_LIMITED_TOTAL: &str = "conman_rate_limited_total";
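
For reference, a minimal sketch of how a job worker might record against these names with the metrics macros. The helper function, its parameters, and the label values are illustrative, not part of this spec; only the metric name constants and the macro API shown in init_metrics above are assumed.

use std::time::Instant;
use metrics::{counter, gauge, histogram};

use crate::metrics::{JOBS_COMPLETED_TOTAL, JOB_DURATION_SECONDS, JOB_QUEUE_DEPTH};

/// Record the outcome and duration of one processed job (illustrative helper).
fn record_job_outcome(job_type: &str, started: Instant, succeeded: bool, queue_depth: u64) {
    let outcome = if succeeded { "succeeded" } else { "failed" };

    // Counter labeled by job type and outcome.
    counter!(JOBS_COMPLETED_TOTAL, "type" => job_type.to_string(), "outcome" => outcome)
        .increment(1);

    // Histogram of end-to-end processing time in seconds.
    histogram!(JOB_DURATION_SECONDS, "type" => job_type.to_string())
        .record(started.elapsed().as_secs_f64());

    // Gauge showing how many jobs of this type are still queued.
    gauge!(JOB_QUEUE_DEPTH, "type" => job_type.to_string()).set(queue_depth as f64);
}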

3.2 HTTP Metrics Middleware (conman-api/src/middleware/metrics.rs)

Records request count and duration for every HTTP request.

use axum::{extract::Request, middleware::Next, response::Response};
use metrics::{counter, histogram};
use std::time::Instant;

use crate::metrics::{HTTP_REQUESTS_TOTAL, HTTP_REQUEST_DURATION_SECONDS};

/// Middleware that records HTTP request count and latency.
///
/// Labels: `method`, `path` (matched route pattern, not raw URL), `status`.
/// Must be applied as an outer layer so it captures the full request lifecycle.
pub async fn http_metrics_middleware(req: Request, next: Next) -> Response {
    let method = req.method().to_string();

    // Use the matched path pattern (e.g. "/api/apps/:appId") to avoid
    // high-cardinality labels from path parameters.
    let path = req
        .extensions()
        .get::<axum::extract::MatchedPath>()
        .map(|mp| mp.as_str().to_string())
        .unwrap_or_else(|| "unknown".to_string());

    let start = Instant::now();
    let response = next.run(req).await;
    let duration = start.elapsed().as_secs_f64();

    let status = response.status().as_u16().to_string();

    counter!(HTTP_REQUESTS_TOTAL, "method" => method.clone(), "path" => path.clone(), "status" => status)
        .increment(1);
    histogram!(HTTP_REQUEST_DURATION_SECONDS, "method" => method, "path" => path).record(duration);

    response
}
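
To satisfy the "outer layer" requirement, the middleware is registered on the router after the routes so it wraps them all. A minimal wiring sketch, assuming the handler module paths and an app_routes-style composition established in E00 (those names are placeholders):

use axum::{middleware, routing::get, Router};
use metrics_exporter_prometheus::PrometheusHandle;

use crate::handlers::health::health_check;
use crate::handlers::metrics_endpoint;
use crate::middleware::metrics::http_metrics_middleware;
use crate::state::AppState;

/// Build the API router with the metrics middleware as the outer layer.
pub fn router(state: AppState, prometheus: PrometheusHandle) -> Router {
    Router::new()
        .route("/api/health", get(health_check))
        .route("/api/metrics", get(metrics_endpoint).with_state(prometheus))
        // ... application routes merged here ...
        .layer(middleware::from_fn(http_metrics_middleware)) // wraps every route above
        .with_state(state)
}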

3.3 Enhanced Health Check (conman-api/src/handlers/health.rs)

Extends the E00 health endpoint with per-component status and version metadata.

use axum::extract::State;
use axum::http::StatusCode;
use axum::Json;
use serde::Serialize;

use crate::state::AppState;

/// Component-level health status.
#[derive(Debug, Clone, Serialize)]
#[serde(rename_all = "snake_case")]
pub enum ComponentStatus {
    /// Component is reachable and responding within acceptable latency.
    Healthy,
    /// Component is reachable but degraded (e.g. high latency, replica lag).
    Degraded,
    /// Component is unreachable or returning errors.
    Unhealthy,
}

/// Individual component health report.
#[derive(Debug, Clone, Serialize)]
pub struct ComponentHealth {
    pub name: &'static str,
    pub status: ComponentStatus,
    /// Human-readable detail (e.g. "ping: 2ms", "connection refused").
    #[serde(skip_serializing_if = "Option::is_none")]
    pub detail: Option<String>,
}

/// Enhanced health response with component-level breakdown.
#[derive(Debug, Clone, Serialize)]
pub struct HealthResponse {
    /// Overall status: "ok" if all components healthy, "degraded" otherwise.
    pub status: &'static str,
    /// Application version from compile-time env.
    pub version: &'static str,
    /// Individual component checks.
    pub components: Vec<ComponentHealth>,
}

/// GET /api/health
///
/// Returns detailed health status for each dependency. Returns 200 when all
/// components are healthy, 503 when any component is unhealthy. Does not
/// require authentication.
pub async fn health_check(State(state): State<AppState>) -> (StatusCode, Json<HealthResponse>) {
    let mut components = Vec::with_capacity(3);
    let mut all_healthy = true;

    // Check MongoDB connectivity.
    let mongo_health = match check_mongo(&state).await {
        Ok(detail) => ComponentHealth {
            name: "mongodb",
            status: ComponentStatus::Healthy,
            detail: Some(detail),
        },
        Err(detail) => {
            all_healthy = false;
            ComponentHealth {
                name: "mongodb",
                status: ComponentStatus::Unhealthy,
                detail: Some(detail),
            }
        }
    };
    components.push(mongo_health);

    // Check Gitaly gRPC channel.
    let gitaly_health = match check_gitaly(&state).await {
        Ok(detail) => ComponentHealth {
            name: "gitaly",
            status: ComponentStatus::Healthy,
            detail: Some(detail),
        },
        Err(detail) => {
            all_healthy = false;
            ComponentHealth {
                name: "gitaly",
                status: ComponentStatus::Unhealthy,
                detail: Some(detail),
            }
        }
    };
    components.push(gitaly_health);

    // Check job runner liveness.
    let job_runner_health = match check_job_runner(&state).await {
        Ok(detail) => ComponentHealth {
            name: "job_runner",
            status: ComponentStatus::Healthy,
            detail: Some(detail),
        },
        Err(detail) => {
            all_healthy = false;
            ComponentHealth {
                name: "job_runner",
                status: ComponentStatus::Unhealthy,
                detail: Some(detail),
            }
        }
    };
    components.push(job_runner_health);

    let status = if all_healthy { "ok" } else { "degraded" };
    let http_status = if all_healthy {
        StatusCode::OK
    } else {
        StatusCode::SERVICE_UNAVAILABLE
    };

    (
        http_status,
        Json(HealthResponse {
            status,
            version: env!("CARGO_PKG_VERSION"),
            components,
        }),
    )
}

/// Ping MongoDB and return round-trip time.
async fn check_mongo(state: &AppState) -> Result<String, String> {
    let start = std::time::Instant::now();
    conman_db::check_mongo_health(&state.db)
        .await
        .map(|_| format!("ping: {}ms", start.elapsed().as_millis()))
        .map_err(|e| e.to_string())
}

/// Verify the Gitaly gRPC channel is connected.
async fn check_gitaly(state: &AppState) -> Result<String, String> {
    match &state.gitaly_channel {
        Some(_channel) => {
            // Attempt a lightweight ServerInfo or similar RPC.
            // For now, channel existence indicates the connection was established.
            Ok("channel connected".to_string())
        }
        None => Err("channel not available".to_string()),
    }
}

/// Verify the job runner is alive by checking its heartbeat timestamp.
async fn check_job_runner(state: &AppState) -> Result<String, String> {
    // The job runner writes a heartbeat timestamp to a known MongoDB document.
    // If the heartbeat is older than 60 seconds, the runner is considered unhealthy.
    let _ = state;
    // Placeholder -- the real implementation queries the `job_runner_heartbeat`
    // document from MongoDB (see the sketch after this function).
    Ok("heartbeat current".to_string())
}
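
A sketch of the heartbeat query described in the comments above, assuming the 2.x mongodb driver and a `job_runner_heartbeat` document with an `updated_at` field maintained by the job runner (collection and field names are illustrative until the wiring lands):

use mongodb::bson::doc;

/// Assumed shape of the heartbeat document the job runner maintains.
#[derive(serde::Deserialize)]
struct Heartbeat {
    updated_at: mongodb::bson::DateTime,
}

async fn check_job_runner_heartbeat(db: &mongodb::Database) -> Result<String, String> {
    let heartbeat: Option<Heartbeat> = db
        .collection::<Heartbeat>("job_runner_heartbeat")
        .find_one(doc! {}, None)
        .await
        .map_err(|e| e.to_string())?;

    match heartbeat {
        Some(hb) => {
            let age_ms =
                mongodb::bson::DateTime::now().timestamp_millis() - hb.updated_at.timestamp_millis();
            if age_ms <= 60_000 {
                Ok(format!("heartbeat {}s old", age_ms / 1000))
            } else {
                Err(format!("heartbeat stale: {}s old", age_ms / 1000))
            }
        }
        None => Err("no heartbeat document found".to_string()),
    }
}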

3.4 Rate Limiter (conman-api/src/middleware/rate_limit.rs)

Per-user rate limiting using a token bucket algorithm backed by an in-memory store. For single-instance v1 deployment this is sufficient; a Redis-backed implementation can replace the store later without changing the middleware.

use axum::{
    extract::{Request, State},
    http::StatusCode,
    middleware::Next,
    response::{IntoResponse, Response},
};
use dashmap::DashMap;
use std::sync::Arc;
use std::time::{Duration, Instant};

use crate::response::{ApiError, ApiErrorBody};
use crate::request_context::RequestContext;

/// Configuration for the rate limiter.
#[derive(Debug, Clone)]
pub struct RateLimitConfig {
    /// Maximum requests per window per user.
    pub max_requests: u64,
    /// Window duration.
    pub window: Duration,
}

impl Default for RateLimitConfig {
    fn default() -> Self {
        Self {
            max_requests: 100,
            window: Duration::from_secs(60),
        }
    }
}

/// Per-user token bucket entry.
#[derive(Debug, Clone)]
struct BucketEntry {
    remaining: u64,
    window_start: Instant,
}

/// In-memory rate limit store. One entry per authenticated user.
///
/// Thread-safe via `DashMap`. Entries are lazily evicted when accessed
/// past their window.
#[derive(Debug, Clone)]
pub struct RateLimitStore {
    buckets: Arc<DashMap<String, BucketEntry>>,
    config: RateLimitConfig,
}

impl RateLimitStore {
    pub fn new(config: RateLimitConfig) -> Self {
        Self {
            buckets: Arc::new(DashMap::new()),
            config,
        }
    }

    /// Attempt to consume one token for the given user.
    /// Returns `Ok(remaining)` if allowed, `Err(())` if rate limited.
    pub fn check(&self, user_id: &str) -> Result<u64, ()> {
        let now = Instant::now();
        let mut entry = self.buckets.entry(user_id.to_string()).or_insert_with(|| {
            BucketEntry {
                remaining: self.config.max_requests,
                window_start: now,
            }
        });

        // Reset window if expired.
        if now.duration_since(entry.window_start) >= self.config.window {
            entry.remaining = self.config.max_requests;
            entry.window_start = now;
        }

        if entry.remaining == 0 {
            return Err(());
        }

        entry.remaining -= 1;
        Ok(entry.remaining)
    }
}

/// Rate limiting middleware, applied via `axum::middleware::from_fn_with_state`.
///
/// Extracts the authenticated user ID from request extensions (set by the
/// auth middleware). Unauthenticated requests are not rate-limited here
/// (they are rejected by the auth middleware first).
///
/// Returns 429 Too Many Requests when the limit is exceeded.
pub async fn rate_limit_middleware(
    State(store): State<RateLimitStore>,
    req: Request,
    next: Next,
) -> Response {
    // Extract user ID from auth context if present.
    let user_id = req
        .extensions()
        .get::<crate::auth::AuthUser>()
        .map(|u| u.user_id.to_string());

    if let Some(uid) = user_id {
        match store.check(&uid) {
            Ok(remaining) => {
                let mut response = next.run(req).await;
                // Attach rate limit headers for client awareness.
                if let Ok(val) = remaining.to_string().parse() {
                    response.headers_mut().insert("X-RateLimit-Remaining", val);
                }
                response
            }
            Err(()) => {
                metrics::counter!(
                    crate::metrics::RATE_LIMITED_TOTAL,
                    "user_id" => uid
                )
                .increment(1);

                let body = ApiError {
                    error: ApiErrorBody {
                        code: "rate_limited",
                        message: "Too many requests. Please wait and try again.".to_string(),
                        request_id: RequestContext::current_request_id(),
                    },
                };

                (StatusCode::TOO_MANY_REQUESTS, axum::Json(body)).into_response()
            }
        }
    } else {
        // No authenticated user -- skip rate limiting.
        next.run(req).await
    }
}
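
A wiring sketch for this middleware, assuming axum's from_fn_with_state and the default config above; the helper function is illustrative, and the auth middleware must wrap this layer (run earlier in the request path) so AuthUser is already in the request extensions:

use axum::{middleware, Router};

use crate::middleware::rate_limit::{rate_limit_middleware, RateLimitConfig, RateLimitStore};

/// Apply per-user rate limiting to an already-built router (illustrative helper).
pub fn with_rate_limiting(router: Router, config: RateLimitConfig) -> Router {
    let store = RateLimitStore::new(config);
    router.layer(middleware::from_fn_with_state(store, rate_limit_middleware))
}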

4. Database

4.1 Index Review

E12 does not introduce new collections. Instead, it audits all existing indexes across every collection for query performance under load. The following index review must be completed and verified:

| Collection | Index | Purpose | Type |
| --- | --- | --- | --- |
| apps | { name: 1 } | App lookup by name | unique |
| apps | { repo_path: 1 } | App lookup by repo path | unique |
| app_memberships | { user_id: 1, app_id: 1 } | Membership lookup | unique compound |
| app_memberships | { app_id: 1 } | List members of an app | standard |
| workspaces | { app_id: 1, owner_user_id: 1 } | User's workspaces per app | compound |
| workspaces | { app_id: 1, branch_name: 1 } | Branch uniqueness per app | unique compound |
| changesets | { app_id: 1, state: 1 } | List changesets by state (queue view) | compound |
| changesets | { workspace_id: 1, state: 1 } | One open changeset per workspace | compound |
| changesets | { app_id: 1, author_user_id: 1 } | User's changesets per app | compound |
| changeset_revisions | { changeset_id: 1, revision_number: 1 } | Revision lookup | unique compound |
| changeset_reviews | { changeset_id: 1 } | Reviews for a changeset | standard |
| changeset_comments | { changeset_id: 1, created_at: 1 } | Paginated comment listing | compound |
| release_batches | { app_id: 1, state: 1 } | Releases by state | compound |
| release_batches | { app_id: 1, tag: 1 } | Release lookup by tag | unique compound |
| release_changesets | { release_id: 1 } | Changesets in a release | standard |
| environments | { app_id: 1, name: 1 } | Env name uniqueness per app | unique compound |
| environments | { app_id: 1, position: 1 } | Env position uniqueness per app | unique compound |
| deployments | { app_id: 1, environment_id: 1, state: 1 } | Active deployment lock per env | compound |
| deployments | { release_id: 1 } | Deployments for a release | standard |
| temp_environments | { app_id: 1, state: 1 } | Active temp envs per app | compound |
| temp_environments | { expires_at: 1 } | TTL expiry scan (job runner) | standard |
| jobs | { state: 1, created_at: 1 } | Job polling (FIFO by state) | compound |
| jobs | { app_id: 1, type: 1, state: 1 } | Job lookup by app and type | compound |
| audit_events | { app_id: 1, occurred_at: -1 } | Audit timeline per app | compound |
| audit_events | { entity_type: 1, entity_id: 1 } | Audit for specific entity | compound |
| notification_preferences | { user_id: 1 } | User prefs lookup | unique |
| invites | { app_id: 1, email: 1 } | Invite uniqueness per app | unique compound |
| invites | { token: 1 } | Invite acceptance lookup | unique |
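
For the verification step, a sketch of ensuring one of these indexes at startup, assuming the 2.x mongodb crate's IndexModel API; the helper name is illustrative, and the uniqueness flag mirrors the table:

use mongodb::bson::doc;
use mongodb::options::IndexOptions;
use mongodb::{Database, IndexModel};

/// Ensure the unique `{ name: 1 }` index on `apps` exists.
/// `create_index` is a no-op if an identical index is already present.
async fn ensure_apps_name_index(db: &Database) -> mongodb::error::Result<()> {
    let index = IndexModel::builder()
        .keys(doc! { "name": 1 })
        .options(IndexOptions::builder().unique(true).build())
        .build();

    db.collection::<mongodb::bson::Document>("apps")
        .create_index(index, None)
        .await?;
    Ok(())
}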

Action items:

4.2 Read Preference and Write Concern

For production MongoDB replica set deployment:

| Operation type | Read preference | Write concern | Rationale |
| --- | --- | --- | --- |
| Health check ping | primaryPreferred | n/a | Tolerate primary failover for health |
| Reads (listings, detail) | secondaryPreferred | n/a | Spread read load; accept slight staleness |
| Writes (mutations) | n/a | { w: "majority", j: true } | Durability: acknowledged by majority with journal |
| Job polling | primary | { w: "majority" } | Avoid duplicate job pickup during failover |
| Audit event writes | n/a | { w: 1, j: false } | Fire-and-forget; acceptable to lose rare event under crash |

These should be configured per-operation, not globally, using the MongoDB driver's ReadPreference and WriteConcern options on individual collection handles or operation options.
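
A sketch of the per-handle configuration, assuming the 2.x mongodb crate; the collection name, document type, and helper functions are illustrative:

use mongodb::options::{
    Acknowledgment, CollectionOptions, ReadPreference, SelectionCriteria, WriteConcern,
};
use mongodb::{Collection, Database};

/// Read-mostly handle: listings tolerate slight staleness (secondaryPreferred).
fn changesets_read(db: &Database) -> Collection<mongodb::bson::Document> {
    let options = CollectionOptions::builder()
        .selection_criteria(SelectionCriteria::ReadPreference(
            ReadPreference::SecondaryPreferred { options: Default::default() },
        ))
        .build();
    db.collection_with_options("changesets", options)
}

/// Write handle: mutations require majority acknowledgment with journaling.
fn changesets_write(db: &Database) -> Collection<mongodb::bson::Document> {
    let options = CollectionOptions::builder()
        .write_concern(
            WriteConcern::builder()
                .w(Acknowledgment::Majority)
                .journal(true)
                .build(),
        )
        .build();
    db.collection_with_options("changesets", options)
}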

4.3 Backup Strategy


5. API Endpoints

5.1 GET /api/health (enhanced)

Replaces the E00 basic health check with component-level status.

| Attribute | Value |
| --- | --- |
| Auth | None (public) |
| Rate limit | Exempt |

Response 200 (all components healthy):

{
  "status": "ok",
  "version": "0.1.0",
  "components": [
    { "name": "mongodb", "status": "healthy", "detail": "ping: 2ms" },
    { "name": "gitaly", "status": "healthy", "detail": "channel connected" },
    { "name": "job_runner", "status": "healthy", "detail": "heartbeat current" }
  ]
}

Response 503 (one or more components unhealthy):

{
  "status": "degraded",
  "version": "0.1.0",
  "components": [
    { "name": "mongodb", "status": "healthy", "detail": "ping: 3ms" },
    { "name": "gitaly", "status": "unhealthy", "detail": "channel not available" },
    { "name": "job_runner", "status": "healthy", "detail": "heartbeat current" }
  ]
}

5.2 GET /api/metrics (Prometheus scrape endpoint)

| Attribute | Value |
| --- | --- |
| Auth | None (should be network-restricted in production via firewall/ingress rules) |
| Rate limit | Exempt |
| Content-Type | text/plain; version=0.0.4; charset=utf-8 |

Response 200:

# HELP conman_http_requests_total Total HTTP requests received.
# TYPE conman_http_requests_total counter
conman_http_requests_total{method="GET",path="/api/apps",status="200"} 1042
conman_http_requests_total{method="POST",path="/api/apps/:appId/changesets",status="201"} 87

# HELP conman_http_request_duration_seconds HTTP request duration.
# TYPE conman_http_request_duration_seconds histogram
conman_http_request_duration_seconds_bucket{method="GET",path="/api/apps",le="0.1"} 980
...

# HELP conman_job_queue_depth Current queued jobs.
# TYPE conman_job_queue_depth gauge
conman_job_queue_depth{type="revalidate_queued_changeset"} 3
conman_job_queue_depth{type="deploy_release"} 0
...

Handler implementation:

use axum::extract::State;
use axum::response::IntoResponse;
use metrics_exporter_prometheus::PrometheusHandle;

/// GET /api/metrics
///
/// Returns metrics in Prometheus text exposition format. Not authenticated
/// -- restrict access via network policy in production.
pub async fn metrics_endpoint(
    State(handle): State<PrometheusHandle>,
) -> impl IntoResponse {
    handle.render()
}

6. Business Logic

6.1 Load Test Scenarios

All load tests use a dedicated test environment with a populated MongoDB and a Gitaly instance backed by realistic repository data (not empty repos).

| # | Scenario | Parameters | Target | Tool |
| --- | --- | --- | --- | --- |
| L1 | Concurrent file edits | 50 users, each editing 5 files in their own workspace | All 250 edits succeed within 2s per request | k6 or drill |
| L2 | Changeset submission storm | 50 concurrent changeset submissions, each triggering msuite_submit job | All 50 submissions accepted, jobs enqueued within 1s | k6 |
| L3 | Queue with 100+ changesets | Seed 150 queued changesets, then publish a release of 10 | Post-publish revalidation of remaining 140 completes within 10 minutes | custom Rust test harness |
| L4 | Rapid release cycle | 5 releases published sequentially with 60s gap, each with 5 changesets | No data corruption, all revalidation loops complete, no orphaned jobs | custom Rust test harness |
| L5 | Large repository operations | Repository with 10,000+ files across 500+ directories. Perform tree listing, file read, diff operations | Tree listing < 3s, single file read < 500ms, diff < 5s | k6 |
| L6 | Deployment pipeline | 10 concurrent deploy requests across different environments for different apps | Each deployment runs to completion, environment locks enforced, no double-deploys | k6 |
| L7 | API listing under load | 50 concurrent requests to GET /api/apps/:appId/changesets?state=queued with 500 changesets in DB | p99 response time < 500ms, no timeouts | k6 |

6.2 Fault Injection Scenarios

| # | Scenario | Injection method | Expected behavior |
| --- | --- | --- | --- |
| F1 | Gitaly connection drop | Kill gitaly process (or iptables drop) mid-request | API returns 502 git_error. Retry logic in GitalyClient attempts 3 retries with backoff. Non-git endpoints remain operational. Health endpoint reports gitaly unhealthy. |
| F2 | Gitaly slow response | Inject 10s delay on gitaly responses (tc netem or proxy) | Requests with git operations time out at configured deadline (default 30s). Client receives 504 or 502. Non-git endpoints unaffected. |
| F3 | MongoDB primary failover | rs.stepDown() on primary | Writes fail briefly during election (typically 2-10s). Health endpoint returns 503 during election. After new primary elected, operations resume automatically. No data loss for majority-acknowledged writes. |
| F4 | MongoDB full outage | Stop all replica set members | All API endpoints return 500/503. Health endpoint returns 503 with mongodb unhealthy. Server does not crash. Recovery is automatic when MongoDB comes back. |
| F5 | Job worker crash mid-execution | kill -9 the process while a job is in running state | Job remains in running state with stale locked_until. Job runner picks it up after lock expiry (configurable, default 5 minutes). Job is retried. Idempotency ensures no duplicate side effects. |
| F6 | Job worker crash during revalidation storm | Kill worker while 50+ revalidation jobs are in progress | All in-progress jobs are re-picked after lock expiry. Remaining queued jobs are processed normally. No changeset is left in an inconsistent state. |
| F7 | Network partition between API and job runner | Block network between API server and MongoDB for job runner only | API continues serving reads from secondary. Job runner stops picking jobs. Health endpoint shows job_runner degraded. When partition heals, job runner resumes. |
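
The behavior expected in F5 and F6 hinges on an atomic pickup query that also reclaims stale locks. A sketch of how that query might look with the 2.x mongodb crate; the field names follow the jobs index table in section 4.1 and test 9.4, the lock window is the 5-minute default from F5, and the function itself is illustrative of the E06 runner's behavior rather than its actual code:

use mongodb::bson::{doc, DateTime, Document};
use mongodb::options::FindOneAndUpdateOptions;
use mongodb::Collection;

/// Atomically claim the next runnable job: either queued, or running with an
/// expired lock (crashed worker). Increments `attempts` and extends the lock.
async fn claim_next_job(jobs: &Collection<Document>) -> mongodb::error::Result<Option<Document>> {
    let now = DateTime::now();
    let lock_until = DateTime::from_millis(now.timestamp_millis() + 5 * 60 * 1000);

    let filter = doc! {
        "$or": [
            { "state": "queued" },
            { "state": "running", "locked_until": { "$lt": now } },
        ]
    };
    let update = doc! {
        "$set": { "state": "running", "locked_until": lock_until },
        "$inc": { "attempts": 1 },
    };
    let options = FindOneAndUpdateOptions::builder()
        .sort(doc! { "created_at": 1 }) // FIFO, matching the { state, created_at } index
        .build();

    jobs.find_one_and_update(filter, update, options).await
}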

6.3 SLO Definitions

These SLOs apply to the production deployment. They are measured over a rolling 30-day window.

| SLO | Metric | Target | Measurement |
| --- | --- | --- | --- |
| API availability | Successful (non-5xx) responses / total responses | >= 99.9% | Prometheus: rate(conman_http_requests_total{status!~"5.."}[30d]) / rate(conman_http_requests_total[30d]) |
| API latency (p99) | 99th percentile response time for non-background endpoints | < 500ms | Prometheus: histogram_quantile(0.99, rate(conman_http_request_duration_seconds_bucket[5m])) |
| Job processing (p99) | 99th percentile time from job enqueue to completion | < 30s | Custom metric: conman_job_duration_seconds |
| Job processing (p99) for deployments | 99th percentile deploy job duration | < 120s | conman_job_duration_seconds{type="deploy_release"} |
| Deployment success rate | Succeeded deployments / total non-canceled deployments | >= 99% | Prometheus: rate(conman_deployments_total{outcome="succeeded"}[30d]) / rate(conman_deployments_total{outcome!="canceled"}[30d]) |
| Revalidation turnaround | Time from release publish to all queued changeset revalidations complete | < 10 minutes for 100 queued changesets | Custom metric with event timestamps |

Alert thresholds (Prometheus alerting rules):

| Alert | Condition | Severity | Action |
| --- | --- | --- | --- |
| ConmanHighErrorRate | 5xx rate > 1% over 5 minutes | P1 | Page on-call |
| ConmanHighLatency | p99 latency > 1s over 5 minutes | P2 | Notify on-call |
| ConmanJobQueueBacklog | conman_job_queue_depth > 50 for any type for 10 minutes | P2 | Notify on-call, check job runner health |
| ConmanJobRunnerDown | Job runner heartbeat stale > 2 minutes | P1 | Page on-call, restart job runner |
| ConmanGitalyUnhealthy | Health check reports gitaly unhealthy for 2 minutes | P1 | Page on-call, check gitaly process |
| ConmanMongoUnhealthy | Health check reports mongodb unhealthy for 1 minute | P1 | Page on-call, check replica set |
| ConmanDeploymentFailure | Any deployment enters failed state | P2 | Notify config manager and on-call |
| ConmanTempEnvLeaking | Temp environments in expired state with grace_until in the past > 1 hour | P3 | Investigate cleanup job |

6.4 Rate Limiting

Per-user rate limits applied after authentication middleware:

| Scope | Limit | Window | Notes |
| --- | --- | --- | --- |
| Global per-user | 100 requests | 60 seconds | Applies to all authenticated endpoints |
| Write endpoints (POST/PUT/PATCH/DELETE) | 30 requests | 60 seconds | Prevents mutation storms |
| Auth endpoints (/api/auth/*) | 10 requests | 60 seconds | Brute-force protection (per IP, not per user) |

Rate limit response (HTTP 429):

{
  "error": {
    "code": "rate_limited",
    "message": "Too many requests. Please wait and try again.",
    "request_id": "018f2f35-2e63-7b3b-b5e1-9f0d3a2c4b10"
  }
}

Response headers on all authenticated requests:

X-RateLimit-Limit: 100
X-RateLimit-Remaining: 73
X-RateLimit-Reset: 1719849600

6.5 Security Checklist

Every item must be verified with a passing test before launch.

Authentication:

| # | Item | Requirement | Verification |
| --- | --- | --- | --- |
| S1 | Password minimum length | >= 12 characters | Unit test in conman-auth |
| S2 | Password hashing algorithm | Argon2id with m_cost=19456, t_cost=2, p_cost=1 (OWASP recommendation) | Unit test verifying hash format |
| S3 | Password hash timing | Verification takes 100-500ms (costly enough to slow brute-force attempts while keeping login UX acceptable) | Benchmark test |
| S4 | JWT expiry | 24 hours (configurable via CONMAN_JWT_EXPIRY_HOURS) | Integration test: token issued, wait (or mock time), verify rejection |
| S5 | JWT secret strength | Minimum 32 bytes, validated at startup | Startup validation in Config::from_env() |
| S6 | Invite token expiry | 7 days (configurable via CONMAN_INVITE_EXPIRY_DAYS) | Integration test: expired invite rejected |
| S7 | Password reset token | Single-use, 1-hour expiry | Integration test: used token rejected on second use |
| S8 | Failed login throttling | After 5 failed attempts for same email, enforce 15-minute cooldown | Integration test |
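
A sketch of the S2 parameters with the argon2 crate; the constants mirror the table, and the helper function is illustrative rather than the actual conman-auth code:

use argon2::{Algorithm, Argon2, Params, Version};

/// Build an Argon2id hasher with the OWASP-recommended parameters from S2:
/// m_cost = 19456 KiB, t_cost = 2 iterations, p_cost = 1 lane.
fn password_hasher() -> Result<Argon2<'static>, argon2::Error> {
    let params = Params::new(19_456, 2, 1, None)?;
    Ok(Argon2::new(Algorithm::Argon2id, Version::V0x13, params))
}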

Authorization (RBAC):

| # | Item | Requirement | Verification |
| --- | --- | --- | --- |
| S9 | user cannot approve changeset | Returns 403 | Integration test |
| S10 | user cannot assemble release | Returns 403 | Integration test |
| S11 | user cannot deploy | Returns 403 | Integration test |
| S12 | reviewer cannot assemble release | Returns 403 | Integration test |
| S13 | reviewer cannot manage settings | Returns 403 | Integration test |
| S14 | Non-member cannot access app | Returns 403 | Integration test |
| S15 | Role escalation via API | Sending role: "app_admin" in membership update as non-admin returns 403 | Integration test |
| S16 | Cross-app access | User with role on app A cannot access app B resources | Integration test |

Input validation:

| # | Item | Requirement | Verification |
| --- | --- | --- | --- |
| S17 | NoSQL injection in query params | ?name[$gt]= and similar MongoDB operator injection attempts are rejected or sanitized | Integration test |
| S18 | NoSQL injection in JSON body | {"name": {"$gt": ""}} rejected by type-safe deserialization (serde rejects objects where String expected) | Unit test |
| S19 | Path traversal in file operations | ../../etc/passwd and similar traversal in file path parameter is blocked | Integration test |
| S20 | Path traversal with encoded chars | ..%2F..%2Fetc%2Fpasswd is blocked | Integration test |
| S21 | Blocked path enforcement | Editing .git/config or .github/workflows/ci.yml returns 403 | Integration test |
| S22 | File size limit enforcement | Upload exceeding file_size_limit_bytes returns 400 | Integration test |
| S23 | Request body size limit | Request bodies > 10 MB rejected at middleware level | Integration test |
| S24 | XSS in changeset comments | HTML/script tags in comment body are stored as-is (no execution context in API-only backend) but validated for max length | Unit test |
| S25 | Branch name injection | Workspace branch names cannot contain .., leading -, or shell metacharacters | Unit test |
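
A sketch of the path check behind S19-S21, written against percent-decoded input; the blocked prefixes and the error codes are illustrative of the rule, not the actual handler code:

/// Reject absolute paths, parent-directory traversal, and blocked repository
/// paths (S19, S20 after percent-decoding, S21). Returns a static error code
/// the handler can map to 400 or 403.
fn validate_repo_path(path: &str) -> Result<(), &'static str> {
    if path.starts_with('/') || path.starts_with('\\') {
        return Err("absolute_path");
    }
    // Component-wise check catches "config/../../etc" as well as a leading "..".
    if path.split(|c: char| c == '/' || c == '\\').any(|component| component == "..") {
        return Err("path_traversal");
    }
    const BLOCKED_PREFIXES: &[&str] = &[".git/", ".github/workflows/"];
    if path == ".git" || BLOCKED_PREFIXES.iter().any(|prefix| path.starts_with(prefix)) {
        return Err("blocked_path");
    }
    Ok(())
}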

7. Gitaly-rs Integration

7.1 Connection Resilience Testing

The GitalyClient retry logic (introduced in E01) must be validated under adversarial conditions:

| Test | Setup | Expected |
| --- | --- | --- |
| Retry on UNAVAILABLE | Mock gitaly returns UNAVAILABLE twice, then success | Operation succeeds after 3rd attempt |
| Retry on DEADLINE_EXCEEDED | Mock gitaly returns DEADLINE_EXCEEDED once, then success | Operation succeeds after 2nd attempt |
| No retry on NOT_FOUND | Mock gitaly returns NOT_FOUND | Operation fails immediately, no retry |
| No retry on INVALID_ARGUMENT | Mock gitaly returns INVALID_ARGUMENT | Operation fails immediately, no retry |
| Max retries exhausted | Mock gitaly returns UNAVAILABLE 4 times | Operation fails after 3 retries |
| Backoff timing | Mock gitaly returns UNAVAILABLE 3 times, measure delays | Delays follow exponential backoff: ~100ms, ~200ms, ~400ms (+/- jitter) |
| Channel reconnect after restart | Stop gitaly, wait 5s, restart gitaly, make request | Request succeeds (Tonic channel reconnects automatically) |
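
For the backoff-timing test, a sketch of the delay schedule the table assumes: a 100 ms base doubling per attempt, plus jitter. The 25% jitter range and the `rand` crate usage are assumptions; the E01 retry logic remains authoritative.

use rand::Rng;
use std::time::Duration;

/// Delay before retry `attempt` (0-based): ~100ms, ~200ms, ~400ms, ... plus
/// up to 25% jitter so concurrent retries do not synchronize.
fn retry_backoff(attempt: u32) -> Duration {
    let base_ms = 100u64 * 2u64.pow(attempt);
    let jitter_ms = rand::thread_rng().gen_range(0..=base_ms / 4);
    Duration::from_millis(base_ms + jitter_ms)
}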

7.2 Timeout Configuration

| Operation | Recommended timeout | Rationale |
| --- | --- | --- |
| RepositoryExists / CreateRepository | 5s | Lightweight metadata operations |
| TreeEntry / GetBlob | 10s | File reads scale with file size |
| CommitDiff | 30s | Diffs on large changesets can be expensive |
| UserCommitFiles | 30s | Streaming writes for multi-file commits |
| MergeToRef / UserMergeBranch | 60s | Merge operations on large repos may be slow |

These timeouts should be configurable via environment variables:

CONMAN_GITALY_TIMEOUT_DEFAULT=10s
CONMAN_GITALY_TIMEOUT_DIFF=30s
CONMAN_GITALY_TIMEOUT_MERGE=60s
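
A sketch of reading these variables and applying a per-call deadline, assuming the humantime crate for parsing the 10s/30s values and tonic's per-request set_timeout; the config struct, field names, and defaults are illustrative:

use std::time::Duration;

/// Per-operation Gitaly timeouts, populated from the environment with the
/// defaults from the table above.
pub struct GitalyTimeouts {
    pub default: Duration,
    pub diff: Duration,
    pub merge: Duration,
}

impl GitalyTimeouts {
    pub fn from_env() -> Self {
        fn read(var: &str, fallback: Duration) -> Duration {
            std::env::var(var)
                .ok()
                .and_then(|v| humantime::parse_duration(&v).ok())
                .unwrap_or(fallback)
        }
        Self {
            default: read("CONMAN_GITALY_TIMEOUT_DEFAULT", Duration::from_secs(10)),
            diff: read("CONMAN_GITALY_TIMEOUT_DIFF", Duration::from_secs(30)),
            merge: read("CONMAN_GITALY_TIMEOUT_MERGE", Duration::from_secs(60)),
        }
    }

    /// Attach the chosen deadline to an outgoing gRPC request.
    pub fn apply<T>(&self, mut req: tonic::Request<T>, timeout: Duration) -> tonic::Request<T> {
        req.set_timeout(timeout); // sets the grpc-timeout header for this call
        req
    }
}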

8. Implementation Checklist

This epic is test-heavy and configuration-focused. Steps are organized by sub-issue rather than sequential commits.

E12-01: Load and Performance Testing

E12-02: Fault Injection Testing

E12-03: SLOs and Operational Dashboards

E12-04: Runbooks

All runbooks written as markdown in docs/runbooks/. Each follows the template: Trigger, Impact, Diagnosis, Resolution, Prevention.

E12-05: Security Hardening


9. Test Cases

9.1 Load test: 50 concurrent users editing files

// k6 script: tests/load/concurrent_edits.js
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
    vus: 50,
    duration: '2m',
    thresholds: {
        http_req_duration: ['p(99)<2000'],  // p99 < 2s
        http_req_failed: ['rate<0.01'],      // < 1% failure rate
    },
};

export default function () {
    const userId = __VU;
    const workspaceId = `ws-${userId}`;
    const fileName = `config/service-${userId}/settings.json`;

    const res = http.put(
        `${__ENV.BASE_URL}/api/apps/${__ENV.APP_ID}/workspaces/${workspaceId}/files`,
        JSON.stringify({ path: fileName, content: `{"vu": ${userId}, "iter": ${__ITER}}` }),
        { headers: { 'Authorization': `Bearer ${__ENV.TOKEN}`, 'Content-Type': 'application/json' } }
    );

    check(res, {
        'status is 200': (r) => r.status === 200,
        'response time < 2s': (r) => r.timings.duration < 2000,
    });

    sleep(1);
}

9.2 Fault test: gitaly goes down, API degrades gracefully (503 health, 502 git operations)

#[tokio::test]
async fn gitaly_down_returns_502_for_git_operations() {
    // Start mock gitaly that immediately drops connections.
    let mock_gitaly = MockGitalyServer::start_refusing_connections().await;
    let app = test_app_with_gitaly(mock_gitaly.address()).await;

    // The health endpoint should still respond, reporting gitaly as unhealthy
    // and returning 503 overall.
    let health_res = app
        .clone()
        .oneshot(Request::builder().uri("/api/health").body(Body::empty()).unwrap())
        .await
        .unwrap();
    assert_eq!(health_res.status(), StatusCode::SERVICE_UNAVAILABLE);
    let body: serde_json::Value = parse_body(health_res).await;
    assert_eq!(body["status"], "degraded");

    let gitaly_component = body["components"]
        .as_array()
        .unwrap()
        .iter()
        .find(|c| c["name"] == "gitaly")
        .unwrap();
    assert_eq!(gitaly_component["status"], "unhealthy");

    // Git operation should return 502.
    let file_res = app
        .oneshot(
            Request::builder()
                .uri("/api/apps/test-app/workspaces/ws-1/files?path=config.json")
                .header("Authorization", "Bearer valid-token")
                .body(Body::empty())
                .unwrap(),
        )
        .await
        .unwrap();
    assert_eq!(file_res.status(), StatusCode::BAD_GATEWAY);

    let err_body: serde_json::Value = parse_body(file_res).await;
    assert_eq!(err_body["error"]["code"], "git_error");
}

9.3 Fault test: MongoDB primary failover, operations resume

#[tokio::test]
async fn mongo_failover_recovers_automatically() {
    // Requires a 3-node replica set (testcontainers or local).
    let rs = MongoReplicaSet::start(3).await;
    let app = test_app_with_mongo(rs.connection_string()).await;

    // Verify initial connectivity.
    let res = app.clone().oneshot(health_request()).await.unwrap();
    assert_eq!(res.status(), StatusCode::OK);

    // Force primary step-down.
    rs.step_down_primary().await;

    // Requests may fail briefly during election.
    tokio::time::sleep(Duration::from_secs(2)).await;

    // After election completes, operations should resume.
    // Retry for up to 15 seconds.
    let mut recovered = false;
    for _ in 0..15 {
        let res = app.clone().oneshot(health_request()).await.unwrap();
        if res.status() == StatusCode::OK {
            recovered = true;
            break;
        }
        tokio::time::sleep(Duration::from_secs(1)).await;
    }
    assert!(recovered, "Server did not recover after MongoDB failover within 15s");
}

9.4 Fault test: job worker crashes, job re-picked on restart

#[tokio::test]
async fn crashed_job_is_retried_after_lock_expiry() {
    let db = test_mongo_db().await;
    let job_repo = JobRepo::new(db.clone());

    // Insert a job in "running" state with a stale lock (expired 1 minute ago).
    let stale_job = Job {
        id: ObjectId::new().to_hex(),
        job_type: "msuite_submit".to_string(),
        state: "running".to_string(),
        locked_until: Some(Utc::now() - chrono::Duration::seconds(60)),
        attempts: 1,
        max_attempts: 3,
        ..test_job()
    };
    job_repo.insert(&stale_job).await.unwrap();

    // Start a new job runner instance.
    let runner = JobRunner::new(db.clone(), mock_workers());
    let picked = runner.poll_next_job().await.unwrap();

    // The stale job should be picked up for retry.
    let picked = picked.expect("stale job should be re-picked after lock expiry");
    assert_eq!(picked.id, stale_job.id);
    assert_eq!(picked.attempts, 2);
}

9.5 Security test: NoSQL injection attempts blocked

#[tokio::test]
async fn nosql_injection_in_query_param_rejected() {
    let app = test_app().await;

    // Attempt MongoDB operator injection via query parameter.
    let res = app
        .oneshot(
            Request::builder()
                .uri("/api/apps?name[$gt]=")
                .header("Authorization", "Bearer valid-token")
                .body(Body::empty())
                .unwrap(),
        )
        .await
        .unwrap();

    // Should not return all apps -- either 400 (bad param) or empty results.
    // Must NOT return documents where name > "".
    if res.status() != StatusCode::BAD_REQUEST {
        let body: serde_json::Value = parse_body(res).await;
        assert!(body["data"].as_array().map_or(true, |arr| arr.is_empty()));
    }
}

#[tokio::test]
async fn nosql_injection_in_json_body_rejected() {
    let app = test_app().await;

    // Attempt operator injection in JSON body.
    // Serde's typed deserialization rejects objects where a String is expected.
    let res = app
        .oneshot(
            Request::builder()
                .method("POST")
                .uri("/api/apps")
                .header("Authorization", "Bearer admin-token")
                .header("Content-Type", "application/json")
                .body(Body::from(r#"{"name": {"$gt": ""}, "repo_path": "test.git"}"#))
                .unwrap(),
        )
        .await
        .unwrap();

    // Serde deserialization fails: name expects a string, not an object.
    assert_eq!(res.status(), StatusCode::BAD_REQUEST);
}

9.6 Security test: path traversal in file operations blocked

#[tokio::test]
async fn path_traversal_blocked() {
    let app = test_app_with_workspace().await;

    let traversal_paths = vec![
        "../../etc/passwd",
        "..%2F..%2Fetc%2Fpasswd",
        "config/../../../etc/shadow",
        "/etc/passwd",
        "config/../../.git/config",
    ];

    for path in traversal_paths {
        let res = app
            .clone()
            .oneshot(
                Request::builder()
                    .uri(&format!(
                        "/api/apps/test-app/workspaces/ws-1/files?path={}",
                        path
                    ))
                    .header("Authorization", "Bearer valid-token")
                    .body(Body::empty())
                    .unwrap(),
            )
            .await
            .unwrap();

        assert!(
            res.status() == StatusCode::BAD_REQUEST || res.status() == StatusCode::FORBIDDEN,
            "Path traversal not blocked for: {path}"
        );
    }
}

9.7 Security test: expired JWT rejected

#[tokio::test]
async fn expired_jwt_rejected() {
    let app = test_app().await;

    // Generate a JWT that expired 1 hour ago.
    let expired_token = issue_test_jwt(
        "user@example.com",
        Utc::now() - chrono::Duration::hours(1),
    );

    let res = app
        .oneshot(
            Request::builder()
                .uri("/api/apps")
                .header("Authorization", format!("Bearer {expired_token}"))
                .body(Body::empty())
                .unwrap(),
        )
        .await
        .unwrap();

    assert_eq!(res.status(), StatusCode::UNAUTHORIZED);
    let body: serde_json::Value = parse_body(res).await;
    assert_eq!(body["error"]["code"], "unauthorized");
}

9.8 Security test: role escalation attempts blocked

#[tokio::test]
async fn user_cannot_escalate_own_role() {
    let app = test_app_with_membership("user@test.com", Role::User).await;

    // Attempt to update own role to app_admin via the membership API.
    let res = app
        .oneshot(
            Request::builder()
                .method("PATCH")
                .uri("/api/apps/test-app/members/self")
                .header("Authorization", "Bearer user-token")
                .header("Content-Type", "application/json")
                .body(Body::from(r#"{"role": "app_admin"}"#))
                .unwrap(),
        )
        .await
        .unwrap();

    assert_eq!(res.status(), StatusCode::FORBIDDEN);
}

#[tokio::test]
async fn reviewer_cannot_manage_settings() {
    let app = test_app_with_membership("reviewer@test.com", Role::Reviewer).await;

    let res = app
        .oneshot(
            Request::builder()
                .method("PATCH")
                .uri("/api/apps/test-app/settings")
                .header("Authorization", "Bearer reviewer-token")
                .header("Content-Type", "application/json")
                .body(Body::from(r#"{"baseline_mode": "integration_head"}"#))
                .unwrap(),
        )
        .await
        .unwrap();

    assert_eq!(res.status(), StatusCode::FORBIDDEN);
}

10. Acceptance Criteria

Go-Live Checklist

Every item must be verified and signed off before production deployment.

Performance:

Resilience:

Observability:

Operational:

Security:

No P0 blockers remaining in the issue tracker.