E12 Hardening and Launch Readiness
1. Goal
Stabilize Conman for production rollout. This epic covers load testing, fault injection, SLO definition, operational dashboards, runbooks, rate limiting, and a security checklist. Unlike previous epics, E12 produces mostly tests, configuration, documentation, and observability infrastructure rather than new domain features.
After this epic, the team has concrete evidence that the system handles production-scale traffic and degrades gracefully under failure, and operators have the runbooks and alerts needed to respond to incidents.
Issues:
- E12-01: Load and performance testing against large real repositories.
- E12-02: Fault-injection tests for Git adapter and job worker crashes.
- E12-03: SLO definitions and operational dashboards (queue depth, job latency, deployment success rate).
- E12-04: Runbooks for release failure, revalidation storms, temp env cleanup.
- E12-05: Security checklist (password policy, token expiry, RBAC tests, input validation).
- E12-06: Runtime-profile secret encryption and manual key-rotation drills.
2. Dependencies
| Dependency | What it provides |
|---|---|
| E08 Releases | Release assembly, composition, tagging, and publish flows to load test |
| E09 Deployments | Deploy, promote, skip-stage, rollback flows to fault-test |
| E10 Temp Environments | Temp env lifecycle to verify cleanup under failure |
| E11 Notifications & Audit | Audit completeness and notification delivery to verify under load |
All prior epics (E00-E11) must be functionally complete before E12 begins. E12 validates the entire system as an integrated whole.
3. Rust Types
3.1 Metrics Registry (conman-api/src/metrics.rs)
Integrates with the metrics crate for lightweight, Prometheus-compatible instrumentation. The metrics crate provides a facade; the metrics-exporter-prometheus crate provides the Prometheus text format exporter.
use metrics::{counter, gauge, histogram};
use metrics_exporter_prometheus::{PrometheusBuilder, PrometheusHandle};
/// Initialize the global metrics recorder and return a handle for the
/// Prometheus scrape endpoint.
///
/// Called once during server startup. Subsequent calls to `counter!`,
/// `gauge!`, and `histogram!` anywhere in the codebase will record into
/// this registry.
pub fn init_metrics() -> PrometheusHandle {
PrometheusBuilder::new()
.install_recorder()
.expect("failed to install Prometheus metrics recorder")
}
// ── Metric names (constants prevent typos across crates) ──
/// Total HTTP requests received, labeled by method, path pattern, and status.
pub const HTTP_REQUESTS_TOTAL: &str = "conman_http_requests_total";
/// HTTP request duration in seconds, labeled by method and path pattern.
pub const HTTP_REQUEST_DURATION_SECONDS: &str = "conman_http_request_duration_seconds";
/// Total jobs enqueued, labeled by job type.
pub const JOBS_ENQUEUED_TOTAL: &str = "conman_jobs_enqueued_total";
/// Total jobs completed, labeled by job type and outcome (succeeded/failed).
pub const JOBS_COMPLETED_TOTAL: &str = "conman_jobs_completed_total";
/// Job processing duration in seconds, labeled by job type.
pub const JOB_DURATION_SECONDS: &str = "conman_job_duration_seconds";
/// Current number of jobs in `queued` state, labeled by job type.
pub const JOB_QUEUE_DEPTH: &str = "conman_job_queue_depth";
/// Total deployments attempted, labeled by outcome (succeeded/failed/canceled).
pub const DEPLOYMENTS_TOTAL: &str = "conman_deployments_total";
/// Total gitaly gRPC calls, labeled by method and outcome.
pub const GITALY_CALLS_TOTAL: &str = "conman_gitaly_calls_total";
/// Gitaly gRPC call duration in seconds, labeled by method.
pub const GITALY_CALL_DURATION_SECONDS: &str = "conman_gitaly_call_duration_seconds";
/// Total authentication failures (bad password, expired token, etc.).
pub const AUTH_FAILURES_TOTAL: &str = "conman_auth_failures_total";
/// Total rate-limited requests.
pub const RATE_LIMITED_TOTAL: &str = "conman_rate_limited_total";
3.2 HTTP Metrics Middleware (conman-api/src/middleware/metrics.rs)
Records request count and duration for every HTTP request.
use axum::{extract::Request, middleware::Next, response::Response};
use metrics::{counter, histogram};
use std::time::Instant;
use crate::metrics::{HTTP_REQUESTS_TOTAL, HTTP_REQUEST_DURATION_SECONDS};
/// Middleware that records HTTP request count and latency.
///
/// Labels: `method`, `path` (matched route pattern, not raw URL), `status`.
/// Must be applied as an outer layer so it captures the full request lifecycle.
pub async fn http_metrics_middleware(req: Request, next: Next) -> Response {
let method = req.method().to_string();
// Use the matched path pattern (e.g. "/api/apps/:appId") to avoid
// high-cardinality labels from path parameters.
let path = req
.extensions()
.get::<axum::extract::MatchedPath>()
.map(|mp| mp.as_str().to_string())
.unwrap_or_else(|| "unknown".to_string());
let start = Instant::now();
let response = next.run(req).await;
let duration = start.elapsed().as_secs_f64();
let status = response.status().as_u16().to_string();
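// `counter!` returns a handle; the increment must be recorded explicitly,
// matching the `histogram!(...).record(...)` style used below.
counter!(HTTP_REQUESTS_TOTAL, "method" => method.clone(), "path" => path.clone(), "status" => status).increment(1);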
histogram!(HTTP_REQUEST_DURATION_SECONDS, "method" => method, "path" => path).record(duration);
response
}
3.3 Enhanced Health Check (conman-api/src/handlers/health.rs)
Extends the E00 health endpoint with per-component status and version metadata.
use axum::extract::State;
use axum::http::StatusCode;
use axum::Json;
use serde::Serialize;
use crate::state::AppState;
/// Component-level health status.
#[derive(Debug, Clone, Serialize)]
#[serde(rename_all = "snake_case")]
pub enum ComponentStatus {
/// Component is reachable and responding within acceptable latency.
Healthy,
/// Component is reachable but degraded (e.g. high latency, replica lag).
Degraded,
/// Component is unreachable or returning errors.
Unhealthy,
}
/// Individual component health report.
#[derive(Debug, Clone, Serialize)]
pub struct ComponentHealth {
pub name: &'static str,
pub status: ComponentStatus,
/// Human-readable detail (e.g. "ping: 2ms", "connection refused").
#[serde(skip_serializing_if = "Option::is_none")]
pub detail: Option<String>,
}
/// Enhanced health response with component-level breakdown.
#[derive(Debug, Clone, Serialize)]
pub struct HealthResponse {
/// Overall status: "ok" if all components healthy, "degraded" otherwise.
pub status: &'static str,
/// Application version from compile-time env.
pub version: &'static str,
/// Individual component checks.
pub components: Vec<ComponentHealth>,
}
/// GET /api/health
///
/// Returns detailed health status for each dependency. Returns 200 when all
/// components are healthy, 503 when any component is unhealthy. Does not
/// require authentication.
pub async fn health_check(State(state): State<AppState>) -> (StatusCode, Json<HealthResponse>) {
let mut components = Vec::with_capacity(3);
let mut all_healthy = true;
// Check MongoDB connectivity.
let mongo_health = match check_mongo(&state).await {
Ok(detail) => ComponentHealth {
name: "mongodb",
status: ComponentStatus::Healthy,
detail: Some(detail),
},
Err(detail) => {
all_healthy = false;
ComponentHealth {
name: "mongodb",
status: ComponentStatus::Unhealthy,
detail: Some(detail),
}
}
};
components.push(mongo_health);
// Check Gitaly gRPC channel.
let gitaly_health = match check_gitaly(&state).await {
Ok(detail) => ComponentHealth {
name: "gitaly",
status: ComponentStatus::Healthy,
detail: Some(detail),
},
Err(detail) => {
all_healthy = false;
ComponentHealth {
name: "gitaly",
status: ComponentStatus::Unhealthy,
detail: Some(detail),
}
}
};
components.push(gitaly_health);
// Check job runner liveness.
let job_runner_health = match check_job_runner(&state).await {
Ok(detail) => ComponentHealth {
name: "job_runner",
status: ComponentStatus::Healthy,
detail: Some(detail),
},
Err(detail) => {
all_healthy = false;
ComponentHealth {
name: "job_runner",
status: ComponentStatus::Unhealthy,
detail: Some(detail),
}
}
};
components.push(job_runner_health);
let status = if all_healthy { "ok" } else { "degraded" };
let http_status = if all_healthy {
StatusCode::OK
} else {
StatusCode::SERVICE_UNAVAILABLE
};
(
http_status,
Json(HealthResponse {
status,
version: env!("CARGO_PKG_VERSION"),
components,
}),
)
}
/// Ping MongoDB and return round-trip time.
async fn check_mongo(state: &AppState) -> Result<String, String> {
let start = std::time::Instant::now();
conman_db::check_mongo_health(&state.db)
.await
.map(|_| format!("ping: {}ms", start.elapsed().as_millis()))
.map_err(|e| e.to_string())
}
/// Verify the Gitaly gRPC channel is connected.
async fn check_gitaly(state: &AppState) -> Result<String, String> {
match &state.gitaly_channel {
Some(_channel) => {
// Attempt a lightweight ServerInfo or similar RPC.
// For now, channel existence indicates the connection was established.
Ok("channel connected".to_string())
}
None => Err("channel not available".to_string()),
}
}
/// Verify the job runner is alive by checking its heartbeat timestamp.
async fn check_job_runner(state: &AppState) -> Result<String, String> {
// The job runner writes a heartbeat timestamp to a known MongoDB document.
// If the heartbeat is older than 60 seconds, the runner is considered unhealthy.
let _ = state;
// Implementation: query `job_runner_heartbeat` document from MongoDB.
// Placeholder -- will be wired when E06 job runner is available.
Ok("heartbeat current".to_string())
}
3.4 Rate Limiter (conman-api/src/middleware/rate_limit.rs)
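Per-user rate limiting backed by an in-memory store. The implementation below is a fixed-window counter (each user's allowance resets in full when the window expires) rather than a continuously refilling token bucket. This is sufficient for the single-instance v1 deployment; a Redis-backed implementation can replace the store later without changing the middleware.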
// `IntoResponse` is required for the `.into_response()` call in the 429 branch.
use axum::{extract::Request, http::StatusCode, middleware::Next, response::{IntoResponse, Response}};
use dashmap::DashMap;
use std::sync::Arc;
use std::time::{Duration, Instant};
use crate::response::{ApiError, ApiErrorBody};
use crate::request_context::RequestContext;
/// Configuration for the rate limiter.
#[derive(Debug, Clone)]
pub struct RateLimitConfig {
/// Maximum requests per window per user.
pub max_requests: u64,
/// Window duration.
pub window: Duration,
}
impl Default for RateLimitConfig {
fn default() -> Self {
Self {
max_requests: 100,
window: Duration::from_secs(60),
}
}
}
/// Per-user token bucket entry.
#[derive(Debug, Clone)]
struct BucketEntry {
remaining: u64,
window_start: Instant,
}
/// In-memory rate limit store. One entry per authenticated user.
///
/// Thread-safe via `DashMap`. Entries are lazily evicted when accessed
/// past their window.
#[derive(Debug, Clone)]
pub struct RateLimitStore {
buckets: Arc<DashMap<String, BucketEntry>>,
config: RateLimitConfig,
}
impl RateLimitStore {
pub fn new(config: RateLimitConfig) -> Self {
Self {
buckets: Arc::new(DashMap::new()),
config,
}
}
/// Attempt to consume one token for the given user.
/// Returns `Ok(remaining)` if allowed, `Err(())` if rate limited.
pub fn check(&self, user_id: &str) -> Result<u64, ()> {
let now = Instant::now();
let mut entry = self.buckets.entry(user_id.to_string()).or_insert_with(|| {
BucketEntry {
remaining: self.config.max_requests,
window_start: now,
}
});
// Reset window if expired.
if now.duration_since(entry.window_start) >= self.config.window {
entry.remaining = self.config.max_requests;
entry.window_start = now;
}
if entry.remaining == 0 {
return Err(());
}
entry.remaining -= 1;
Ok(entry.remaining)
}
}
/// Rate limiting middleware.
///
/// Extracts the authenticated user ID from request extensions (set by the
/// auth middleware). Unauthenticated requests are not rate-limited here
/// (they are rejected by the auth middleware first).
///
/// Returns 429 Too Many Requests when the limit is exceeded.
pub async fn rate_limit_middleware(
req: Request,
next: Next,
store: RateLimitStore,
) -> Response {
// Extract user ID from auth context if present.
let user_id = req
.extensions()
.get::<crate::auth::AuthUser>()
.map(|u| u.user_id.to_string());
if let Some(uid) = user_id {
match store.check(&uid) {
Ok(remaining) => {
let mut response = next.run(req).await;
// Attach rate limit headers for client awareness.
if let Ok(val) = remaining.to_string().parse() {
response.headers_mut().insert("X-RateLimit-Remaining", val);
}
response
}
Err(()) => {
metrics::counter!(
crate::metrics::RATE_LIMITED_TOTAL,
"user_id" => uid
)
.increment(1);
let body = ApiError {
error: ApiErrorBody {
code: "rate_limited",
message: "Too many requests. Please wait and try again.".to_string(),
request_id: RequestContext::current_request_id(),
},
};
(StatusCode::TOO_MANY_REQUESTS, axum::Json(body)).into_response()
}
}
} else {
// No authenticated user -- skip rate limiting.
next.run(req).await
}
}
4. Database
4.1 Index Review
E12 does not introduce new collections. Instead, it audits all existing indexes across every collection for query performance under load. The following index review must be completed and verified:
| Collection | Index | Purpose | Type |
|---|---|---|---|
| apps | { name: 1 } | App lookup by name | unique |
| apps | { repo_path: 1 } | App lookup by repo path | unique |
| app_memberships | { user_id: 1, app_id: 1 } | Membership lookup | unique compound |
| app_memberships | { app_id: 1 } | List members of an app | standard |
| workspaces | { app_id: 1, owner_user_id: 1 } | User's workspaces per app | compound |
| workspaces | { app_id: 1, branch_name: 1 } | Branch uniqueness per app | unique compound |
| changesets | { app_id: 1, state: 1 } | List changesets by state (queue view) | compound |
| changesets | { workspace_id: 1, state: 1 } | One open changeset per workspace | compound |
| changesets | { app_id: 1, author_user_id: 1 } | User's changesets per app | compound |
| changeset_revisions | { changeset_id: 1, revision_number: 1 } | Revision lookup | unique compound |
| changeset_reviews | { changeset_id: 1 } | Reviews for a changeset | standard |
| changeset_comments | { changeset_id: 1, created_at: 1 } | Paginated comment listing | compound |
| release_batches | { app_id: 1, state: 1 } | Releases by state | compound |
| release_batches | { app_id: 1, tag: 1 } | Release lookup by tag | unique compound |
| release_changesets | { release_id: 1 } | Changesets in a release | standard |
| environments | { app_id: 1, name: 1 } | Env name uniqueness per app | unique compound |
| environments | { app_id: 1, position: 1 } | Env position uniqueness per app | unique compound |
| deployments | { app_id: 1, environment_id: 1, state: 1 } | Active deployment lock per env | compound |
| deployments | { release_id: 1 } | Deployments for a release | standard |
| temp_environments | { app_id: 1, state: 1 } | Active temp envs per app | compound |
| temp_environments | { expires_at: 1 } | TTL expiry scan (job runner) | standard |
| jobs | { state: 1, created_at: 1 } | Job polling (FIFO by state) | compound |
| jobs | { app_id: 1, type: 1, state: 1 } | Job lookup by app and type | compound |
| audit_events | { app_id: 1, occurred_at: -1 } | Audit timeline per app | compound |
| audit_events | { entity_type: 1, entity_id: 1 } | Audit for specific entity | compound |
| notification_preferences | { user_id: 1 } | User prefs lookup | unique |
| invites | { app_id: 1, email: 1 } | Invite uniqueness per app | unique compound |
| invites | { token: 1 } | Invite acceptance lookup | unique |
Action items:
- Run db.collection.getIndexes() for every collection and compare against the table above.
- Run db.collection.aggregate([{$indexStats:{}}]) to identify unused indexes.
- For collections expected to exceed 1M documents in production (audit_events, jobs, changeset_comments), verify that common query patterns use index scans, not collection scans. Use explain("executionStats") on representative queries.
4.2 Read Preference and Write Concern
For production MongoDB replica set deployment:
| Operation type | Read preference | Write concern | Rationale |
|---|---|---|---|
| Health check ping | primaryPreferred | n/a | Tolerate primary failover for health |
| Reads (listings, detail) | secondaryPreferred | n/a | Spread read load; accept slight staleness |
| Writes (mutations) | n/a | { w: "majority", j: true } | Durability: acknowledged by majority with journal |
| Job polling | primary | { w: "majority" } | Avoid duplicate job pickup during failover |
| Audit event writes | n/a | { w: 1, j: false } | Fire-and-forget; acceptable to lose rare event under crash |
These should be configured per-operation, not globally, using the MongoDB driver's ReadPreference and WriteConcern options on individual collection handles or operation options.
4.3 Backup Strategy
- Frequency: Automated daily full backup via mongodump or cloud provider snapshot (Atlas continuous backup if using Atlas).
- Retention: 30 days of daily backups, 7 days of oplog for point-in-time recovery.
- Restore testing: Monthly restore drill to a staging environment. Document restore time (target: < 30 minutes for databases under 10 GB).
- Oplog sizing: Ensure oplog window covers at least 24 hours of write activity so replica resync does not require full initial sync.
5. API Endpoints
5.1 GET /api/health (enhanced)
Replaces the E00 basic health check with component-level status.
| Attribute | Value |
|---|---|
| Auth | None (public) |
| Rate limit | Exempt |
Response 200 (all components healthy):
{
"status": "ok",
"version": "0.1.0",
"components": [
{ "name": "mongodb", "status": "healthy", "detail": "ping: 2ms" },
{ "name": "gitaly", "status": "healthy", "detail": "channel connected" },
{ "name": "job_runner", "status": "healthy", "detail": "heartbeat current" }
]
}
Response 503 (one or more components unhealthy):
{
"status": "degraded",
"version": "0.1.0",
"components": [
{ "name": "mongodb", "status": "healthy", "detail": "ping: 3ms" },
{ "name": "gitaly", "status": "unhealthy", "detail": "channel not available" },
{ "name": "job_runner", "status": "healthy", "detail": "heartbeat current" }
]
}
5.2 GET /api/metrics (Prometheus scrape endpoint)
| Attribute | Value |
|---|---|
| Auth | None (should be network-restricted in production via firewall/ingress rules) |
| Rate limit | Exempt |
| Content-Type | text/plain; version=0.0.4; charset=utf-8 |
Response 200:
# HELP conman_http_requests_total Total HTTP requests received.
# TYPE conman_http_requests_total counter
conman_http_requests_total{method="GET",path="/api/apps",status="200"} 1042
conman_http_requests_total{method="POST",path="/api/apps/:appId/changesets",status="201"} 87
# HELP conman_http_request_duration_seconds HTTP request duration.
# TYPE conman_http_request_duration_seconds histogram
conman_http_request_duration_seconds_bucket{method="GET",path="/api/apps",le="0.1"} 980
...
# HELP conman_job_queue_depth Current queued jobs.
# TYPE conman_job_queue_depth gauge
conman_job_queue_depth{type="revalidate_queued_changeset"} 3
conman_job_queue_depth{type="deploy_release"} 0
...
Handler implementation:
use axum::extract::State;
use axum::response::IntoResponse;
use metrics_exporter_prometheus::PrometheusHandle;
/// GET /api/metrics
///
/// Returns metrics in Prometheus text exposition format. Not authenticated
/// -- restrict access via network policy in production.
pub async fn metrics_endpoint(
State(handle): State<PrometheusHandle>,
) -> impl IntoResponse {
// Render the current snapshot of all registered metrics as text.
handle.render()
}
6. Business Logic
6.1 Load Test Scenarios
All load tests use a dedicated test environment with a populated MongoDB and a Gitaly instance backed by realistic repository data (not empty repos).
| # | Scenario | Parameters | Target | Tool |
|---|---|---|---|---|
| L1 | Concurrent file edits | 50 users, each editing 5 files in their own workspace | All 250 edits succeed within 2s per request | k6 or drill |
| L2 | Changeset submission storm | 50 concurrent changeset submissions, each triggering msuite_submit job | All 50 submissions accepted, jobs enqueued within 1s | k6 |
| L3 | Queue with 100+ changesets | Seed 150 queued changesets, then publish a release of 10 | Post-publish revalidation of remaining 140 completes within 10 minutes | custom Rust test harness |
| L4 | Rapid release cycle | 5 releases published sequentially with 60s gap, each with 5 changesets | No data corruption, all revalidation loops complete, no orphaned jobs | custom Rust test harness |
| L5 | Large repository operations | Repository with 10,000+ files across 500+ directories. Perform tree listing, file read, diff operations | Tree listing < 3s, single file read < 500ms, diff < 5s | k6 |
| L6 | Deployment pipeline | 10 concurrent deploy requests across different environments for different apps | Each deployment runs to completion, environment locks enforced, no double-deploys | k6 |
| L7 | API listing under load | 50 concurrent requests to GET /api/apps/:appId/changesets?state=queued with 500 changesets in DB | p99 response time < 500ms, no timeouts | k6 |
6.2 Fault Injection Scenarios
| # | Scenario | Injection method | Expected behavior |
|---|---|---|---|
| F1 | Gitaly connection drop | Kill gitaly process (or iptables drop) mid-request | API returns 502 git_error. Retry logic in GitalyClient attempts 3 retries with backoff. Non-git endpoints remain operational. Health endpoint reports gitaly unhealthy. |
| F2 | Gitaly slow response | Inject 10s delay on gitaly responses (tc netem or proxy) | Requests with git operations time out at configured deadline (default 30s). Client receives 504 or 502. Non-git endpoints unaffected. |
| F3 | MongoDB primary failover | rs.stepDown() on primary | Writes fail briefly during election (typically 2-10s). Health endpoint returns 503 during election. After new primary elected, operations resume automatically. No data loss for majority-acknowledged writes. |
| F4 | MongoDB full outage | Stop all replica set members | All API endpoints return 500/503. Health endpoint returns 503 with mongodb unhealthy. Server does not crash. Recovery is automatic when MongoDB comes back. |
| F5 | Job worker crash mid-execution | kill -9 the process while a job is in running state | Job remains in running state with stale locked_until. Job runner picks it up after lock expiry (configurable, default 5 minutes). Job is retried. Idempotency ensures no duplicate side effects. |
| F6 | Job worker crash during revalidation storm | Kill worker while 50+ revalidation jobs are in progress | All in-progress jobs are re-picked after lock expiry. Remaining queued jobs are processed normally. No changeset is left in an inconsistent state. |
| F7 | Network partition between API and job runner | Block network access to MongoDB for the job runner only (the API server keeps its connection) | API continues serving reads from secondary. Job runner stops picking jobs. Health endpoint shows job_runner degraded. When partition heals, job runner resumes. |
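Scenarios F5 and F6 both depend on a crashed worker's jobs becoming eligible again once their lock expires. The following is a minimal sketch of that eligibility rule, not the actual job runner code: `Job`, `JobState`, and `is_reclaimable` are hypothetical names, timestamps are simplified to Unix seconds, and treating a running job with no lock timestamp as stale is an assumption.

```rust
/// Hypothetical job states; only the two relevant to lock recovery are modelled.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JobState {
    Queued,
    Running,
}

/// Hypothetical, simplified job record. Timestamps are Unix seconds.
#[derive(Debug, Clone)]
pub struct Job {
    pub state: JobState,
    /// Lock expiry written by the worker that claimed the job.
    pub locked_until: Option<u64>,
}

/// Default lock duration from scenario F5 (5 minutes).
pub const DEFAULT_LOCK_SECS: u64 = 5 * 60;

/// A job may be picked up if it is queued, or if it is still marked running
/// but its lock has expired (the worker that held it presumably crashed).
pub fn is_reclaimable(job: &Job, now: u64) -> bool {
    match (job.state, job.locked_until) {
        (JobState::Queued, _) => true,
        (JobState::Running, Some(until)) => now >= until,
        // Assumption: a running job without a lock timestamp is treated as stale.
        (JobState::Running, None) => true,
    }
}
```

A fault-injection test can then assert that a job killed mid-run is not re-picked before `locked_until` and is re-picked at or after it, which is the observable behavior F5 describes.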
6.3 SLO Definitions
These SLOs apply to the production deployment. They are measured over a rolling 30-day window.
| SLO | Metric | Target | Measurement |
|---|---|---|---|
| API availability | Successful (non-5xx) responses / total responses | >= 99.9% | Prometheus: sum(rate(conman_http_requests_total{status!~"5.."}[30d])) / sum(rate(conman_http_requests_total[30d])) |
| API latency (p99) | 99th percentile response time for non-background endpoints | < 500ms | Prometheus: histogram_quantile(0.99, sum by (le) (rate(conman_http_request_duration_seconds_bucket[5m]))) |
| Job processing (p99) | 99th percentile time from job enqueue to completion | < 30s | Custom metric: conman_job_duration_seconds |
| Job processing (p99) for deployments | 99th percentile deploy job duration | < 120s | conman_job_duration_seconds{type="deploy_release"} |
| Deployment success rate | Succeeded deployments / total non-canceled deployments | >= 99% | Prometheus: sum(rate(conman_deployments_total{outcome="succeeded"}[30d])) / sum(rate(conman_deployments_total{outcome!="canceled"}[30d])) |
| Revalidation turnaround | Time from release publish to all queued changeset revalidations complete | < 10 minutes for 100 queued changesets | Custom metric with event timestamps |
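To make the availability target concrete: 99.9% over a 30-day window leaves an error budget of 0.1% of 43,200 minutes, which is 43.2 minutes. The sketch below works through that arithmetic; the function names are illustrative and not part of the Conman codebase.

```rust
/// Minutes of allowed unavailability for an availability target over a window.
pub fn error_budget_minutes(target: f64, window_days: u32) -> f64 {
    let window_minutes = f64::from(window_days) * 24.0 * 60.0;
    (1.0 - target) * window_minutes
}

/// Fraction of the error budget consumed, given observed request counts.
/// Returns 0.0 when there was no traffic in the window.
pub fn budget_consumed(target: f64, total: u64, failed_5xx: u64) -> f64 {
    if total == 0 {
        return 0.0;
    }
    // Number of failed requests the target tolerates for this much traffic.
    let allowed_failures = (1.0 - target) * total as f64;
    failed_5xx as f64 / allowed_failures
}
```

For example, 500 failed requests out of 1,000,000 against a 99.9% target consumes about half of the budget; a value above 1.0 means the SLO has been missed for the window.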
Alert thresholds (Prometheus alerting rules):
| Alert | Condition | Severity | Action |
|---|---|---|---|
| ConmanHighErrorRate | 5xx rate > 1% over 5 minutes | P1 | Page on-call |
| ConmanHighLatency | p99 latency > 1s over 5 minutes | P2 | Notify on-call |
| ConmanJobQueueBacklog | conman_job_queue_depth > 50 for any type for 10 minutes | P2 | Notify on-call, check job runner health |
| ConmanJobRunnerDown | Job runner heartbeat stale > 2 minutes | P1 | Page on-call, restart job runner |
| ConmanGitalyUnhealthy | Health check reports gitaly unhealthy for 2 minutes | P1 | Page on-call, check gitaly process |
| ConmanMongoUnhealthy | Health check reports mongodb unhealthy for 1 minute | P1 | Page on-call, check replica set |
| ConmanDeploymentFailure | Any deployment enters failed state | P2 | Notify config manager and on-call |
| ConmanTempEnvLeaking | Temp environments in expired state with grace_until in the past > 1 hour | P3 | Investigate cleanup job |
6.4 Rate Limiting
Per-user rate limits applied after authentication middleware:
| Scope | Limit | Window | Notes |
|---|---|---|---|
| Global per-user | 100 requests | 60 seconds | Applies to all authenticated endpoints |
| Write endpoints (POST/PUT/PATCH/DELETE) | 30 requests | 60 seconds | Prevents mutation storms |
| Auth endpoints (/api/auth/*) | 10 requests | 60 seconds | Brute-force protection (per IP, not per user) |
Rate limit response (HTTP 429):
{
"error": {
"code": "rate_limited",
"message": "Too many requests. Please wait and try again.",
"request_id": "018f2f35-2e63-7b3b-b5e1-9f0d3a2c4b10"
}
}
Response headers on all authenticated requests:
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 73
X-RateLimit-Reset: 1719849600
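The scope table above can be read as a small lookup from request method and path to the limits that apply. The sketch below illustrates one possible reading; it assumes that the write limit applies in addition to the global limit and that X-RateLimit-Reset carries the Unix time at which the current window ends, neither of which the table states explicitly. `Limit`, `LimitKey`, `applicable_limits`, and `reset_at` are hypothetical names.

```rust
/// What the counter is keyed on.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LimitKey {
    PerUser,
    PerIp,
}

/// One row of the rate limit table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Limit {
    pub scope: &'static str,
    pub max_requests: u64,
    pub window_secs: u64,
    pub key: LimitKey,
}

const GLOBAL: Limit = Limit { scope: "global", max_requests: 100, window_secs: 60, key: LimitKey::PerUser };
const WRITE: Limit = Limit { scope: "write", max_requests: 30, window_secs: 60, key: LimitKey::PerUser };
const AUTH: Limit = Limit { scope: "auth", max_requests: 10, window_secs: 60, key: LimitKey::PerIp };

/// Limits that apply to a request. Auth endpoints are keyed per IP because
/// the caller is not yet authenticated; everything else is keyed per user.
pub fn applicable_limits(method: &str, path: &str) -> Vec<Limit> {
    if path.starts_with("/api/auth/") {
        return vec![AUTH];
    }
    let mut limits = vec![GLOBAL];
    // Assumption: mutations count against both the global and the write limit.
    if matches!(method, "POST" | "PUT" | "PATCH" | "DELETE") {
        limits.push(WRITE);
    }
    limits
}

/// Value for X-RateLimit-Reset under the assumption that it is the Unix time
/// (in seconds) at which the current window ends.
pub fn reset_at(window_start_unix: u64, window_secs: u64) -> u64 {
    window_start_unix + window_secs
}
```

Keeping the table as data rather than branching inside the middleware makes it straightforward to assert each row in a unit test.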
6.5 Security Checklist
Every item must be verified with a passing test before launch.
Authentication:
| # | Item | Requirement | Verification |
|---|---|---|---|
| S1 | Password minimum length | >= 12 characters | Unit test in conman-auth |
| S2 | Password hashing algorithm | Argon2id with m_cost=19456, t_cost=2, p_cost=1 (OWASP recommendation) | Unit test verifying hash format |
| S3 | Password hash timing | Verification takes 100-500ms (slow enough to make brute-force guessing expensive while maintaining UX) | Benchmark test |
| S4 | JWT expiry | 24 hours (configurable via CONMAN_JWT_EXPIRY_HOURS) | Integration test: token issued, wait (or mock time), verify rejection |
| S5 | JWT secret strength | Minimum 32 bytes, validated at startup | Startup validation in Config::from_env() |
| S6 | Invite token expiry | 7 days (configurable via CONMAN_INVITE_EXPIRY_DAYS) | Integration test: expired invite rejected |
| S7 | Password reset token | Single-use, 1-hour expiry | Integration test: used token rejected on second use |
| S8 | Failed login throttling | After 5 failed attempts for same email, enforce 15-minute cooldown | Integration test |
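Item S8 can be illustrated with a small in-memory tracker. This is a sketch under stated assumptions rather than the conman-auth implementation: `LoginThrottle` and its methods are hypothetical, time is passed in as Unix seconds so tests can control it, and the cooldown is assumed to start at the most recent failed attempt.

```rust
use std::collections::HashMap;

/// Thresholds from checklist item S8.
pub const MAX_FAILED_ATTEMPTS: u32 = 5;
pub const COOLDOWN_SECS: u64 = 15 * 60;

/// Hypothetical in-memory tracker of failed logins per email.
#[derive(Debug, Default)]
pub struct LoginThrottle {
    /// email -> (consecutive failures, Unix time of the last failure)
    failures: HashMap<String, (u32, u64)>,
}

impl LoginThrottle {
    /// True while the email is inside its cooldown period.
    pub fn is_locked(&self, email: &str, now: u64) -> bool {
        match self.failures.get(email) {
            Some(&(count, last)) => count >= MAX_FAILED_ATTEMPTS && now < last + COOLDOWN_SECS,
            None => false,
        }
    }

    /// Record a failed attempt.
    pub fn record_failure(&mut self, email: &str, now: u64) {
        let entry = self.failures.entry(email.to_string()).or_insert((0, now));
        // Start a fresh count once a previous cooldown has fully elapsed.
        if entry.0 >= MAX_FAILED_ATTEMPTS && now >= entry.1 + COOLDOWN_SECS {
            entry.0 = 0;
        }
        entry.0 += 1;
        entry.1 = now;
    }

    /// A successful login clears the failure history for that email.
    pub fn record_success(&mut self, email: &str) {
        self.failures.remove(email);
    }
}
```

The S8 integration test would drive five failed logins, assert that the sixth is refused, then advance the (mocked) clock by 15 minutes and assert that login is possible again.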
Authorization (RBAC):
| # | Item | Requirement | Verification |
|---|---|---|---|
| S9 | user cannot approve changeset | Returns 403 | Integration test |
| S10 | user cannot assemble release | Returns 403 | Integration test |
| S11 | user cannot deploy | Returns 403 | Integration test |
| S12 | reviewer cannot assemble release | Returns 403 | Integration test |
| S13 | reviewer cannot manage settings | Returns 403 | Integration test |
| S14 | Non-member cannot access app | Returns 403 | Integration test |
| S15 | Role escalation via API | Sending role: "app_admin" in membership update as non-admin returns 403 | Integration test |
| S16 | Cross-app access | User with role on app A cannot access app B resources | Integration test |
Input validation:
| # | Item | Requirement | Verification |
|---|---|---|---|
| S17 | NoSQL injection in query params | ?name[$gt]= and similar MongoDB operator injection attempts are rejected or sanitized | Integration test |
| S18 | NoSQL injection in JSON body | {"name": {"$gt": ""}} rejected by type-safe deserialization (serde rejects objects where String expected) | Unit test |
| S19 | Path traversal in file operations | ../../etc/passwd and similar traversal in file path parameter is blocked | Integration test |
| S20 | Path traversal with encoded chars | ..%2F..%2Fetc%2Fpasswd is blocked | Integration test |
| S21 | Blocked path enforcement | Editing .git/config or .github/workflows/ci.yml returns 403 | Integration test |
| S22 | File size limit enforcement | Upload exceeding file_size_limit_bytes returns 400 | Integration test |
| S23 | Request body size limit | Request bodies > 10 MB rejected at middleware level | Integration test |
| S24 | XSS in changeset comments | HTML/script tags in comment body are stored as-is (no execution context in API-only backend) but validated for max length | Unit test |
| S25 | Branch name injection | Workspace branch names cannot contain .., leading -, or shell metacharacters | Unit test |
7. Gitaly-rs Integration
7.1 Connection Resilience Testing
The GitalyClient retry logic (introduced in E01) must be validated under adversarial conditions:
| Test | Setup | Expected |
|---|---|---|
| Retry on UNAVAILABLE | Mock gitaly returns UNAVAILABLE twice, then success | Operation succeeds on the 3rd attempt |
| Retry on DEADLINE_EXCEEDED | Mock gitaly returns DEADLINE_EXCEEDED once, then success | Operation succeeds on the 2nd attempt |
| No retry on NOT_FOUND | Mock gitaly returns NOT_FOUND | Operation fails immediately, no retry |
| No retry on INVALID_ARGUMENT | Mock gitaly returns INVALID_ARGUMENT | Operation fails immediately, no retry |
| Max retries exhausted | Mock gitaly returns UNAVAILABLE 4 times | Operation fails after 3 retries |
| Backoff timing | Mock gitaly returns UNAVAILABLE 3 times, measure delays | Delays follow exponential backoff: ~100ms, ~200ms, ~400ms (+/- jitter) |
| Channel reconnect after restart | Stop gitaly, wait 5s, restart gitaly, make request | Request succeeds (Tonic channel reconnects automatically) |
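The retry expectations in the table can be summarized as a small policy. The sketch below is an illustrative model, not the GitalyClient code: `GrpcCode`, `call_with_retry`, and the injected `sleep` callback are hypothetical, jitter is omitted, and the call is modelled as a synchronous closure so the policy can be exercised without a gRPC server.

```rust
use std::time::Duration;

/// Subset of gRPC status codes that the table above refers to.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GrpcCode {
    Ok,
    Unavailable,
    DeadlineExceeded,
    NotFound,
    InvalidArgument,
}

pub const MAX_RETRIES: u32 = 3;
pub const BASE_DELAY: Duration = Duration::from_millis(100);

/// Only transient failures are retried.
pub fn is_retryable(code: GrpcCode) -> bool {
    matches!(code, GrpcCode::Unavailable | GrpcCode::DeadlineExceeded)
}

/// Delay before retry number `retry` (1-based), without jitter:
/// 100ms, 200ms, 400ms.
pub fn backoff_delay(retry: u32) -> Duration {
    BASE_DELAY * 2u32.pow(retry.saturating_sub(1))
}

/// Drive `call` until it succeeds, fails with a non-retryable code, or the
/// retries run out. Returns the final code and the number of attempts made.
/// Sleeping is delegated to `sleep` so that tests can record the delays.
pub fn call_with_retry(
    mut call: impl FnMut() -> GrpcCode,
    mut sleep: impl FnMut(Duration),
) -> (GrpcCode, u32) {
    let mut attempts = 0;
    loop {
        let code = call();
        attempts += 1;
        let retries_used = attempts - 1;
        if code == GrpcCode::Ok || !is_retryable(code) || retries_used >= MAX_RETRIES {
            return (code, attempts);
        }
        sleep(backoff_delay(retries_used + 1));
    }
}
```

With this shape, the "max retries exhausted" row corresponds to four attempts in total (one initial call plus three retries), and the "backoff timing" row can be asserted by collecting the durations passed to `sleep`.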
7.2 Timeout Configuration
| Operation | Recommended timeout | Rationale |
|---|---|---|
| RepositoryExists / CreateRepository | 5s | Lightweight metadata operations |
| TreeEntry / GetBlob | 10s | File reads scale with file size |
| CommitDiff | 30s | Diffs on large changesets can be expensive |
| UserCommitFiles | 30s | Streaming writes for multi-file commits |
| MergeToRef / UserMergeBranch | 60s | Merge operations on large repos may be slow |
These timeouts should be configurable via environment variables:
CONMAN_GITALY_TIMEOUT_DEFAULT=10s
CONMAN_GITALY_TIMEOUT_DIFF=30s
CONMAN_GITALY_TIMEOUT_MERGE=60s
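The values above use a number with a unit suffix. A minimal sketch of how such values might be parsed is shown below; `parse_timeout` and `timeout_from_env` are hypothetical helpers, and support for the `ms` and `m` suffixes and for bare numbers is an assumption, since the examples only show seconds.

```rust
use std::time::Duration;

/// Parse values such as "10s", "500ms", or "2m". A bare number is read as
/// seconds. Returns `None` for anything else.
pub fn parse_timeout(raw: &str) -> Option<Duration> {
    let raw = raw.trim();
    // Split at the first non-digit character: digits, then unit suffix.
    let split = raw.find(|c: char| !c.is_ascii_digit()).unwrap_or(raw.len());
    let (digits, unit) = raw.split_at(split);
    let value: u64 = digits.parse().ok()?;
    match unit {
        "" | "s" => Some(Duration::from_secs(value)),
        "ms" => Some(Duration::from_millis(value)),
        "m" => value.checked_mul(60).map(Duration::from_secs),
        _ => None,
    }
}

/// Read a timeout from the environment, falling back to `default` when the
/// variable is unset or malformed.
pub fn timeout_from_env(var: &str, default: Duration) -> Duration {
    std::env::var(var)
        .ok()
        .and_then(|v| parse_timeout(&v))
        .unwrap_or(default)
}
```

Falling back silently keeps the sketch short; a production configuration loader may prefer to reject malformed values at startup so that a typo does not go unnoticed.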
8. Implementation Checklist
This epic is test-heavy and configuration-focused. Steps are organized by sub-issue rather than sequential commits.
E12-01: Load and Performance Testing
E12-02: Fault Injection Testing
E12-03: SLOs and Operational Dashboards
- Request rate and error rate
- p50/p95/p99 latency
- Job queue depth over time
- Job processing duration by type
- Deployment success/failure rate
- Gitaly call latency
E12-04: Runbooks
All runbooks are written as markdown in docs/runbooks/. Each follows the template: Trigger, Impact, Diagnosis, Resolution, Prevention.
- Trigger: release_assemble job fails
  - Steps: identify conflicting changeset, mark conflicted, retry without it
- Trigger: ConmanJobQueueBacklog alert fires after a release publish
  - Steps: check job runner health, scale if needed, monitor queue drain rate, pause revalidation if queue > 500 to prevent cascading load
- Trigger: ConmanTempEnvLeaking alert fires
  - Steps: check temp_env_expire jobs, manually expire stuck envs, verify grace period logic
- Trigger: ConmanGitalyUnhealthy alert fires
  - Steps: check gitaly process, check disk space, check gRPC connectivity, restart if needed, verify API recovery
- Trigger: ConmanMongoUnhealthy alert fires
  - Steps: check replica set status, verify new primary elected, check for write concern errors in logs, verify application recovery
- Trigger: ConmanDeploymentFailure alert fires
  - Steps: check deployment logs, identify root cause, decide between retry and rollback, execute remediation
- Trigger: spike in conman_auth_failures_total
  - Steps: check for brute-force attempts, verify JWT secret unchanged, check token expiry clock skew
E12-05: Security Hardening
9. Test Cases
9.1 Load test: 50 concurrent users editing files
// k6 script: tests/load/concurrent_edits.js
import http from 'k6/http';
import { check, sleep } from 'k6';
export const options = {
vus: 50,
duration: '2m',
thresholds: {
http_req_duration: ['p(99)<2000'], // p99 < 2s
http_req_failed: ['rate<0.01'], // < 1% failure rate
},
};
export default function () {
const userId = __VU;
const workspaceId = `ws-${userId}`;
const fileName = `config/service-${userId}/settings.json`;
const res = http.put(
`${__ENV.BASE_URL}/api/apps/${__ENV.APP_ID}/workspaces/${workspaceId}/files`,
JSON.stringify({ path: fileName, content: `{"vu": ${userId}, "iter": ${__ITER}}` }),
{ headers: { 'Authorization': `Bearer ${__ENV.TOKEN}`, 'Content-Type': 'application/json' } }
);
check(res, {
'status is 200': (r) => r.status === 200,
'response time < 2s': (r) => r.timings.duration < 2000,
});
sleep(1);
}
9.2 Fault test: gitaly goes down, health returns 503 and git operations return 502
#[tokio::test]
async fn gitaly_down_returns_502_for_git_operations() {
// Start mock gitaly that immediately drops connections.
let mock_gitaly = MockGitalyServer::start_refusing_connections().await;
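let app = test_app_with_gitaly(mock_gitaly.address()).await;
// `oneshot` consumes the service, so keep a clone for the second request.
let app_clone = app.clone();
// A non-git endpoint should still respond while gitaly is down.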
let health_res = app
.oneshot(Request::builder().uri("/api/health").body(Body::empty()).unwrap())
.await
.unwrap();
// Health endpoint reports gitaly as unhealthy but responds 503 overall.
assert_eq!(health_res.status(), StatusCode::SERVICE_UNAVAILABLE);
let body: serde_json::Value = parse_body(health_res).await;
assert_eq!(body["status"], "degraded");
let gitaly_component = body["components"]
.as_array()
.unwrap()
.iter()
.find(|c| c["name"] == "gitaly")
.unwrap();
assert_eq!(gitaly_component["status"], "unhealthy");
// Git operation should return 502.
let file_res = app_clone
.oneshot(
Request::builder()
.uri("/api/apps/test-app/workspaces/ws-1/files?path=config.json")
.header("Authorization", "Bearer valid-token")
.body(Body::empty())
.unwrap(),
)
.await
.unwrap();
assert_eq!(file_res.status(), StatusCode::BAD_GATEWAY);
let err_body: serde_json::Value = parse_body(file_res).await;
assert_eq!(err_body["error"]["code"], "git_error");
}
9.3 Fault test: MongoDB primary failover, operations resume
#[tokio::test]
async fn mongo_failover_recovers_automatically() {
// Requires a 3-node replica set (testcontainers or local).
let rs = MongoReplicaSet::start(3).await;
let app = test_app_with_mongo(rs.connection_string()).await;
// Verify initial connectivity.
let res = app.clone().oneshot(health_request()).await.unwrap();
assert_eq!(res.status(), StatusCode::OK);
// Force primary step-down.
rs.step_down_primary().await;
// Requests may fail briefly during election.
tokio::time::sleep(Duration::from_secs(2)).await;
// After election completes, operations should resume.
// Retry for up to 15 seconds.
let mut recovered = false;
for _ in 0..15 {
let res = app.clone().oneshot(health_request()).await.unwrap();
if res.status() == StatusCode::OK {
recovered = true;
break;
}
tokio::time::sleep(Duration::from_secs(1)).await;
}
assert!(recovered, "Server did not recover after MongoDB failover within 15s");
}
9.4 Fault test: job worker crashes, job re-picked on restart
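Before the test itself, the re-pick rule it exercises can be sketched as a pure predicate. The `JobLock` struct and `is_reclaimable` function are hypothetical stand-ins whose field names mirror the `Job` fields used in the test; timestamps are plain Unix seconds here to keep the sketch dependency-free, and the real runner presumably enforces this inside an atomic database update rather than in application code.

```rust
/// Hypothetical, trimmed-down view of the lock-related fields of a job.
#[derive(Debug, Clone, PartialEq)]
struct JobLock {
    state: &'static str,
    locked_until: Option<u64>, // Unix seconds; None means no lock was taken
    attempts: u32,
    max_attempts: u32,
}

/// A job left in "running" by a crashed worker may be re-picked once its
/// lock has expired, as long as it still has attempts left.
fn is_reclaimable(job: &JobLock, now: u64) -> bool {
    job.state == "running"
        && job.attempts < job.max_attempts
        && job.locked_until.map_or(false, |until| until < now)
}
```

Keeping the rule as a pure function makes the edge cases (live lock, exhausted attempts, missing lock) cheap to unit test without a database.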
#[tokio::test]
async fn crashed_job_is_retried_after_lock_expiry() {
let db = test_mongo_db().await;
let job_repo = JobRepo::new(db.clone());
// Insert a job in "running" state with a stale lock (expired 1 minute ago).
let stale_job = Job {
id: ObjectId::new().to_hex(),
job_type: "msuite_submit".to_string(),
state: "running".to_string(),
locked_until: Some(Utc::now() - Duration::from_secs(60)),
attempts: 1,
max_attempts: 3,
..test_job()
};
job_repo.insert(&stale_job).await.unwrap();
// Start a new job runner instance.
let runner = JobRunner::new(db.clone(), mock_workers());
let picked = runner.poll_next_job().await.unwrap();
// The stale job should be picked up for retry.
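// Unwrap once: calling `picked.unwrap()` twice would move the Option twice.
let picked = picked.expect("stale job should be re-picked");
assert_eq!(picked.id, stale_job.id);
// The retry counts as a new attempt.
assert_eq!(picked.attempts, 2);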
}
9.5 Security test: NoSQL injection attempts blocked
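One way the query-parameter case could be enforced is an allow-list on parameter names, sketched below. Both `is_plain_param_key` and `validate_query` are hypothetical helpers, and the sketch assumes it sees the raw query string; it is not a description of how Conman actually validates input.

```rust
/// Hypothetical guard: a key is accepted only if it is a plain identifier,
/// which rules out `$` operators, bracket syntax, dots, and percent-escapes.
fn is_plain_param_key(key: &str) -> bool {
    !key.is_empty() && key.chars().all(|c| c.is_ascii_alphanumeric() || c == '_')
}

/// Rejects a raw query string if any parameter name is not a plain identifier.
fn validate_query(query: &str) -> Result<(), String> {
    for pair in query.split('&').filter(|p| !p.is_empty()) {
        // Everything before the first '=' is the parameter name.
        let key = pair.split('=').next().unwrap_or("");
        if !is_plain_param_key(key) {
            return Err(format!("invalid query parameter: {key}"));
        }
    }
    Ok(())
}
```

Under this sketch, `name[$gt]=` is rejected before any filter document is built, which corresponds to the 400 branch of the first test below.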
#[tokio::test]
async fn nosql_injection_in_query_param_rejected() {
let app = test_app().await;
// Attempt MongoDB operator injection via query parameter.
let res = app
.oneshot(
Request::builder()
.uri("/api/apps?name[$gt]=")
.header("Authorization", "Bearer valid-token")
.body(Body::empty())
.unwrap(),
)
.await
.unwrap();
// Should not return all apps -- either 400 (bad param) or empty results.
// Must NOT return documents where name > "".
assert!(
res.status() == StatusCode::BAD_REQUEST || {
let body: serde_json::Value = parse_body(res).await;
body["data"].as_array().map_or(true, |arr| arr.is_empty())
}
);
}
#[tokio::test]
async fn nosql_injection_in_json_body_rejected() {
let app = test_app().await;
// Attempt operator injection in JSON body.
// Serde's typed deserialization rejects objects where a String is expected.
let res = app
.oneshot(
Request::builder()
.method("POST")
.uri("/api/apps")
.header("Authorization", "Bearer admin-token")
.header("Content-Type", "application/json")
.body(Body::from(r#"{"name": {"$gt": ""}, "repo_path": "test.git"}"#))
.unwrap(),
)
.await
.unwrap();
// Serde deserialization fails: name expects a string, not an object.
assert_eq!(res.status(), StatusCode::BAD_REQUEST);
}
9.6 Security test: path traversal in file operations blocked
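A validator that would satisfy the test below can be sketched with `std::path`. The function `is_safe_repo_path` is hypothetical. It assumes the framework has already percent-decoded the query string once, so any remaining `%` is treated as a double-encoding attempt, and it assumes Unix-style path semantics.

```rust
use std::path::{Component, Path};

/// Hypothetical validator: accepts only relative paths made of plain
/// components, and refuses anything that could leave the workspace tree.
fn is_safe_repo_path(path: &str) -> bool {
    // Reject empty input, NUL bytes, backslashes, and leftover percent-escapes.
    if path.is_empty() || path.chars().any(|c| matches!(c, '\0' | '\\' | '%')) {
        return false;
    }
    Path::new(path).components().all(|c| match c {
        // `.git` is refused so callers cannot reach repository internals.
        Component::Normal(name) => name != ".git",
        // RootDir, Prefix, ParentDir, and CurDir are all refused.
        _ => false,
    })
}
```

Rejecting `..` outright, rather than normalizing and then checking the prefix, is the more conservative choice: configuration file paths have no legitimate need for parent-directory segments.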
#[tokio::test]
async fn path_traversal_blocked() {
let app = test_app_with_workspace().await;
let traversal_paths = vec![
"../../etc/passwd",
"..%2F..%2Fetc%2Fpasswd",
"config/../../../etc/shadow",
"/etc/passwd",
"config/../../.git/config",
];
for path in traversal_paths {
let res = app
.clone()
.oneshot(
Request::builder()
.uri(&format!(
"/api/apps/test-app/workspaces/ws-1/files?path={}",
path
))
.header("Authorization", "Bearer valid-token")
.body(Body::empty())
.unwrap(),
)
.await
.unwrap();
assert!(
res.status() == StatusCode::BAD_REQUEST || res.status() == StatusCode::FORBIDDEN,
"Path traversal not blocked for: {path}"
);
}
}
9.7 Security test: expired JWT rejected
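The expiry rule under test can be stated as a one-line predicate. The function `is_token_expired` and its `leeway_secs` parameter are hypothetical; the leeway is an assumption added to illustrate how a small clock-skew tolerance could be applied. In practice a JWT library would perform this check.

```rust
/// A token is expired once the current time reaches its `exp` claim,
/// optionally extended by a small leeway to tolerate clock skew.
/// All values are Unix seconds.
fn is_token_expired(exp_unix: u64, now_unix: u64, leeway_secs: u64) -> bool {
    now_unix >= exp_unix.saturating_add(leeway_secs)
}
```

A token that expired an hour ago, as in the test below, is rejected under any reasonable leeway, while a token that expired a few seconds ago is still accepted only if the leeway covers the gap.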
#[tokio::test]
async fn expired_jwt_rejected() {
let app = test_app().await;
// Generate a JWT that expired 1 hour ago.
let expired_token = issue_test_jwt(
"user@example.com",
Utc::now() - chrono::Duration::hours(1),
);
let res = app
.oneshot(
Request::builder()
.uri("/api/apps")
.header("Authorization", format!("Bearer {expired_token}"))
.body(Body::empty())
.unwrap(),
)
.await
.unwrap();
assert_eq!(res.status(), StatusCode::UNAUTHORIZED);
let body: serde_json::Value = parse_body(res).await;
assert_eq!(body["error"]["code"], "unauthorized");
}
9.8 Security test: role escalation attempts blocked
#[tokio::test]
async fn user_cannot_escalate_own_role() {
let app = test_app_with_membership("user@test.com", Role::User).await;
// Attempt to update own role to app_admin via the membership API.
let res = app
.oneshot(
Request::builder()
.method("PATCH")
.uri("/api/apps/test-app/members/self")
.header("Authorization", "Bearer user-token")
.header("Content-Type", "application/json")
.body(Body::from(r#"{"role": "app_admin"}"#))
.unwrap(),
)
.await
.unwrap();
assert_eq!(res.status(), StatusCode::FORBIDDEN);
}
#[tokio::test]
async fn reviewer_cannot_manage_settings() {
let app = test_app_with_membership("reviewer@test.com", Role::Reviewer).await;
let res = app
.oneshot(
Request::builder()
.method("PATCH")
.uri("/api/apps/test-app/settings")
.header("Authorization", "Bearer reviewer-token")
.header("Content-Type", "application/json")
.body(Body::from(r#"{"baseline_mode": "integration_head"}"#))
.unwrap(),
)
.await
.unwrap();
assert_eq!(res.status(), StatusCode::FORBIDDEN);
}
10. Acceptance Criteria
Go-Live Checklist
Every item must be verified and signed off before production deployment.
Performance:
Resilience:
Observability:
Operational:
Security:
No P0 blockers remaining in the issue tracker.