Prerequisites
- Docker running (for test job containers)
- Go 1.26+
- 8GB+ RAM free
- The Strait source code
Quick Start (15 minutes)
Build the test job images and run a quick validation:
# Build test job images
cd packages/load-tests && make build && cd ../..
# Run quick validation (finds approximate throughput ceiling)
cd apps/strait
LOADTEST_QUICK=true go test -tags=loadtest -run TestQuickValidation \
-timeout 15m ./internal/loadtest/...
Test Job Images
The framework includes real workloads in Python, TypeScript, and Go:
| Image | Language | What It Does |
|---|---|---|
| strait-loadtest-python | Python 3.12 | Fast processing, CPU-intensive work, AI agent simulation |
| strait-loadtest-ts | TypeScript (Node 22) | Data pipeline with 10K record transform |
| strait-loadtest-go | Go 1.26 | Memory allocation for OOM testing |
| strait-loadtest-errors | Python 3.12 | 12 failure scenarios (OOM, segfault, infinite loop, etc.) |
Build all images:
cd packages/load-tests && make build
Full Test Suite
Tier 1: Throughput Ceiling
Finds the maximum sustained jobs/sec. Starts at 10 jobs/sec and increases by 10 every 60 seconds until a stop condition is hit.
go test -tags=loadtest -run TestThroughputCeiling -timeout 2h ./internal/loadtest/...
Stop conditions: queue depth > 10K, P99 latency > 5s, or error rate > 1%.
Tier 2: Concurrency Ceiling
Finds the maximum number of concurrent connections. Starts at 50 concurrent connections and increases by 50 every 2 minutes.
go test -tags=loadtest -run TestConcurrencyCeiling -timeout 1h ./internal/loadtest/...
Tier 3: Multi-Tenant Simulation
Simulates real production with hundreds of tenants, mixed plans, and varied traffic patterns.
# 500 tenants, 4 hours
go test -tags=loadtest -run TestProductionSimulation -timeout 6h ./internal/loadtest/...
# 2,000 tenants, 8 hours
LOADTEST_TENANTS=2000 LOADTEST_DURATION=8h \
go test -tags=loadtest -run TestProductionSimulation -timeout 10h ./internal/loadtest/...
Tier 3: Breaking Point
Adds 100 tenants every 30 minutes until performance degrades.
go test -tags=loadtest -run TestBreakingPoint -timeout 12h ./internal/loadtest/...
Tier 4: Endurance (24 hours)
Runs at 70% of throughput ceiling for 24 hours. Detects memory leaks, goroutine leaks, and performance drift.
LOADTEST_DURATION=24h go test -tags=loadtest -run TestEndurance -timeout 26h ./internal/loadtest/...
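To make the 70% target and the leak check concrete, here is a minimal sketch. It assumes heap size is sampled periodically and fits a least-squares line to the samples; a slope clearly above zero suggests linear growth. The function names and the sample values are made up for illustration and are not taken from the harness.

```go
package main

import "fmt"

// enduranceRate returns 70% of a measured throughput ceiling (jobs/sec).
func enduranceRate(ceiling float64) float64 {
	return 0.7 * ceiling
}

// slope fits y = a + b*x by ordinary least squares and returns b.
// Here xs are hours since the start and ys are heap sizes in MB.
// It returns 0 when the inputs are too short, mismatched, or degenerate.
func slope(xs, ys []float64) float64 {
	if len(xs) < 2 || len(xs) != len(ys) {
		return 0
	}
	n := float64(len(xs))
	var sumX, sumY, sumXY, sumXX float64
	for i := range xs {
		sumX += xs[i]
		sumY += ys[i]
		sumXY += xs[i] * ys[i]
		sumXX += xs[i] * xs[i]
	}
	denom := n*sumXX - sumX*sumX
	if denom == 0 {
		return 0
	}
	return (n*sumXY - sumX*sumY) / denom
}

func main() {
	hours := []float64{0, 6, 12, 18, 24}
	heapMB := []float64{200, 230, 260, 290, 320} // made-up samples
	fmt.Printf("target rate: %.0f jobs/sec\n", enduranceRate(1000))
	fmt.Printf("heap growth: %.1f MB/hour\n", slope(hours, heapMB))
}
```

The same fit can be applied to goroutine counts or P99 latency to spot drift over the run.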
Tier 5: Chaos Engineering
Breaks things on purpose during production load. 8 scenarios: worker kill, database failover, Redis failure, Docker restart, connection pool exhaustion, disk pressure, clock skew, cascading failure.
go test -tags=loadtest -run TestChaosAll -timeout 4h ./internal/loadtest/...
Error Scenarios
Tests all 12 failure modes, including clean exit, exit codes, OOM, segfault, infinite loop, slow death, checkpoint recovery, SDK timeout, fork bomb, disk fill, and network abuse.
go test -tags=loadtest -run TestErrorScenarios -timeout 1h ./internal/loadtest/...
Generating Reports
After running tests, generate HTML and JSON reports:
go run -tags=loadtest ./internal/loadtest/cmd/report \
-input loadtest-results/latest/ \
-html report.html -json report.json
The HTML report includes:
- Executive summary with key metrics
- Throughput and concurrency ramp tables
- Multi-tenant simulation results
- Chaos engineering verdicts
- Error scenario pass/fail matrix
Grafana Dashboard
The load test environment includes a pre-configured Grafana dashboard for real-time monitoring.
Setup
# Start the full load test stack
cd apps/strait
docker compose -f docker-compose.loadtest.yml up -d
# Open Grafana (default: admin/admin)
open http://localhost:3001
The dashboard shows:
- Queue depth and active workers
- Throughput and dispatch latency (P50/P95/P99)
- Error rates and worker pool utilization
- Database connection pool breakdown
- Webhook delivery metrics
- Go runtime (goroutines, heap memory, GC pauses)
Prometheus scrapes Strait’s /metrics endpoint every 5 seconds, so panels update in near real-time during load tests.
Understanding Your Results
What each metric means
| Metric | Good | Warning | Action |
|---|---|---|---|
| Max throughput | > 1,000/sec | < 500/sec | Check DB connection pool, query optimization |
| P99 latency | < 500ms | > 2s | Profile hot paths, check indexes |
| Error rate | < 0.01% | > 0.1% | Check logs for root cause |
| Memory trend | Flat over 24h | Linear growth | Check for goroutine or connection leaks |
| Queue depth | Returns to 0 | Growing | Worker count too low or dequeue too slow |
Environment variables
| Variable | Default | Description |
|---|---|---|
| LOADTEST_STRAIT_URL | http://localhost:8080 | Strait API URL |
| LOADTEST_INTERNAL_SECRET | $INTERNAL_SECRET | API authentication secret |
| LOADTEST_DATABASE_URL | $DATABASE_URL | PostgreSQL connection for metrics |
| LOADTEST_REDIS_URL | $REDIS_URL | Redis connection for metrics |
| LOADTEST_QUICK | - | Set to true for 15-min quick validation |
| LOADTEST_TENANTS | 500 | Tenant count for production simulation |
| LOADTEST_DURATION | 4h | Duration for simulation/endurance tests |
| LOADTEST_TARGET_RATE | auto | Override target rate for endurance tests |
Tuning Based on Results
| Bottleneck | Symptom | Fix |
|---|---|---|
| Queue throughput | Queue depth growing, dequeue rate flat | Increase DB_MAX_CONNS (default: 50), optimize dequeue query |
| Concurrent limit | Errors spike at N concurrent | Increase WORKER_CONCURRENCY |
| Memory growth | RSS increases linearly over 24h | Check for leaked goroutines, unclosed connections |
| Webhook delivery | Webhook latency spiking | Increase WEBHOOK_CONCURRENCY, check endpoint health |
| Database connections | wait_count increasing | Increase DB_MAX_CONNS (default: 50), add connection pooler |
| Redis memory | Used memory > maxmemory | Increase maxmemory, review eviction policy |

The load test harness uses HTTP keep-alives with connection pooling for realistic measurements that match production client behavior.