This page is the implementation-level reference for Strait’s workflow DAG engine. If you need the high-level conceptual overview, start with Workflows. This document focuses on what the runtime does today, the exact controls that exist, and how to operate and debug DAG runs safely. For step-by-step incident handling and operator runbooks, see the DAG Operations Playbook.
What is implemented today
Strait DAG execution currently includes:

- DAG validation and cycle rejection before execution
- Multi-parent fan-in with atomic dependency counters
- Per-step condition evaluation and skip decisions
- Step-level concurrency control (`concurrency_key`)
- Resource-class scheduling limits (`small`, `medium`, `large`)
- Human approvals, event waits, durable sleep, and sub-workflow steps
- Project-level workflow policies (`max_fan_out`, `max_depth`, forbidden step types, deploy approval requirement)
- Explainability stream (`workflow_step_decisions`) with API querying
- Runtime graph introspection with critical-path estimates
- Branch-local recovery operations (retry step, replay subtree)
- Stalled workflow reconciliation via scheduler reaper policy
- Compensating transactions with reverse-order rollback handlers on step failure
- Expected completion tracking via critical-path analysis through the DAG
- Stage notifications on step completion, failure, or skip transitions
- Canary deployments with weighted version routing and auto-promote/rollback
- Workflow simulation with dry-run, sandbox, and failure-injection modes
- Visual debug view with per-step inspection, data flow, and cross-run comparison
DAG runtime model
A workflow run materializes as many `workflow_step_runs` as there are steps in the selected workflow version snapshot.
Key runtime fields:
- `deps_required`: number of direct dependencies declared by the step
- `deps_completed`: number of completed/advanced dependencies observed so far
- `status`: `pending` | `waiting` | `running` | `completed` | `failed` | `skipped` | `canceled`
- `attempt`: step-level attempt counter
Scheduling and fan-in mechanics
Atomic fan-in
When a parent step completes, the engine increments dependency counters for its children with atomic SQL updates. A child becomes runnable when:

- `deps_completed == deps_required`
- it is not already terminal or running
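The counter update can be sketched as a single conditional SQL update. The schema below is a hypothetical in-memory stand-in that mirrors the runtime fields above; the engine's real SQL and table layout may differ:

```python
import sqlite3

# Hypothetical minimal schema mirroring the runtime fields above.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE workflow_step_runs (
        step_ref       TEXT PRIMARY KEY,
        status         TEXT NOT NULL,
        deps_required  INTEGER NOT NULL,
        deps_completed INTEGER NOT NULL DEFAULT 0
    )
""")
# "notify" fans in from two parents.
conn.execute("INSERT INTO workflow_step_runs VALUES ('notify', 'pending', 2, 0)")

def on_parent_completed(child_ref: str) -> bool:
    """Atomically bump the child's counter; report whether it became runnable."""
    cur = conn.execute(
        """
        UPDATE workflow_step_runs
           SET deps_completed = deps_completed + 1
         WHERE step_ref = ?
           AND status NOT IN
               ('running', 'completed', 'failed', 'skipped', 'canceled')
        """,
        (child_ref,),
    )
    done, required = conn.execute(
        "SELECT deps_completed, deps_required FROM workflow_step_runs"
        " WHERE step_ref = ?",
        (child_ref,),
    ).fetchone()
    return cur.rowcount == 1 and done == required

first = on_parent_completed("notify")   # one of two deps done: not runnable
second = on_parent_completed("notify")  # both deps done: runnable
```

Guarding the increment with the status filter is what makes the fan-in safe: a terminal or already-running child never has its counter advanced.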
Targeted scheduling reads
The scheduler uses targeted read sets per workflow run:

- statuses map: `step_ref -> status`
- currently running step runs
- currently runnable step runs
Scheduling blockers (recorded decisions)
Before starting a runnable step, the scheduler checks:

- Workflow-level parallelism cap: `max_parallel_steps`
- Step-level serialization: `concurrency_key`
- Resource-class capacity (`resource_class`)
- Step condition evaluation

Each blocking decision is recorded in `workflow_step_decisions` for explainability.
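The check order can be sketched as a pure function that returns the first blocker. The limit values mirror this page's resource classes; the data shapes and function signature are illustrative assumptions:

```python
# Illustrative sketch of the pre-start checks; data shapes are assumptions.
RESOURCE_LIMITS = {"small": 50, "medium": 20, "large": 5}

def blocking_reason(step: dict, run_state: dict):
    """Return the first scheduling blocker for a runnable step, or None."""
    if run_state["running_count"] >= run_state["max_parallel_steps"]:
        return "max_parallel_steps"
    key = step.get("concurrency_key")
    if key and key in run_state["active_concurrency_keys"]:
        return "concurrency_key"
    cls = step.get("resource_class", "small")
    if run_state["class_running"].get(cls, 0) >= RESOURCE_LIMITS[cls]:
        return "resource_class"
    cond = step.get("condition")
    if cond is not None and not cond(run_state):
        return "condition"
    return None  # clear to start

state = {
    "running_count": 1,
    "max_parallel_steps": 4,
    "active_concurrency_keys": {"deploy-prod"},
    "class_running": {"large": 5},
}
print(blocking_reason({"concurrency_key": "deploy-prod"}, state))  # concurrency_key
print(blocking_reason({"resource_class": "large"}, state))         # resource_class
print(blocking_reason({}, state))                                  # None
```

In the real engine, whichever blocker fires would be the decision recorded to `workflow_step_decisions`.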
Resource classes
`resource_class` is a step-level scheduling hint and quota bucket.
Current runtime limits are enforced in scheduler logic:

- `small`: 50 concurrent running steps
- `medium`: 20
- `large`: 5

Steps without an explicit `resource_class` default to `small`.
Condition DSL
The condition evaluator currently supports:

- `step_status`, `step_status_in`
- `all_of`, `any_of`, `not`
- `eq`, `ne`, `gt`, `gte`, `lt`, `lte`
- `contains`, `in`, `regex`, `exists`

If a step's condition evaluates to false, the step is marked `skipped` and progression continues.
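A hedged sketch of how such a tree of operators might be evaluated recursively. Only the operator names come from this page; the wire format (dict keys like `op`, `conditions`, `field`, `value`) is an assumption:

```python
import re

# Assumed condition shape: {"op": ..., plus operator-specific keys}.
def evaluate(cond: dict, ctx: dict) -> bool:
    op = cond["op"]
    if op == "all_of":
        return all(evaluate(c, ctx) for c in cond["conditions"])
    if op == "any_of":
        return any(evaluate(c, ctx) for c in cond["conditions"])
    if op == "not":
        return not evaluate(cond["condition"], ctx)
    if op == "step_status":
        return ctx["statuses"].get(cond["step_ref"]) == cond["value"]
    if op == "step_status_in":
        return ctx["statuses"].get(cond["step_ref"]) in cond["value"]
    if op == "exists":
        return cond["field"] in ctx["fields"]
    # Remaining operators compare a context field against a literal value.
    field, value = ctx["fields"].get(cond["field"]), cond["value"]
    comparisons = {
        "eq": lambda: field == value,
        "ne": lambda: field != value,
        "gt": lambda: field > value,
        "gte": lambda: field >= value,
        "lt": lambda: field < value,
        "lte": lambda: field <= value,
        "contains": lambda: value in field,
        "in": lambda: field in value,
        "regex": lambda: re.search(value, field) is not None,
    }
    return comparisons[op]()

ctx = {"statuses": {"build": "completed"}, "fields": {"env": "prod"}}
gate = {"op": "all_of", "conditions": [
    {"op": "step_status", "step_ref": "build", "value": "completed"},
    {"op": "eq", "field": "env", "value": "prod"},
]}
print(evaluate(gate, ctx))  # True
```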
Workflow policy controls
Project policies are configured via:

- `PUT /v1/workflow-policies/{projectID}`
- `GET /v1/workflow-policies/{projectID}`

Supported policy fields:

- `max_fan_out`
- `max_depth`
- `forbidden_step_types[]`
- `require_approval_for_deploy`

Policies are enforced when you:

- create workflow
- update workflow
- trigger workflow
Explainability APIs
Step decision stream
`GET /v1/workflow-runs/{workflowRunID}/explain`

Optional filters:

- `step_ref`
- `decision_type`

Each returned decision record includes its fields (`decision`, `explanation`, `details`, `timestamp`) so operators can answer: why did this step wait or skip?
Runtime graph + critical path
`GET /v1/workflow-runs/{workflowRunID}/graph`

Returns:

- nodes (state/timing per step)
- edges (dependency links)
- roots
- runnable set
- `critical_path`, `critical_path_estimate_ms`, `critical_path_remaining_ms`

Per-step duration estimates are computed as follows:

- terminal steps: use observed runtime duration
- running steps: use elapsed time so far
- pending/waiting steps: use `timeout_secs_override` when present
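Under those rules, the critical-path estimate is a sum of per-step estimates along the path. A minimal sketch, where the field names are assumptions mirroring this page and timestamps are in seconds:

```python
# Per-step estimate following the three rules above (assumed field names).
def estimate_ms(step: dict, now: float) -> float:
    terminal = ("completed", "failed", "skipped", "canceled")
    if step["status"] in terminal:                      # observed duration
        return (step["finished_at"] - step["started_at"]) * 1000
    if step["status"] == "running":                     # elapsed so far
        return (now - step["started_at"]) * 1000
    return step.get("timeout_secs_override", 0) * 1000  # pending/waiting

path = [
    {"status": "completed", "started_at": 0, "finished_at": 12},
    {"status": "running", "started_at": 100},
    {"status": "pending", "timeout_secs_override": 30},
]
critical_path_estimate_ms = sum(estimate_ms(s, now=105) for s in path)
print(critical_path_estimate_ms)  # 47000
```

Note that pending steps without a timeout override contribute nothing, so the estimate is a lower bound in that case.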
Runtime recovery APIs
Retry one step
`POST /v1/workflow-runs/{workflowRunID}/steps/{stepRef}/retry`

Behavior:

- requires the target step run to be terminal
- resets the step run to `pending`
- clears started/finished/error/output/event_key
- resumes workflow progression
Replay a subtree
`POST /v1/workflow-runs/{workflowRunID}/steps/{stepRef}/replay-subtree`

Behavior:

- computes descendants from the workflow version DAG
- resets the selected step and its descendants to `pending`
- resumes workflow progression
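Descendant computation is a plain traversal of the version DAG. A sketch, assuming a hypothetical parent-to-children adjacency map:

```python
from collections import deque

# BFS over an assumed parent -> children adjacency map of the version DAG.
def subtree(dag: dict, root: str) -> set:
    seen, queue = {root}, deque([root])
    while queue:
        for child in dag.get(queue.popleft(), ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

dag = {"build": ["test", "lint"], "test": ["deploy"], "lint": ["deploy"]}
print(sorted(subtree(dag, "test")))  # ['deploy', 'test']
# Replay then resets every step run in this set to pending and resumes.
```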
Version insight APIs
Diff versions
`GET /v1/workflows/{workflowID}/versions/{fromVersionID}/diff/{toVersionID}`
Returns step refs added/removed between versions.
Version impact
`GET /v1/workflows/{workflowID}/versions/{versionID}/impact`
Returns how many sampled runs match the requested workflow version.
Simulate snapshot
`POST /v1/workflows/{workflowID}/simulate`
Returns snapshot-level ordering metadata (`predicted_order`, `step_count`) for the current workflow version.
API examples (runtime and recovery)
Explain a blocked or skipped step
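A hedged example: the endpoint path and `step_ref` filter come from this page, while the base URL, token variable, and IDs are placeholders:

```shell
# Placeholder host/credentials -- substitute your deployment's values.
curl -s "https://api.example.com/v1/workflow-runs/$RUN_ID/explain?step_ref=deploy" \
  -H "Authorization: Bearer $STRAIT_TOKEN"
```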
Inspect graph and critical path
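A sketch of the graph call; only the endpoint path comes from this page, the host and auth header are assumptions:

```shell
# Returns nodes, edges, roots, runnable set, and critical-path fields.
curl -s "https://api.example.com/v1/workflow-runs/$RUN_ID/graph" \
  -H "Authorization: Bearer $STRAIT_TOKEN"
```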
Retry a terminal step
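A sketch with placeholder host, IDs, and step ref (`deploy`); the path and terminal-state requirement come from this page:

```shell
# Rejected unless the target step run is in a terminal state.
curl -s -X POST \
  "https://api.example.com/v1/workflow-runs/$RUN_ID/steps/deploy/retry" \
  -H "Authorization: Bearer $STRAIT_TOKEN"
```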
Replay a failed subtree
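Same placeholders as above; the replay-subtree path comes from this page:

```shell
# Resets the step and all of its descendants to pending, then resumes.
curl -s -X POST \
  "https://api.example.com/v1/workflow-runs/$RUN_ID/steps/deploy/replay-subtree" \
  -H "Authorization: Bearer $STRAIT_TOKEN"
```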
Upsert project policy
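A sketch of the policy upsert; the field names come from this page, while the host, auth, and all values are illustrative:

```shell
# Policy values here are examples only, not recommended defaults.
curl -s -X PUT "https://api.example.com/v1/workflow-policies/$PROJECT_ID" \
  -H "Authorization: Bearer $STRAIT_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "max_fan_out": 20,
    "max_depth": 5,
    "forbidden_step_types": ["deploy"],
    "require_approval_for_deploy": true
  }'
```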
Reaper-driven DAG safety nets
The scheduler reaper includes stalled workflow detection (`WF_STALL_THRESHOLD`) with a configurable action (`WF_STALL_ACTION`):

- `log_only`
- `reconcile` (resume callback)
- `fail_workflow`
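A sketch of a scheduler environment tuned for auto-healing; the variable names come from this page, but the duration syntax for the threshold is an assumption:

```shell
# Assumed duration format -- confirm against your scheduler's config reference.
export WF_STALL_THRESHOLD=15m
export WF_STALL_ACTION=reconcile
```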
Known operational limits
- Max workflow steps per definition: 1000
- `concurrency_key` max length: 128 chars
- `event_key` max length: 512 chars
- Sub-workflow default nesting depth limit: 10
Recommended operator workflow
When a workflow appears stuck:

1. Check the graph: `/workflow-runs/{id}/graph`
2. Check decisions: `/workflow-runs/{id}/explain`
3. Inspect policy constraints for the project
4. Retry a single step or replay the subtree if the failure is localized
5. Use the stalled-workflow reaper action `reconcile` in environments where auto-healing is preferred