This page is the implementation-level reference for Strait’s workflow DAG engine. If you need the high-level conceptual overview, start with Workflows. This document focuses on what the runtime does today, the exact controls that exist, and how to operate and debug DAG runs safely. For step-by-step incident handling and operator runbooks, see the DAG Operations Playbook.
What is implemented today
Strait DAG execution currently includes:

- DAG validation and cycle rejection before execution
- Multi-parent fan-in with atomic dependency counters
- Per-step condition evaluation and skip decisions
- Step-level concurrency control (`concurrency_key`)
- Resource-class scheduling limits (`small`, `medium`, `large`)
- Human approvals, event waits, durable sleep, and sub-workflow steps
- Project-level workflow policies (`max_fan_out`, `max_depth`, forbidden step types, deploy approval requirement)
- Explainability stream (`workflow_step_decisions`) with API querying
- Runtime graph introspection with critical-path estimates
- Branch-local recovery operations (retry step, replay subtree)
- Stalled workflow reconciliation via scheduler reaper policy
- Compensating transactions with reverse-order rollback handlers on step failure
- Expected completion tracking via critical-path analysis through the DAG
- Stage notifications on step completion, failure, or skip transitions
- Canary deployments with weighted version routing and auto-promote/rollback
- Workflow simulation with dry-run, sandbox, and failure-injection modes
- Visual debug view with per-step inspection, data flow, and cross-run comparison
DAG runtime model
A workflow run materializes as many `workflow_step_runs` as there are steps in the selected workflow version snapshot.
Key runtime fields:
- `deps_required`: number of direct dependencies declared by the step
- `deps_completed`: number of completed/advanced dependencies observed so far
- `status`: `pending` | `waiting` | `running` | `completed` | `failed` | `skipped` | `canceled`
- `attempt`: step-level attempt counter
Scheduling and fan-in mechanics
Atomic fan-in
When a parent step completes, the engine increments dependency counters for its children with atomic SQL updates. A child becomes runnable when:

- `deps_completed == deps_required`
- it is not already terminal or running
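The counter update can be sketched as a single conditional SQL update. The schema below is a hypothetical in-memory stand-in that mirrors the runtime fields above; the engine's real SQL and table layout may differ:

```python
import sqlite3

# Hypothetical minimal schema mirroring the runtime fields above.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE workflow_step_runs (
        step_ref       TEXT PRIMARY KEY,
        status         TEXT NOT NULL,
        deps_required  INTEGER NOT NULL,
        deps_completed INTEGER NOT NULL DEFAULT 0
    )
""")
# "notify" fans in from two parents.
conn.execute("INSERT INTO workflow_step_runs VALUES ('notify', 'pending', 2, 0)")

def on_parent_completed(child_ref: str) -> bool:
    """Atomically bump the child's counter; report whether it became runnable."""
    cur = conn.execute(
        """
        UPDATE workflow_step_runs
           SET deps_completed = deps_completed + 1
         WHERE step_ref = ?
           AND status NOT IN
               ('running', 'completed', 'failed', 'skipped', 'canceled')
        """,
        (child_ref,),
    )
    done, required = conn.execute(
        "SELECT deps_completed, deps_required FROM workflow_step_runs"
        " WHERE step_ref = ?",
        (child_ref,),
    ).fetchone()
    return cur.rowcount == 1 and done == required

first = on_parent_completed("notify")   # one of two deps done: not runnable
second = on_parent_completed("notify")  # both deps done: runnable
```

Guarding the increment with the status filter is what makes the fan-in safe: a terminal or already-running child never has its counter advanced.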
Targeted scheduling reads
The scheduler uses targeted read sets per workflow run:

- statuses map: `step_ref -> status`
- currently running step runs
- currently runnable step runs
Scheduling blockers (recorded decisions)
Before starting a runnable step, the scheduler checks:

- Workflow-level parallelism cap: `max_parallel_steps`
- Step-level serialization: `concurrency_key`
- Resource-class capacity (`resource_class`)
- Step condition evaluation

Each blocking decision is recorded in `workflow_step_decisions` for explainability.
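The check order can be sketched as a pure function that returns the first blocker. The limit values mirror this page's resource classes; the data shapes and function signature are illustrative assumptions:

```python
# Illustrative sketch of the pre-start checks; data shapes are assumptions.
RESOURCE_LIMITS = {"small": 50, "medium": 20, "large": 5}

def blocking_reason(step: dict, run_state: dict):
    """Return the first scheduling blocker for a runnable step, or None."""
    if run_state["running_count"] >= run_state["max_parallel_steps"]:
        return "max_parallel_steps"
    key = step.get("concurrency_key")
    if key and key in run_state["active_concurrency_keys"]:
        return "concurrency_key"
    cls = step.get("resource_class", "small")
    if run_state["class_running"].get(cls, 0) >= RESOURCE_LIMITS[cls]:
        return "resource_class"
    cond = step.get("condition")
    if cond is not None and not cond(run_state):
        return "condition"
    return None  # clear to start

state = {
    "running_count": 1,
    "max_parallel_steps": 4,
    "active_concurrency_keys": {"deploy-prod"},
    "class_running": {"large": 5},
}
print(blocking_reason({"concurrency_key": "deploy-prod"}, state))  # concurrency_key
print(blocking_reason({"resource_class": "large"}, state))         # resource_class
print(blocking_reason({}, state))                                  # None
```

In the real engine, whichever blocker fires would be the decision recorded to `workflow_step_decisions`.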
Resource classes
`resource_class` is a step-level scheduling hint and quota bucket.
Current runtime limits are enforced in scheduler logic:

- `small`: 50 concurrent running steps
- `medium`: 20
- `large`: 5

Steps without an explicit `resource_class` default to `small`.
Condition DSL
The condition evaluator currently supports:

- `step_status`, `step_status_in`
- `all_of`, `any_of`, `not`
- `eq`, `ne`, `gt`, `gte`, `lt`, `lte`
- `contains`, `in`, `regex`, `exists`

If a step's condition evaluates to false, the step is marked `skipped` and progression continues.
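A hedged sketch of how such a tree of operators might be evaluated recursively. Only the operator names come from this page; the wire format (dict keys like `op`, `conditions`, `field`, `value`) is an assumption:

```python
import re

# Assumed condition shape: {"op": ..., plus operator-specific keys}.
def evaluate(cond: dict, ctx: dict) -> bool:
    op = cond["op"]
    if op == "all_of":
        return all(evaluate(c, ctx) for c in cond["conditions"])
    if op == "any_of":
        return any(evaluate(c, ctx) for c in cond["conditions"])
    if op == "not":
        return not evaluate(cond["condition"], ctx)
    if op == "step_status":
        return ctx["statuses"].get(cond["step_ref"]) == cond["value"]
    if op == "step_status_in":
        return ctx["statuses"].get(cond["step_ref"]) in cond["value"]
    if op == "exists":
        return cond["field"] in ctx["fields"]
    # Remaining operators compare a context field against a literal value.
    field, value = ctx["fields"].get(cond["field"]), cond["value"]
    comparisons = {
        "eq": lambda: field == value,
        "ne": lambda: field != value,
        "gt": lambda: field > value,
        "gte": lambda: field >= value,
        "lt": lambda: field < value,
        "lte": lambda: field <= value,
        "contains": lambda: value in field,
        "in": lambda: field in value,
        "regex": lambda: re.search(value, field) is not None,
    }
    return comparisons[op]()

ctx = {"statuses": {"build": "completed"}, "fields": {"env": "prod"}}
gate = {"op": "all_of", "conditions": [
    {"op": "step_status", "step_ref": "build", "value": "completed"},
    {"op": "eq", "field": "env", "value": "prod"},
]}
print(evaluate(gate, ctx))  # True
```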
Workflow policy controls
Project policies are configured via:

- `PUT /v1/workflow-policies/{projectID}`
- `GET /v1/workflow-policies/{projectID}`

Supported policy fields:

- `max_fan_out`
- `max_depth`
- `forbidden_step_types[]`
- `require_approval_for_deploy`

Policies are enforced when you:

- create workflow
- update workflow
- trigger workflow
Explainability APIs
Step decision stream
`GET /v1/workflow-runs/{workflowRunID}/explain`

Optional filters:

- `step_ref`
- `decision_type`

Each returned decision record includes its fields (`decision`, `explanation`, `details`, `timestamp`) so operators can answer: why did this step wait or skip?
Runtime graph + critical path
`GET /v1/workflow-runs/{workflowRunID}/graph`

Returns:

- nodes (state/timing per step)
- edges (dependency links)
- roots
- runnable set
- `critical_path`, `critical_path_estimate_ms`, `critical_path_remaining_ms`

Per-step duration estimates are computed as follows:

- terminal steps: use observed runtime duration
- running steps: use elapsed time so far
- pending/waiting steps: use `timeout_secs_override` when present
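Under those rules, the critical-path estimate is a sum of per-step estimates along the path. A minimal sketch, where the field names are assumptions mirroring this page and timestamps are in seconds:

```python
# Per-step estimate following the three rules above (assumed field names).
def estimate_ms(step: dict, now: float) -> float:
    terminal = ("completed", "failed", "skipped", "canceled")
    if step["status"] in terminal:                      # observed duration
        return (step["finished_at"] - step["started_at"]) * 1000
    if step["status"] == "running":                     # elapsed so far
        return (now - step["started_at"]) * 1000
    return step.get("timeout_secs_override", 0) * 1000  # pending/waiting

path = [
    {"status": "completed", "started_at": 0, "finished_at": 12},
    {"status": "running", "started_at": 100},
    {"status": "pending", "timeout_secs_override": 30},
]
critical_path_estimate_ms = sum(estimate_ms(s, now=105) for s in path)
print(critical_path_estimate_ms)  # 47000
```

Note that pending steps without a timeout override contribute nothing, so the estimate is a lower bound in that case.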
Runtime recovery APIs
Retry one step
`POST /v1/workflow-runs/{workflowRunID}/steps/{stepRef}/retry`

Behavior:

- requires the target step run to be terminal
- resets the step run to `pending`
- clears started/finished/error/output/event_key
- resumes workflow progression
Replay a subtree
`POST /v1/workflow-runs/{workflowRunID}/steps/{stepRef}/replay-subtree`

Behavior:

- computes descendants from the workflow version DAG
- resets the selected step and its descendants to `pending`
- resumes workflow progression
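Descendant computation is a plain traversal of the version DAG. A sketch, assuming a hypothetical parent-to-children adjacency map:

```python
from collections import deque

# BFS over an assumed parent -> children adjacency map of the version DAG.
def subtree(dag: dict, root: str) -> set:
    seen, queue = {root}, deque([root])
    while queue:
        for child in dag.get(queue.popleft(), ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

dag = {"build": ["test", "lint"], "test": ["deploy"], "lint": ["deploy"]}
print(sorted(subtree(dag, "test")))  # ['deploy', 'test']
# Replay then resets every step run in this set to pending and resumes.
```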
Version insight APIs
Diff versions
`GET /v1/workflows/{workflowID}/versions/{fromVersionID}/diff/{toVersionID}`
Returns step refs added/removed between versions.
Version impact
`GET /v1/workflows/{workflowID}/versions/{versionID}/impact`
Returns how many sampled runs match the requested workflow version.
Simulate snapshot
`POST /v1/workflows/{workflowID}/simulate`
Returns snapshot-level ordering metadata (`predicted_order`, `step_count`) for the current workflow version.
API examples (runtime and recovery)
Explain a blocked or skipped step
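A hedged example: the endpoint path and `step_ref` filter come from this page, while the base URL, token variable, and IDs are placeholders:

```shell
# Placeholder host/credentials -- substitute your deployment's values.
curl -s "https://api.example.com/v1/workflow-runs/$RUN_ID/explain?step_ref=deploy" \
  -H "Authorization: Bearer $STRAIT_TOKEN"
```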
Inspect graph and critical path
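A sketch of the graph call; only the endpoint path comes from this page, the host and auth header are assumptions:

```shell
# Returns nodes, edges, roots, runnable set, and critical-path fields.
curl -s "https://api.example.com/v1/workflow-runs/$RUN_ID/graph" \
  -H "Authorization: Bearer $STRAIT_TOKEN"
```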
Retry a terminal step
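A sketch with placeholder host, IDs, and step ref (`deploy`); the path and terminal-state requirement come from this page:

```shell
# Rejected unless the target step run is in a terminal state.
curl -s -X POST \
  "https://api.example.com/v1/workflow-runs/$RUN_ID/steps/deploy/retry" \
  -H "Authorization: Bearer $STRAIT_TOKEN"
```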
Replay a failed subtree
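Same placeholders as above; the replay-subtree path comes from this page:

```shell
# Resets the step and all of its descendants to pending, then resumes.
curl -s -X POST \
  "https://api.example.com/v1/workflow-runs/$RUN_ID/steps/deploy/replay-subtree" \
  -H "Authorization: Bearer $STRAIT_TOKEN"
```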
Upsert project policy
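A sketch of the policy upsert; the field names come from this page, while the host, auth, and all values are illustrative:

```shell
# Policy values here are examples only, not recommended defaults.
curl -s -X PUT "https://api.example.com/v1/workflow-policies/$PROJECT_ID" \
  -H "Authorization: Bearer $STRAIT_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "max_fan_out": 20,
    "max_depth": 5,
    "forbidden_step_types": ["deploy"],
    "require_approval_for_deploy": true
  }'
```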
Reaper-driven DAG safety nets
The scheduler reaper includes stalled workflow detection (`WF_STALL_THRESHOLD`) with a configurable action (`WF_STALL_ACTION`):

- `log_only`
- `reconcile` (resume callback)
- `fail_workflow`
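A sketch of a scheduler environment tuned for auto-healing; the variable names come from this page, but the duration syntax for the threshold is an assumption:

```shell
# Assumed duration format -- confirm against your scheduler's config reference.
export WF_STALL_THRESHOLD=15m
export WF_STALL_ACTION=reconcile
```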
Known operational limits
- Max workflow steps per definition: 1000
- `concurrency_key` max length: 128 chars
- `event_key` max length: 512 chars
- Sub-workflow default nesting depth limit: 10
Recommended operator workflow
When a workflow appears stuck:

1. Check the graph: `/workflow-runs/{id}/graph`
2. Check decisions: `/workflow-runs/{id}/explain`
3. Inspect policy constraints for the project
4. Retry a single step or replay the subtree if the failure is localized
5. Use the stalled-workflow reaper action `reconcile` in environments where auto-healing is preferred