Swarm Architecture
Overview
The swarm execution system collapses Rune's microservice architecture into a single-process "fat orchestrator" for local hardware. It coordinates multiple agents, training workers, and an evolution loop via asyncio.TaskGroup.
Architecture Diagram
graph TB
CLI[swarm.py CLI] --> Orch[Tier-1 Orchestrator]
Orch --> HP[HardwareProbe]
HP --> Budget[HardwareBudget]
Orch --> |spawns N| Agent[Agent Supervisors]
Orch --> TPM[Training Pool Manager]
Orch --> EW[Evolution Worker]
Orch --> MW[Memory Watchdog]
Agent --> Sandbox[Sandbox Backend]
Agent --> CDB[Checkpoint DB]
Agent --> |enqueue| TQ[Training Queue]
TQ --> TPM
TPM --> |ProcessPoolExecutor| Train[train_qlora]
TPM --> |single GPU| Sleep[vLLM sleep/wake]
EW --> Registry[Adapter Registry]
EW --> Merge[TIES/DARE Merge]
MW --> |cancel on pressure| Agent
Components
| Component | Location | Role |
|---|---|---|
| HardwareProbe | libs/shared/src/shared/hardware.py | Detect CPU, RAM, GPU resources |
| SandboxBackend | libs/shared/src/shared/sandbox.py | Execute code safely (subprocess/nsjail) |
| SwarmCheckpointDB | libs/shared/src/shared/checkpoint_db.py | Track task execution state |
| AdapterRegistry | libs/adapter-registry/ | Store and query adapter metadata |
| Training Pool | scripts/swarm_workers.py | Manage concurrent training jobs |
| Evolution Worker | scripts/swarm_evolution.py | Periodic merge + prune sweeps |
| Swarm Orchestrator | scripts/swarm.py | Top-level coordinator |
GPU Time-Sharing
In single-GPU mode, the training pool manager coordinates with vLLM:
- Sleep: POST /sleep to release GPU memory
- Train: run QLoRA training in a subprocess
- Wake: POST /wake_up to reclaim GPU memory
Multi-GPU mode skips sleep/wake and uses dedicated GPUs.
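The sleep/train/wake sequence can be sketched as a small wrapper around the training call. This is an illustrative sketch only: `train_with_gpu_handoff`, `http_post`, and the `base_url` parameter are hypothetical names, the request bodies are assumed to be empty, and waking vLLM in a `finally` block (so it is restored even if training fails) is a design choice of this sketch rather than documented behavior.

```python
import urllib.request
from typing import Callable, TypeVar

T = TypeVar("T")


def http_post(url: str) -> None:
    """Send a POST with an empty body using only the standard library."""
    request = urllib.request.Request(url, data=b"", method="POST")
    with urllib.request.urlopen(request) as response:
        response.read()


def train_with_gpu_handoff(
    train: Callable[[], T],
    base_url: str,
    single_gpu: bool,
    post: Callable[[str], None] = http_post,
) -> T:
    """Run `train`, putting vLLM to sleep around it in single-GPU mode."""
    if not single_gpu:
        # Multi-GPU mode: training has a dedicated GPU, so no handoff.
        return train()

    post(f"{base_url}/sleep")  # ask vLLM to release GPU memory
    try:
        return train()  # stand-in for the QLoRA training subprocess
    finally:
        post(f"{base_url}/wake_up")  # let vLLM reclaim GPU memory
```

Passing `post` as a parameter keeps the ordering logic testable without a running vLLM server; a caller would use the default and supply the server's base URL.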
Evolution Strategy
Every evolution_interval seconds:
- For each task type with ≥5 adapters, TIES-merge the top 3
- Archive any adapter with fitness < 0.3
- New merged adapters inherit generation = max(parents) + 1
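One sweep of the rules above could look like the following sketch. The `Adapter` dataclass, its field names, and `evolution_sweep` are hypothetical and do not reflect the real AdapterRegistry schema. The actual TIES merge of weights is omitted, and several details are assumptions: merging runs before archiving, archived or not-yet-evaluated adapters do not count toward the threshold of 5, and a merged adapter starts with no fitness score.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

MIN_ADAPTERS = 5      # task types need at least this many adapters to merge
TOP_K = 3             # number of parents selected for a merge
ARCHIVE_BELOW = 0.3   # adapters with fitness under this value are archived


@dataclass
class Adapter:
    adapter_id: str
    task_type: str
    fitness: Optional[float]  # None means "not evaluated yet"
    generation: int = 0
    archived: bool = False


def evolution_sweep(adapters: list[Adapter]) -> list[Adapter]:
    """Return newly merged adapters and mark low-fitness ones as archived."""
    by_task: dict[str, list[Adapter]] = defaultdict(list)
    for adapter in adapters:
        if not adapter.archived and adapter.fitness is not None:
            by_task[adapter.task_type].append(adapter)

    merged: list[Adapter] = []
    for task_type, group in by_task.items():
        if len(group) < MIN_ADAPTERS:
            continue
        parents = sorted(group, key=lambda a: a.fitness, reverse=True)[:TOP_K]
        merged.append(
            Adapter(
                adapter_id="merged-" + "+".join(p.adapter_id for p in parents),
                task_type=task_type,
                fitness=None,  # the weight merge and evaluation happen elsewhere
                generation=max(p.generation for p in parents) + 1,
            )
        )

    for adapter in adapters:
        if adapter.fitness is not None and adapter.fitness < ARCHIVE_BELOW:
            adapter.archived = True
    return merged
```

For example, five adapters of one task type whose top three parents have generations 0, 2, and 1 would yield a single merged adapter with generation 3.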