API Reference for model-training¶
model_training ¶
Training pipelines for LoRA fine-tuning and trajectory management.
Classes¶
D2LTrainConfig ¶
Bases: BaseModel
Pydantic model for D2L training hyperparameters.
Enables validation, JSON serialization (for checkpoint storage), and
.model_dump() for MLflow experiment logging.
Attributes:

| Name | Type | Description |
|---|---|---|
| base_model_name | str | HuggingFace model name for the student/teacher base. |
| sakana_checkpoint_path | str | Path to the Sakana hypernet checkpoint. |
| num_steps | int | Total training steps. |
| lr | float | Learning rate for AdamW optimizer. |
| alpha | float | Blending weight for KL vs CE loss (1.0 = pure KL, 0.0 = pure CE). |
| temperature | float | Softmax temperature for KL divergence computation. |
| checkpoint_every | int | Steps between lightweight checkpoint saves. |
| full_checkpoint_every | int | Steps between full checkpoint saves (incl. optimizer). |
| checkpoint_dir | str | Directory for checkpoint output. |
| experiment_name | str | MLflow experiment name. |
| dry_run | bool | If True, validate tensor shapes then exit. |
| smoke_test | bool | If True, run 5 steps and verify loss trend. |
| dataset_path | str \| None | Path to training JSONL file (required for full training). |
| grad_clip | float | Gradient clipping max norm. |
| warmup_steps | int | Number of linear LR warmup steps. |
| lora_r | int | LoRA rank. |
| max_length | int | Maximum tokenizer sequence length. |
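The alpha and temperature fields govern the distillation objective. The sketch below illustrates how such a blended KL/CE loss and temperature-scaled softmax are typically combined; it is an illustration of the documented field semantics, not the library's actual training loop.

```python
import math

def blended_loss(kl_loss: float, ce_loss: float, alpha: float) -> float:
    # alpha = 1.0 -> pure KL distillation; alpha = 0.0 -> pure cross-entropy.
    return alpha * kl_loss + (1.0 - alpha) * ce_loss

def softmax(logits: list[float], temperature: float) -> list[float]:
    # Higher temperature flattens the distribution before the KL term is computed.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

loss = blended_loss(kl_loss=2.0, ce_loss=4.0, alpha=0.75)  # 0.75*2.0 + 0.25*4.0 = 2.5
```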
Functions¶
train_d2l_qwen3 ¶
train_d2l_qwen3(config: D2LTrainConfig) -> dict[str, Any]
Run KL-divergence context distillation training.
Three execution modes controlled by config flags:

- dry_run=True: Validate shapes with single forward pass, no optimizer step.
- smoke_test=True: Run min(num_steps, 5) steps, assert finite decreasing loss.
- default: Full training from dataset with checkpointing and MLflow tracking.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config | D2LTrainConfig | Training configuration. | required |
Returns:

| Type | Description |
|---|---|
| dict[str, Any] | Dictionary with training results: final_loss (loss at the last step), best_loss (lowest loss seen during training), num_steps_completed (number of training steps completed), checkpoint_dir (path to checkpoint directory), shape_summary (dry_run only; tensor shape validation results). |
Source code in libs/model-training/src/model_training/d2l_train.py
format_for_sft ¶
format_for_sft(
trajectory: dict[str, Any],
) -> list[dict[str, str]]
Convert a trajectory into SFT-compatible chat format.
Only successful trajectories (outcome == 'success') produce output. Extracts the final step where tests_passed is True as the assistant message.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| trajectory | dict[str, Any] | A trajectory dict as returned by load_trajectory. | required |
Returns:

| Type | Description |
|---|---|
| list[dict[str, str]] | A list of 3 message dicts ([system, user, assistant]) for successful trajectories, or an empty list if the trajectory did not succeed. |
Source code in libs/model-training/src/model_training/trajectory.py
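To make the contract concrete, here is a hypothetical re-implementation of the documented behavior on a toy trajectory. The system prompt text and the generated_code step field are assumptions for illustration; only the outcome/tests_passed logic and the three-message shape come from the docstring above.

```python
from typing import Any

def format_for_sft_sketch(trajectory: dict[str, Any]) -> list[dict[str, str]]:
    # Only successful trajectories produce output.
    if trajectory.get("outcome") != "success":
        return []
    # The final step where tests_passed is True becomes the assistant message.
    passing = [s for s in trajectory["steps"] if s.get("tests_passed")]
    if not passing:
        return []
    final = passing[-1]
    return [
        {"role": "system", "content": "You are a coding assistant."},  # assumed prompt
        {"role": "user", "content": trajectory.get("task_description", "")},
        {"role": "assistant", "content": final.get("generated_code", "")},
    ]

traj = {
    "outcome": "success",
    "task_description": "Write add(a, b).",
    "steps": [
        {"tests_passed": False, "generated_code": "def add(a): ..."},
        {"tests_passed": True, "generated_code": "def add(a, b): return a + b"},
    ],
}
messages = format_for_sft_sketch(traj)
```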
load_trajectory ¶
load_trajectory(trajectory_id: str) -> dict[str, Any]
Load a stored trajectory by session ID.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| trajectory_id | str | The session ID used as the filename (without .json). | required |
Returns:

| Type | Description |
|---|---|
| dict[str, Any] | A dict containing the full trajectory data including steps and metadata. |

Raises:

| Type | Description |
|---|---|
| FileNotFoundError | If no trajectory file exists for the given ID. |
Source code in libs/model-training/src/model_training/trajectory.py
record_trajectory ¶
record_trajectory(
session_id: str,
steps: list[dict[str, Any]],
outcome: Optional[str] = None,
*,
task_description: str = "",
task_type: str = "",
adapter_ids: list[str] | None = None,
) -> dict[str, Any]
Persist a coding session trajectory to disk for future distillation.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| session_id | str | Unique identifier for the coding session. | required |
| steps | list[dict[str, Any]] | List of step dicts, each containing attempt results. | required |
| outcome | Optional[str] | Final session result ('success', 'exhausted', or None). | None |
| task_description | str | Natural language description of the coding task. | '' |
| task_type | str | Category of task (e.g. 'function', 'class', 'refactor'). | '' |
| adapter_ids | list[str] \| None | LoRA adapter IDs used during the session. | None |

Returns:

| Type | Description |
|---|---|
| dict[str, Any] | A dict with 'session_id' and 'file_path' keys. |
Source code in libs/model-training/src/model_training/trajectory.py
Modules¶
config ¶
Training configuration for LoRA fine-tuning.
No GPU imports required — this module is pure dict construction and validation.
Functions¶
get_training_config ¶
get_training_config(
task_type: str,
rank: int = 64,
epochs: int = 3,
learning_rate: float = 0.0002,
) -> dict[str, Any]
Return a training configuration dict with hyperparameters.
Generates a configuration appropriate for the given task type, with sensible defaults for QLoRA fine-tuning.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| task_type | str | Task category (e.g. 'bug-fix', 'feature-impl'). | required |
| rank | int | LoRA rank for the adapter. | 64 |
| epochs | int | Number of training epochs. | 3 |
| learning_rate | float | Learning rate for the optimizer. | 0.0002 |

Returns:

| Type | Description |
|---|---|
| dict[str, Any] | A dict containing all training hyperparameters. |
Example:

```python
>>> config = get_training_config("bug-fix", rank=64, epochs=3)
>>> config["task_type"]
'bug-fix'
```
Source code in libs/model-training/src/model_training/config.py
validate_config ¶
validate_config(config: dict[str, Any]) -> bool
Validate training configuration fields and value ranges.
Checks that all required fields are present and their values fall within acceptable ranges.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config | dict[str, Any] | A training configuration dict to validate. | required |

Returns:

| Type | Description |
|---|---|
| bool | True if the configuration is valid. |

Raises:

| Type | Description |
|---|---|
| ValueError | If required keys are missing or values are out of range. |
Example:

```python
>>> valid = validate_config({"task_type": "bug-fix", "rank": 64,
...                          "epochs": 3, "learning_rate": 2e-4})
>>> valid
True
```
Source code in libs/model-training/src/model_training/config.py
d2l_config ¶
Config helpers for Qwen3-Coder-Next hypernetwork training.
Requires transformers>=5.0 for Qwen3NextConfig (hybrid linear/full attention architecture). transformers 5.3.0 is installed in this project.
All heavy imports (transformers, ctx_to_lora, peft) are deferred to function bodies per project convention (INFRA-05) to avoid GPU imports at module level.
Functions¶
get_d2l_qwen3_config ¶
get_d2l_qwen3_config() -> dict[str, Any]
Return Qwen3-Coder-Next architecture dimensions without loading model weights.
Uses Qwen3NextConfig defaults which exactly match Qwen3-Coder-Next specs:

- hidden_size: 2048
- num_hidden_layers: 48 (12 full_attention + 36 linear_attention)
- num_attention_heads: 16 (Q heads), num_key_value_heads: 2 (GQA KV)
- head_dim: 256
- full_attention layer indices: [3, 7, 11, 15, 19, 23, 27, 31, 35, 39, 43, 47]
- vocab_size: 151936
- model_type: "qwen3_next"
Returns:

| Type | Description |
|---|---|
| dict[str, Any] | Dict with keys: hidden_size, num_hidden_layers, num_attention_heads, num_key_value_heads, head_dim, attention_layer_indices, vocab_size, model_type. |
Source code in libs/model-training/src/model_training/d2l_config.py
build_qwen3_hypernet_config ¶
build_qwen3_hypernet_config(
lora_r: int = 8,
target_modules: list[str] | None = None,
aggregator_config: Any = None,
) -> Any
Construct HypernetConfig targeting Qwen3-Coder-Next attention layers.
Discovers full_attention layer indices dynamically from Qwen3NextConfig.layer_types. Result has exactly 12 layer indices matching the Qwen3-Coder-Next architecture.
Phase 26 probe cache integration: if a probe cache exists for QWEN3_NEXT_CANONICAL_NAME, uses real per-projection in/out dimensions for feature_sizes. Falls back to hidden_size placeholder when no cache is found (e.g., in CI where the model has not been probed).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| lora_r | int | LoRA rank for the adapter. Defaults to 8. | 8 |
| target_modules | list[str] \| None | LoRA target module names. Defaults to ["q_proj", "v_proj"]. | None |
| aggregator_config | Any | Perceiver aggregator config from a Sakana checkpoint. If None (default / Phase 25 CI), HypernetConfig is built with aggregator_config=None as placeholder. Phase 29 populates this via get_aggregator_config() with a loaded model. | None |

Returns:

| Type | Description |
|---|---|
| Any | HypernetConfig with layer_indices set to the 12 full_attention indices and base_hidden_size=2048. |
Source code in libs/model-training/src/model_training/d2l_config.py
d2l_data ¶
Data pipeline for KL-divergence context distillation training.
Provides functions for:

- Converting trajectories to distillation records (activation/teacher split)
- Generating needle-in-haystack synthetic datasets for CI smoke testing
- JSONL persistence (save/load round-trip)
- Task-ID-level train/test splitting
Functions¶
save_jsonl ¶
save_jsonl(
records: list[dict[str, Any]], path: str | Path
) -> None
Persist a list of dicts to a JSONL file.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| records | list[dict[str, Any]] | List of JSON-serializable dicts to save. | required |
| path | str \| Path | File path for output. Parent directories are created if needed. | required |
Source code in libs/model-training/src/model_training/d2l_data.py
load_jsonl ¶
load_jsonl(path: str | Path) -> list[dict[str, Any]]
Load records from a JSONL file.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | str \| Path | Path to the JSONL file. | required |

Returns:

| Type | Description |
|---|---|
| list[dict[str, Any]] | List of dicts, one per non-empty line. |
Source code in libs/model-training/src/model_training/d2l_data.py
format_for_distillation ¶
format_for_distillation(
trajectory: dict[str, Any],
) -> list[dict[str, str]]
Convert a trajectory to distillation records with activation/teacher split.
Each record has:

- activation_text: trajectory context + task description (NO answer tokens)
- teacher_text: trajectory context + task description + answer
- task_id: identifier for train/test splitting
Only successful trajectories (outcome == 'success') produce records.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| trajectory | dict[str, Any] | Trajectory dict with task_id/session_id, task_description, steps, and outcome fields. | required |

Returns:

| Type | Description |
|---|---|
| list[dict[str, str]] | List of distillation record dicts, or empty list if no successful outcome. |
Source code in libs/model-training/src/model_training/d2l_data.py
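A hypothetical sketch of the activation/teacher split on a toy trajectory. The step fields description and generated_code, and the exact joining of context and task, are assumptions; the key property from the docstring is that activation_text excludes answer tokens while teacher_text includes them.

```python
from typing import Any

def to_distillation_records(trajectory: dict[str, Any]) -> list[dict[str, str]]:
    # Only successful trajectories produce records.
    if trajectory.get("outcome") != "success":
        return []
    context = "\n".join(s["description"] for s in trajectory["steps"])
    prefix = context + "\n" + trajectory["task_description"]   # no answer tokens
    answer = trajectory["steps"][-1]["generated_code"]
    return [{
        "activation_text": prefix,
        "teacher_text": prefix + "\n" + answer,                 # prefix + answer
        "task_id": trajectory.get("task_id", trajectory.get("session_id", "")),
    }]

record = to_distillation_records({
    "task_id": "t1",
    "outcome": "success",
    "task_description": "Implement add(a, b).",
    "steps": [{"description": "first attempt", "generated_code": "def add(a, b): return a + b"}],
})[0]
```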
normalize_mined_trajectory ¶
normalize_mined_trajectory(
mined: dict[str, Any],
) -> dict[str, Any]
Convert a GitHub-mined trajectory dict into distillation-ready format.
Maps the mining pipeline's output (PR/issue metadata with commit and review steps) into the schema expected by format_for_distillation.

Outcome mapping:

- pr_* task_ids: "merged" -> "success", else "failure"
- issue_* task_ids: "closed" -> "success", else "failure"
- Fallback: "merged" or "closed" -> "success", else "failure"

Step mapping:

- Commit steps get a [Commit] prefix in the description and their content as generated_code.
- Review steps get a [Review] prefix with content inlined into the description and an empty generated_code.
- Only the last commit step receives tests_passed=True and canonical_solution (and only when the overall outcome is success).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| mined | dict[str, Any] | Dict from the GitHub mining pipeline. | required |

Returns:

| Type | Description |
|---|---|
| dict[str, Any] | Trajectory dict ready for format_for_distillation. |
Source code in libs/model-training/src/model_training/d2l_data.py
generate_needle_dataset ¶
generate_needle_dataset(
n: int = 20,
) -> list[dict[str, str]]
Generate needle-in-haystack records for CI smoke testing.
Records are deterministic (no randomness, no LLM). Each record contains a code fact embedded in a function/class context with a query and answer.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| n | int | Number of records to generate. Cycles through templates if n exceeds the number of available templates. | 20 |

Returns:

| Type | Description |
|---|---|
| list[dict[str, str]] | List of n record dicts with activation_text, teacher_text, and task_id. |
Source code in libs/model-training/src/model_training/d2l_data.py
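The deterministic template-cycling behavior ("cycles through templates if n exceeds the number of available templates") can be sketched as below. The templates and fact values are invented for illustration; only the no-randomness and cycling properties come from the documentation.

```python
# Each template is a (code_context, query) pair; {fact} is the embedded "needle".
TEMPLATES = [
    ("def get_port():\n    return {fact}", "What does get_port return?"),
    ("class Config:\n    timeout = {fact}", "What is Config.timeout?"),
    ("RETRY_LIMIT = {fact}", "What is RETRY_LIMIT?"),
]

def generate_needle_sketch(n: int = 20) -> list[dict[str, str]]:
    records = []
    for i in range(n):
        code_tmpl, query = TEMPLATES[i % len(TEMPLATES)]  # cycle when n > len(TEMPLATES)
        fact = str(1000 + i)  # deterministic, no randomness or LLM
        context = code_tmpl.format(fact=fact)
        records.append({
            "activation_text": context + "\n" + query,
            "teacher_text": context + "\n" + query + "\n" + fact,
            "task_id": f"needle_{i}",
        })
    return records

records = generate_needle_sketch(5)
```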
generate_trajectory_dataset ¶
generate_trajectory_dataset(
source: str = "humaneval", max_tasks: int | None = None
) -> list[dict[str, str]]
Generate trajectory records from a coding task dataset.
Each record has activation_text (prompt only, no solution) and teacher_text (prompt + canonical solution), making it ready for KL-divergence distillation.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| source | str | Dataset source identifier. Currently only "humaneval" is supported. | 'humaneval' |
| max_tasks | int \| None | Maximum number of tasks to process. If None, all tasks are used. | None |

Returns:

| Type | Description |
|---|---|
| list[dict[str, str]] | List of trajectory record dicts with task_id, activation_text, and teacher_text fields. |

Raises:

| Type | Description |
|---|---|
| ValueError | If an unsupported source is specified. |
Source code in libs/model-training/src/model_training/d2l_data.py
augment_trajectories ¶
augment_trajectories(
trajectories: list[dict[str, Any]],
n_variants: int = 3,
model: str = "qwen2.5-coder:1.5b",
ollama_base_url: str | None = None,
) -> list[dict[str, Any]]
Produce LLM-augmented variants of trajectory records.
For each input trajectory, generates up to n_variants augmented records using an Ollama LLM. Augmented records always inherit the source task_id to preserve split integrity when mixed with originals.
Augmentation strategies (up to n_variants selected in order):

1. Paraphrase the task description
2. Reorder/drop steps in the trajectory
3. Rename variables throughout the trajectory
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| trajectories | list[dict[str, Any]] | List of trajectory record dicts with task_id, activation_text, and teacher_text fields. | required |
| n_variants | int | Number of augmented variants to produce per trajectory. Maximum 3 (one per augmentation strategy). | 3 |
| model | str | Ollama model identifier to use for augmentation. | 'qwen2.5-coder:1.5b' |
| ollama_base_url | str \| None | Ollama base URL. Defaults to "http://localhost:11434". | None |

Returns:

| Type | Description |
|---|---|
| list[dict[str, Any]] | List of augmented trajectory dicts. Each record has the same task_id as its source trajectory, with LLM-generated activation_text and teacher_text. |
Source code in libs/model-training/src/model_training/d2l_data.py
split_by_task_id ¶
split_by_task_id(
records: list[dict[str, Any]],
test_fraction: float = 0.2,
seed: int = 42,
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]
Split records at task-ID boundary with no task_id crossing train/test.
Augmented records that share a task_id are always assigned to the same partition, preventing task-family leakage.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| records | list[dict[str, Any]] | List of record dicts, each with a 'task_id' field. | required |
| test_fraction | float | Fraction of unique task_ids to assign to the test set. Minimum 1 task_id goes to test even if the fraction rounds to 0. | 0.2 |
| seed | int | Random seed for reproducible splits. | 42 |

Returns:

| Type | Description |
|---|---|
| tuple[list[dict[str, Any]], list[dict[str, Any]]] | Tuple of (train_records, test_records) where task_ids never overlap. |
Source code in libs/model-training/src/model_training/d2l_data.py
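A minimal sketch of the documented contract (split at task_id boundaries, seeded for reproducibility, at least one test task); the real implementation may differ in detail.

```python
import random
from typing import Any

def split_by_task_id_sketch(
    records: list[dict[str, Any]],
    test_fraction: float = 0.2,
    seed: int = 42,
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    task_ids = sorted({r["task_id"] for r in records})
    rng = random.Random(seed)  # seeded shuffle for reproducible splits
    rng.shuffle(task_ids)
    n_test = max(1, round(len(task_ids) * test_fraction))  # at least one test task
    test_ids = set(task_ids[:n_test])
    # All records sharing a task_id land in the same partition (no family leakage).
    train = [r for r in records if r["task_id"] not in test_ids]
    test = [r for r in records if r["task_id"] in test_ids]
    return train, test

# Four records per task family, as augmented variants sharing a task_id would be.
records = [{"task_id": f"t{i % 5}", "x": i} for i in range(20)]
train, test = split_by_task_id_sketch(records)
```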
normalize_mined_pairs ¶
normalize_mined_pairs(
trajectory: dict[str, Any],
compress: bool = True,
max_diff_lines: int = 500,
language: str | None = None,
) -> list[dict[str, Any]]
Convert a mined PR trajectory into per-step training pairs.
Each review-to-revision cycle becomes one training record with
activation_text (task + current code + review feedback) and
teacher_text (activation + revision diff). Compatible with
augment_trajectories, split_by_task_id, and save_jsonl.
The algorithm groups contiguous commits and reviews into blocks, then pairs each reviews-block with the following commits-block. Multiple commits in a block: the last commit is used (the state the reviewer actually saw). Multiple reviews: concatenated.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| trajectory | dict[str, Any] | Raw mined trajectory from mine_pr_diff_chains. | required |
| compress | bool | Apply diff compression via compress_diff. | True |
| max_diff_lines | int | Max lines per compressed diff. | 500 |
| language | str \| None | Language tag for metadata (from repos config). | None |

Returns:

| Type | Description |
|---|---|
| list[dict[str, Any]] | List of training pair records with task_id, activation_text, teacher_text, and metadata fields. |
Source code in libs/model-training/src/model_training/d2l_data.py
d2l_diff ¶
RTK-style diff compression for training data preparation.
Filters irrelevant files (lockfiles, generated code, binary assets, build artifacts) and truncates large diffs to minimize token overhead in hypernetwork training pairs.
Operates on the concatenated diff format produced by mine_pr_diff_chains:

```
--- src/main.py ---
+real code
--- package-lock.json ---
+lockfile noise
```
Functions¶
compress_diff ¶
compress_diff(content: str, max_lines: int = 500) -> str
Filter irrelevant files and truncate large diffs.
Parses the --- filename --- section format produced by
mine_pr_diff_chains and removes lockfiles, generated code,
binary files, and build artifacts.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| content | str | Concatenated diff content with section headers. | required |
| max_lines | int | Maximum total output lines. | 500 |

Returns:

| Type | Description |
|---|---|
| str | Filtered and truncated diff string. |
Source code in libs/model-training/src/model_training/d2l_diff.py
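The section-header parsing and filtering can be sketched as below. The skip patterns here are assumptions for illustration; the module's actual filter list is not shown in this reference.

```python
import re

# Assumed filter list for illustration only.
SKIP_PATTERNS = [r"package-lock\.json$", r"\.min\.js$", r"\.lock$", r"^dist/"]

def compress_diff_sketch(content: str, max_lines: int = 500) -> str:
    # re.split with a capture group yields [preamble, name1, body1, name2, body2, ...]
    parts = re.split(r"(?m)^--- (.+?) ---$", content)
    out_lines: list[str] = []
    for name, body in zip(parts[1::2], parts[2::2]):
        if any(re.search(p, name) for p in SKIP_PATTERNS):
            continue  # drop lockfiles, generated code, build artifacts
        out_lines.append(f"--- {name} ---")
        out_lines.extend(body.strip().splitlines())
    return "\n".join(out_lines[:max_lines])  # truncate large diffs

diff = "--- src/main.py ---\n+real code\n--- package-lock.json ---\n+lockfile noise"
result = compress_diff_sketch(diff)
```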
d2l_lora ¶
Functional LoRA injection via context manager for hypernetwork training.
Patches transformer attention projection modules with F.linear forward functions that carry live hypernetwork tensor graph nodes, preserving autograd continuity through A and B matrices back to the hypernetwork head.
Unlike PEFT's get_peft_model (which severs the autograd graph by copying tensors into new nn.Parameter objects), this approach uses closures over the original A/B tensors so that loss.backward() propagates gradients through the LoRA path all the way to the hypernetwork parameters.
All heavy GPU imports (torch, transformers) are deferred to function bodies per INFRA-05 project convention.
Functions¶
apply_functional_lora ¶
apply_functional_lora(
model: Any, lora_dict: dict[str, Any], hc: Any
) -> _FunctionalLoRAContext
Create a context manager that patches model with functional LoRA.
Usage:

```python
with apply_functional_lora(base_model, lora_dict, hc):
    output = base_model(input_ids)
    loss = criterion(output, target)
    loss.backward()  # gradients flow through A/B to hypernetwork
```
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| model | Any | Base transformer model (nn.Module). | required |
| lora_dict | dict[str, Any] | Dict from HyperLoRA.generate_weights(). Structure: lora_dict[proj_name]["A"] has shape (batch=1, n_layers, r, d_in); lora_dict[proj_name]["B"] has shape (batch=1, n_layers, r, d_out). Keys match hc.lora_config.target_modules. Batch dimension is always 1 (squeezed at index 0). | required |
| hc | Any | HypernetConfig with attributes: hc.lora_config.target_modules (list of projection names), hc.lora_config.r (LoRA rank), hc.lora_config.lora_alpha (scaling numerator), hc.layer_indices (list of absolute layer indices). | required |

Returns:

| Type | Description |
|---|---|
| _FunctionalLoRAContext | _FunctionalLoRAContext to use as context manager. |
Source code in libs/model-training/src/model_training/d2l_lora.py
d2l_mining ¶
GitHub trajectory mining for coding session distillation.
Mines PR diff chains and issue-commit chains from GitHub repositories, producing trajectory dicts suitable for normalization and distillation. Designed to run on an L4 VM with network access and a GITHUB_TOKEN.
Classes¶
Functions¶
search_quality_prs ¶
search_quality_prs(
repo: str,
max_results: int = 100,
github_token: str | None = None,
min_review_comments: int = 1,
min_commits: int = 2,
exclude_labels: list[str] | None = None,
) -> list[int]
Search for high-quality merged PRs using the GitHub Search API.
Pre-filters PRs by review approval, comment count, label exclusion, and minimum commit count to identify PRs with meaningful review trajectories suitable for distillation.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| repo | str | GitHub repository in "owner/repo" format. | required |
| max_results | int | Maximum number of qualifying PR numbers to return. | 100 |
| github_token | str \| None | Personal access token for GitHub API authentication. | None |
| min_review_comments | int | Minimum number of comments for search query. | 1 |
| min_commits | int | Minimum number of commits a PR must have. | 2 |
| exclude_labels | list[str] \| None | Labels to exclude. Defaults to common non-code labels. | None |

Returns:

| Type | Description |
|---|---|
| list[int] | List of qualifying PR numbers. |
Source code in libs/model-training/src/model_training/d2l_mining.py
mine_pr_diff_chains ¶
mine_pr_diff_chains(
repo: str,
max_prs: int = 100,
github_token: str | None = None,
pr_numbers: list[int] | None = None,
) -> list[dict[str, Any]]
Extract PR diff chains from a GitHub repository.
Each chain represents an iterative coding session: initial commit -> review comments -> revision commits. The resulting trajectory records capture the back-and-forth of code review as a multi-step improvement process, suitable for distillation.
Returns trajectory dicts with the following fields:

- task_id: f"pr_{repo}_{pr_number}"
- task_description: PR title concatenated with body text
- steps: list of commit diffs and review comments in chronological order
- outcome: "merged" or "closed" depending on PR final state
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| repo | str | GitHub repository in "owner/repo" format. | required |
| max_prs | int | Maximum number of PRs to process. Defaults to 100. | 100 |
| github_token | str \| None | Personal access token for GitHub API authentication. | None |
| pr_numbers | list[int] \| None | Optional list of specific PR numbers to mine. When provided, skips the paginated PR list fetch and fetches each PR individually. | None |

Returns:

| Type | Description |
|---|---|
| list[dict[str, Any]] | List of trajectory dicts representing PR diff chains. |
Source code in libs/model-training/src/model_training/d2l_mining.py
mine_issue_commit_chains ¶
mine_issue_commit_chains(
repo: str,
max_issues: int = 100,
github_token: str | None = None,
) -> list[dict[str, Any]]
Link GitHub issues to their fixing commits via commit message references.
Scans commit messages for "fixes #N", "closes #N", or "resolves #N" patterns to identify which commits address which issues. Groups linked commits as trajectory steps for distillation.
Returns trajectory dicts with the following fields:

- task_id: f"issue_{repo}_{issue_number}"
- task_description: issue title concatenated with body text
- steps: list of commits referencing this issue in chronological order
- outcome: "closed" or "open" from the issue state
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| repo | str | GitHub repository in "owner/repo" format. | required |
| max_issues | int | Maximum number of issues to process. Defaults to 100. | 100 |
| github_token | str \| None | Personal access token for GitHub API authentication. | None |

Returns:

| Type | Description |
|---|---|
| list[dict[str, Any]] | List of trajectory dicts representing issue-commit chains. |
Source code in libs/model-training/src/model_training/d2l_mining.py
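The "fixes #N" / "closes #N" / "resolves #N" scan described above can be sketched with a single regular expression. This is illustrative; the mining code's exact pattern may differ.

```python
import re

# Closing keywords recognized per the docstring, case-insensitive.
ISSUE_REF = re.compile(r"\b(?:fixes|closes|resolves)\s+#(\d+)", re.IGNORECASE)

def linked_issues(commit_message: str) -> list[int]:
    # Return issue numbers referenced via closing keywords, in order of appearance.
    return [int(n) for n in ISSUE_REF.findall(commit_message)]

refs = linked_issues("Fixes #12 and resolves #34; see #56")
```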
d2l_prep ¶
Data preparation pipeline for context distillation training.
Converts raw trajectory JSON files into a training JSONL by calling format_for_distillation on each trajectory and persisting the resulting records via save_jsonl.
Usage (CLI):

```
uv run python -m model_training.d2l_prep traj1.json traj2.json -o train.jsonl
```
Functions¶
prepare_training_jsonl ¶
prepare_training_jsonl(
input_paths: list[Path], output_path: Path
) -> int
Convert trajectory JSON files to a training JSONL.
Reads each input file, calls format_for_distillation on every trajectory, collects all returned records, and writes them to output_path via save_jsonl.
Failed trajectories (outcome != 'success') are filtered by format_for_distillation and produce zero records — they do not raise.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input_paths | list[Path] | Trajectory JSON files, each containing a single trajectory dict or a JSON array of trajectory dicts. | required |
| output_path | Path | Destination JSONL file. Parent directories are created automatically. File is always written (may be empty). | required |

Returns:

| Type | Description |
|---|---|
| int | Number of records written. |
Source code in libs/model-training/src/model_training/d2l_prep.py
d2l_probe ¶
Architecture probe and activation extraction for hypernetwork training.
Discovers standard attention layers (those with q_proj/k_proj/v_proj/o_proj children) via model.named_modules(), caches results to JSON, and provides extract_activations_with_model() that accepts a pre-loaded model and tokenizer.
Phase 26 purpose: eliminate hidden_size placeholders and per-call model loading. The probe becomes the single source of truth for layer indices and projection dimensions across the v7.0 pipeline.
All heavy GPU imports (torch, transformers) are deferred to function bodies per INFRA-05 project convention.
Functions¶
probe_model ¶
probe_model(model: Any) -> dict[str, Any]
Probe a model's architecture to discover standard attention layers.
Iterates model.named_modules() to find layers that have all four attention projection children (q_proj, k_proj, v_proj, o_proj). DeltaNet and other linear-attention layers that lack these projections are skipped.
For each discovered attention layer, captures the in/out dimensions of q_proj, k_proj, v_proj, and o_proj weights.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model` | `Any` | Any nn.Module (typically a transformer model). | required |

Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | Dict with keys: `attention_layer_indices` — sorted list of int layer indices; `feature_sizes` — dict mapping projection name to `{"in": int, "out": int}`. |

Source code in libs/model-training/src/model_training/d2l_probe.py
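The grouping logic — keep only layers that have all four projections — can be sketched over module names alone. The real probe walks live `model.named_modules()` objects and also records weight dimensions; this sketch assumes the conventional `model.layers.N.self_attn.<proj>` naming and only illustrates the filtering.

```python
import re

PROJECTIONS = {"q_proj", "k_proj", "v_proj", "o_proj"}

def attention_layers_from_names(module_names: list[str]) -> dict:
    """Group projection modules by layer index; keep layers with all four projections."""
    found: dict[int, set] = {}
    for name in module_names:
        m = re.search(r"\.layers\.(\d+)\.self_attn\.(\w+)$", name)
        if m and m.group(2) in PROJECTIONS:
            found.setdefault(int(m.group(1)), set()).add(m.group(2))
    # DeltaNet-style layers missing any projection are dropped here.
    indices = sorted(i for i, projs in found.items() if projs == PROJECTIONS)
    return {"attention_layer_indices": indices}
```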
save_probe_cache ¶
save_probe_cache(
model_name: str, probe_result: dict[str, Any]
) -> Path
Persist probe results to JSON cache.
Adds metadata fields (model_name, model_name_hash, probed_at) to a copy of probe_result. Creates PROBE_CACHE_DIR if it does not exist.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model_name` | `str` | Canonical model identifier (used for cache lookup key). | required |
| `probe_result` | `dict[str, Any]` | Output from probe_model(). | required |

Returns:

| Type | Description |
|---|---|
| `Path` | Path to the written JSON file. |

Source code in libs/model-training/src/model_training/d2l_probe.py
load_probe_cache ¶
load_probe_cache(model_name: str) -> dict[str, Any] | None
Load probe results from JSON cache.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model_name` | `str` | Canonical model identifier. | required |

Returns:

| Type | Description |
|---|---|
| `dict[str, Any] \| None` | Probe result dict (including metadata fields) if cached, else None. Never raises — returns None on any miss. |

Source code in libs/model-training/src/model_training/d2l_probe.py
extract_activations_with_model ¶
extract_activations_with_model(
text: str,
model: Any,
tokenizer: Any,
layer_indices: list[int] | None = None,
model_name: str | None = None,
max_length: int = 512,
) -> tuple[Any, Any]
Extract per-layer hidden state activations from a pre-loaded model.
Runs text through the model with output_hidden_states=True and stacks activations from the specified layer indices. Uses hidden_states[i] directly (no +1 offset) — consistent with existing sakana_d2l.py convention.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | Input text to tokenize and process. | required |
| `model` | `Any` | Pre-loaded nn.Module in eval mode. | required |
| `tokenizer` | `Any` | Pre-loaded tokenizer. | required |
| `layer_indices` | `list[int] \| None` | Which hidden state indices to extract. If None, loads from the probe cache via model_name. | `None` |
| `model_name` | `str \| None` | Canonical model name for cache lookup (required when layer_indices is None). | `None` |
| `max_length` | `int` | Max token sequence length. | `512` |

Returns:

| Type | Description |
|---|---|
| `tuple[Any, Any]` | Tuple of (features, attention_mask). features shape: (1, num_layers, seq_len, hidden_dim); attention_mask shape: (1, seq_len). |

Raises:

| Type | Description |
|---|---|
| `RuntimeError` | If layer_indices is None and no probe cache exists for model_name. |

Source code in libs/model-training/src/model_training/d2l_probe.py
d2l_train ¶
KL-divergence context distillation training loop for Qwen3-Coder-Next.
Assembles all Phase 25-28 components (config, data pipeline, activation extraction, weight transfer, functional LoRA injection) into a complete distillation training script.
Three execution modes:

- dry-run: loads real base model + hypernet, validates tensor shapes, exits
- smoke-test: 5 training steps, verifies finite loss and decreasing trend
- full: trains from JSONL dataset with tiered checkpointing and MLflow tracking
All heavy GPU imports (torch, transformers, peft) are deferred to function bodies per INFRA-05 project convention.
Usage
uv run python -m model_training.d2l_train --dry-run
uv run python -m model_training.d2l_train --smoke-test
uv run python -m model_training.d2l_train --dataset path/to/train.jsonl
Classes¶
D2LTrainConfig ¶
Bases: BaseModel
Pydantic model for D2L training hyperparameters.
Enables validation, JSON serialization (for checkpoint storage), and
.model_dump() for MLflow experiment logging.
Attributes:

| Name | Type | Description |
|---|---|---|
| `base_model_name` | `str` | HuggingFace model name for the student/teacher base. |
| `sakana_checkpoint_path` | `str` | Path to the Sakana hypernet checkpoint. |
| `num_steps` | `int` | Total training steps. |
| `lr` | `float` | Learning rate for the AdamW optimizer. |
| `alpha` | `float` | Blending weight for KL vs CE loss (1.0 = pure KL, 0.0 = pure CE). |
| `temperature` | `float` | Softmax temperature for KL divergence computation. |
| `checkpoint_every` | `int` | Steps between lightweight checkpoint saves. |
| `full_checkpoint_every` | `int` | Steps between full checkpoint saves (incl. optimizer). |
| `checkpoint_dir` | `str` | Directory for checkpoint output. |
| `experiment_name` | `str` | MLflow experiment name. |
| `dry_run` | `bool` | If True, validate tensor shapes then exit. |
| `smoke_test` | `bool` | If True, run 5 steps and verify the loss trend. |
| `dataset_path` | `str \| None` | Path to the training JSONL file (required for full training). |
| `grad_clip` | `float` | Gradient clipping max norm. |
| `warmup_steps` | `int` | Number of linear LR warmup steps. |
| `lora_r` | `int` | LoRA rank. |
| `max_length` | `int` | Maximum tokenizer sequence length. |
Functions¶
train_d2l_qwen3 ¶
train_d2l_qwen3(config: D2LTrainConfig) -> dict[str, Any]
Run KL-divergence context distillation training.
Three execution modes controlled by config flags:

- dry_run=True: Validate shapes with a single forward pass, no optimizer step.
- smoke_test=True: Run min(num_steps, 5) steps, assert finite decreasing loss.
- default: Full training from the dataset with checkpointing and MLflow tracking.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `config` | `D2LTrainConfig` | Training configuration. | required |

Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | Dictionary with training results: `final_loss` — loss at the last step; `best_loss` — lowest loss seen during training; `num_steps_completed` — number of training steps completed; `checkpoint_dir` — path to the checkpoint directory; `shape_summary` (dry_run only) — tensor shape validation results. |

Source code in libs/model-training/src/model_training/d2l_train.py
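The alpha-blended objective can be sketched in pure Python on a single token position. This is a conceptual sketch, not the training code: the real loop uses torch tensors over full sequences, and any additional T² scaling of the KL term is omitted here.

```python
import math

def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits: list[float], teacher_logits: list[float],
                      target_idx: int, alpha: float = 1.0,
                      temperature: float = 2.0) -> float:
    """alpha=1.0 gives pure KL(teacher || student); alpha=0.0 gives pure CE."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    ce = -math.log(softmax(student_logits)[target_idx])  # hard-label CE at T=1
    return alpha * kl + (1.0 - alpha) * ce
```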
github_client ¶
Thin GitHub REST API client with auth, pagination, and rate-limit retry.
Designed for batch data mining on a training VM (sync httpx is fine).
Classes¶
GitHubClient ¶
GitHubClient(
token: str | None = None,
base_url: str = "https://api.github.com",
)
Minimal GitHub REST API client.
Handles authentication, paginated list endpoints, and automatic retry on rate-limit 403 responses.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
token
|
str | None
|
GitHub personal access token. Optional for public endpoints but required for private repos and higher rate limits. |
None
|
base_url
|
str
|
API base URL. Override for GitHub Enterprise. |
'https://api.github.com'
|
Initialize the client with optional auth token.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
token
|
str | None
|
GitHub personal access token. |
None
|
base_url
|
str
|
API base URL. |
'https://api.github.com'
|
Source code in libs/model-training/src/model_training/github_client.py
35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 | |
get(
path: str,
params: dict[str, Any] | None = None,
max_retries: int = 3,
) -> Any
GET a single API endpoint with rate-limit retry.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | API path relative to base_url. | required |
| `params` | `dict[str, Any] \| None` | Optional query parameters. | `None` |
| `max_retries` | `int` | Maximum number of retries on rate-limit 403. | `3` |

Returns:

| Type | Description |
|---|---|
| `Any` | Parsed JSON response body. |

Raises:

| Type | Description |
|---|---|
| `HTTPStatusError` | On non-rate-limit error responses. |

Source code in libs/model-training/src/model_training/github_client.py
get_paginated(
path: str,
params: dict[str, Any] | None = None,
max_pages: int = 10,
per_page: int = 100,
) -> list[Any]
GET a paginated list endpoint, following Link rel=next headers.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | API path relative to base_url. | required |
| `params` | `dict[str, Any] \| None` | Optional query parameters. | `None` |
| `max_pages` | `int` | Maximum number of pages to fetch. | `10` |
| `per_page` | `int` | Items per page (max 100 for most GitHub endpoints). | `100` |

Returns:

| Type | Description |
|---|---|
| `list[Any]` | Flat list of all items across all fetched pages. |

Source code in libs/model-training/src/model_training/github_client.py
hypernetwork ¶
DocToLoraHypernetwork: Perceiver-based instant LoRA adapter generation.
Generates rank-8 LoRA adapter weights from token IDs in a single forward pass. Distinct from the QLoRA gradient-descent path (Phase 21) — this produces adapters in <1s by cross-attending over token embeddings with learned latents.
IMPORTANT: All GPU imports (torch, safetensors) are deferred inside function/method bodies per INFRA-05 pattern — this module is importable in CPU-only CI.
Usage
from model_training.hypernetwork import (
    DocToLoraHypernetwork,
    save_hypernetwork_adapter,
)

model = DocToLoraHypernetwork(input_dim=DEFAULT_VOCAB_SIZE)
weights = model(token_ids)
save_hypernetwork_adapter(weights, "/tmp/adapter", "Qwen/Qwen2.5-Coder-7B")
Functions¶
load_pretrained ¶
load_pretrained(
checkpoint_path: str, device: str = "cpu", **kwargs: Any
) -> Any
Load a pretrained DocToLoraHypernetwork from a checkpoint.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `checkpoint_path` | `str` | Path to the .pt checkpoint file. | required |
| `device` | `str` | Device to load onto ('cpu', 'cuda', 'mps'). | `'cpu'` |
| `**kwargs` | `Any` | Override constructor args (input_dim, num_latents, etc.). | `{}` |

Returns:

| Type | Description |
|---|---|
| `Any` | DocToLoraHypernetwork nn.Module loaded with pretrained weights. |

Source code in libs/model-training/src/model_training/hypernetwork.py
trajectory_to_tokens ¶
trajectory_to_tokens(
trajectory_text: str,
vocab_size: int = DEFAULT_VOCAB_SIZE,
max_length: int = 2048,
) -> "torch.Tensor"
Encode trajectory text as token IDs for the hypernetwork.
Uses a simple hash-based tokenization (character trigrams mapped to vocab indices). This is intentionally simple — the hypernetwork learns its own embedding, so exact tokenization doesn't matter as long as it's consistent.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `trajectory_text` | `str` | Text to encode (plan, code diffs, test results, etc.). | required |
| `vocab_size` | `int` | Size of the hypernetwork's embedding vocabulary. | `DEFAULT_VOCAB_SIZE` |
| `max_length` | `int` | Maximum sequence length (truncates or pads). | `2048` |

Returns:

| Type | Description |
|---|---|
| `torch.Tensor` | Token ID tensor of shape (1, max_length) ready for hypernetwork forward(). |

Source code in libs/model-training/src/model_training/hypernetwork.py
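Hash-based trigram tokenization can be sketched in a few lines. This is a hypothetical illustration of the scheme, not the library function: the exact hash, the pad token value (0 here), and the vocab size are assumptions — the docstring's point is only that the mapping be deterministic, since the hypernetwork learns its own embedding.

```python
import hashlib

def trigram_token_ids(text: str, vocab_size: int = 32768,
                      max_length: int = 2048) -> list[int]:
    """Map character trigrams to vocab indices; truncate or zero-pad to max_length."""
    ids = []
    for i in range(max(len(text) - 2, 0)):
        tri = text[i:i + 3]
        # Stable hash -> bucket into the vocabulary.
        h = int(hashlib.md5(tri.encode()).hexdigest(), 16)
        ids.append(h % vocab_size)
    ids = ids[:max_length]
    ids += [0] * (max_length - len(ids))  # assumed pad token: 0
    return ids
```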
generate_adapter ¶
generate_adapter(
hypernetwork: Any,
trajectory_text: str,
output_dir: str,
base_model_id: str,
vocab_size: int = DEFAULT_VOCAB_SIZE,
max_length: int = 2048,
device: str = "cpu",
) -> str
End-to-end: encode trajectory, run hypernetwork, save adapter.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `hypernetwork` | `Any` | A DocToLoraHypernetwork instance. | required |
| `trajectory_text` | `str` | Text to encode into the adapter. | required |
| `output_dir` | `str` | Directory to save the adapter files. | required |
| `base_model_id` | `str` | HuggingFace model ID of the base model. | required |
| `vocab_size` | `int` | Vocabulary size for tokenization. | `DEFAULT_VOCAB_SIZE` |
| `max_length` | `int` | Max token sequence length. | `2048` |
| `device` | `str` | Device for tensor operations. | `'cpu'` |

Returns:

| Type | Description |
|---|---|
| `str` | Path to the saved adapter directory. |

Source code in libs/model-training/src/model_training/hypernetwork.py
save_hypernetwork_adapter ¶
save_hypernetwork_adapter(
weights: dict[str, "torch.Tensor"],
output_dir: str,
base_model_id: str,
rank: int = 8,
target_modules: list[str] | None = None,
) -> None
Serialize hypernetwork-generated LoRA weights in PEFT adapter format.
Writes:

- adapter_model.safetensors: the LoRA weight tensors
- adapter_config.json: PEFT-compatible configuration
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `weights` | `dict[str, torch.Tensor]` | PEFT state_dict from DocToLoraHypernetwork.forward(). | required |
| `output_dir` | `str` | Directory to write adapter files to (created if needed). | required |
| `base_model_id` | `str` | HuggingFace model ID of the base model. | required |
| `rank` | `int` | LoRA rank. | `8` |
| `target_modules` | `list[str] \| None` | List of module names. Default: ["q_proj", "v_proj"]. | `None` |

Note

Does NOT include embed_tokens or lm_head — vLLM rejects these in adapters (per Phase 21-01 decision: no modules_to_save in LoraConfig).

Source code in libs/model-training/src/model_training/hypernetwork.py
merging ¶
Adapter merging strategies for evolutionary combination.
Implements TIES-Merging and DARE-Merging for combining multiple LoRA adapter state dicts into a single merged adapter. All GPU imports are deferred inside function bodies (INFRA-05 pattern).
Functions¶
ties_merge ¶
ties_merge(
state_dicts: list[dict[str, Any]], density: float = 0.5
) -> dict[str, Any]
Merge adapter state dicts using TIES-Merging.
Trim-Elect-Sign-Disjoint merge: for each parameter, trims values below density threshold, elects the majority sign (ignoring trimmed values), then averages only the values matching the elected sign.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `state_dicts` | `list[dict[str, Any]]` | List of state dicts (tensors) to merge. | required |
| `density` | `float` | Fraction of values to keep per parameter (0.0 to 1.0). | `0.5` |

Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | Merged state dict with the same keys and shapes as the inputs. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If state_dicts is empty or density is outside [0.0, 1.0]. |

Source code in libs/model-training/src/model_training/merging.py
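The Trim-Elect-Sign steps can be sketched in pure Python on flat vectors. This is a conceptual sketch of the described procedure, not the library code (which operates on tensor state dicts): trim low-magnitude values per vector, elect the majority sign per position, then average only the surviving values matching that sign.

```python
def ties_merge_sketch(vectors: list[list[float]], density: float = 0.5) -> list[float]:
    """Trim-Elect-Sign merge of equally-shaped flat parameter vectors."""
    n = len(vectors[0])
    keep = max(int(round(density * n)), 1)
    # Trim: zero everything outside the top-|keep| magnitudes of each vector.
    trimmed = []
    for v in vectors:
        top = set(sorted(range(n), key=lambda i: abs(v[i]), reverse=True)[:keep])
        trimmed.append([v[i] if i in top else 0.0 for i in range(n)])
    merged = []
    for i in range(n):
        vals = [t[i] for t in trimmed if t[i] != 0.0]
        # Elect: majority sign by total mass (trimmed values ignored).
        sign = 1.0 if sum(vals) >= 0 else -1.0
        # Disjoint average: only values matching the elected sign contribute.
        chosen = [x for x in vals if x * sign > 0]
        merged.append(sum(chosen) / len(chosen) if chosen else 0.0)
    return merged
```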
dare_merge ¶
dare_merge(
state_dicts: list[dict[str, Any]],
drop_rate: float = 0.1,
seed: int | None = None,
) -> dict[str, Any]
Merge adapter state dicts using DARE-Merging.
Drop-And-REscale merge: randomly drops a fraction of values from each state dict, then averages the remaining values with rescaling.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `state_dicts` | `list[dict[str, Any]]` | List of state dicts to merge. | required |
| `drop_rate` | `float` | Fraction of values to drop per parameter. Must be in [0.0, 1.0). | `0.1` |
| `seed` | `int \| None` | Optional RNG seed for reproducible merges. | `None` |

Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | Merged state dict with the same keys and shapes as the inputs. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If state_dicts is empty or drop_rate is outside [0.0, 1.0). |

Source code in libs/model-training/src/model_training/merging.py
load_adapter_state_dict ¶
load_adapter_state_dict(
adapter_path: str | Path,
) -> dict[str, Any]
Load a LoRA adapter state dict from a safetensors file or directory.
Accepts either a direct .safetensors file path or a PEFT adapter
directory containing adapter_model.safetensors.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `adapter_path` | `str \| Path` | Path to a .safetensors file or a PEFT adapter directory. | required |

Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | State dict mapping parameter names to tensors. |

Source code in libs/model-training/src/model_training/merging.py
peft_utils ¶
QLoRA PEFT configuration and adapter management.
All GPU library imports (peft, transformers, torch) are deferred inside function bodies to ensure CPU-only importability (INFRA-05).
Functions¶
build_qlora_config ¶
build_qlora_config(
rank: int,
alpha: int,
target_modules: list[str],
dropout: float = 0.1,
) -> Any
Build a QLoRA configuration for PEFT fine-tuning.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `rank` | `int` | LoRA rank (dimensionality of the low-rank matrices). | required |
| `alpha` | `int` | LoRA alpha scaling factor. | required |
| `target_modules` | `list[str]` | List of module names to apply LoRA to. | required |
| `dropout` | `float` | Dropout probability for LoRA layers. | `0.1` |

Returns:

| Type | Description |
|---|---|
| `Any` | A peft LoraConfig instance configured for QLoRA. |

Example

config = build_qlora_config(rank=64, alpha=128, target_modules=["q_proj"])

Source code in libs/model-training/src/model_training/peft_utils.py
apply_lora_adapter ¶
apply_lora_adapter(model: Any, config: Any) -> Any
Apply a LoRA adapter to a base model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model` | `Any` | The base model to wrap with LoRA. | required |
| `config` | `Any` | The LoRA configuration (from build_qlora_config). | required |

Returns:

| Type | Description |
|---|---|
| `Any` | The model wrapped with a LoRA adapter via peft.get_peft_model. |

Example

adapted_model = apply_lora_adapter(base_model, lora_config)

Source code in libs/model-training/src/model_training/peft_utils.py
merge_adapter ¶
merge_adapter(model: Any) -> Any
Merge LoRA weights into the base model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model` | `Any` | A PEFT model with a LoRA adapter applied. | required |

Returns:

| Type | Description |
|---|---|
| `Any` | The base model with LoRA weights merged in. |

Raises:

| Type | Description |
|---|---|
| `NotImplementedError` | Adapter merging is out of scope for Phase 21. |

Example

merged = merge_adapter(peft_model)

Source code in libs/model-training/src/model_training/peft_utils.py
sakana_d2l ¶
SakanaAI Doc-to-LoRA integration.
Wraps Sakana's pretrained HyperLoRA perceiver so it can be used through our hypernetwork interface (load_pretrained → generate_adapter).
The Sakana hypernetwork takes per-layer activations from a base model as input and produces LoRA adapter weights. This module handles:

- Downloading the checkpoint from HuggingFace
- Patching flash-attention assertions for CPU/MPS/non-flash environments
- Extracting per-layer activations from the base model
- Saving the generated LoRA weights in PEFT format
GPU imports are deferred inside function bodies per INFRA-05 pattern.
Functions¶
download_checkpoint ¶
download_checkpoint(variant: str = DEFAULT_VARIANT) -> Path
Download Sakana's pretrained checkpoint from HuggingFace.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `variant` | `str` | Which checkpoint variant to download. Options: 'gemma_demo', 'gemma_2b_d2l', 'mistral_7b_d2l', 'qwen_4b_d2l'. | `DEFAULT_VARIANT` |

Returns:

| Type | Description |
|---|---|
| `Path` | Path to the downloaded checkpoint file. |

Source code in libs/model-training/src/model_training/sakana_d2l.py
load_sakana_checkpoint ¶
load_sakana_checkpoint(
checkpoint_path: str | Path | None = None,
variant: str = DEFAULT_VARIANT,
device: str = "cpu",
) -> tuple[Any, Any]
Load Sakana's HyperLoRA perceiver from checkpoint.
Downloads from HuggingFace if no local path is provided. Patches flash attention for CPU/MPS compatibility.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `checkpoint_path` | `str \| Path \| None` | Path to a local checkpoint. If None, downloads from HF. | `None` |
| `variant` | `str` | HF checkpoint variant (only used if checkpoint_path is None). | `DEFAULT_VARIANT` |
| `device` | `str` | Device to load onto. | `'cpu'` |

Returns:

| Type | Description |
|---|---|
| `tuple[Any, Any]` | Tuple of (hypernet, hypernet_config). |

Source code in libs/model-training/src/model_training/sakana_d2l.py
transfer_aggregator_weights ¶
transfer_aggregator_weights(
hypernet: Any, checkpoint_path: str | Path
) -> Any
Load aggregator weights from a Sakana checkpoint into a HyperLoRA instance.
Loads only `aggregator.*` weights from the checkpoint (not `head.*`), freezes all aggregator parameters (requires_grad=False), and leaves `head.*` at PyTorch default initialization for Phase 29 training against the new target model.
This enables reuse of the pretrained Perceiver aggregator across different target model architectures. The aggregator maps document embeddings to LoRA weight space and is model-agnostic; only the head needs retraining per target model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `hypernet` | `Any` | The HyperLoRA model to load weights into (mutated in place). | required |
| `checkpoint_path` | `str \| Path` | Path to the Sakana checkpoint (.bin file). | required |

Returns:

| Type | Description |
|---|---|
| `Any` | The mutated hypernet (returned for chaining convenience). |

Source code in libs/model-training/src/model_training/sakana_d2l.py
get_aggregator_config ¶
get_aggregator_config(checkpoint_path: str | Path) -> Any
Extract the Perceiver aggregator structural config from a Sakana checkpoint.
Reads the aggregator_config from the checkpoint's HypernetConfig so that d2l_config.py can populate the aggregator_config=None placeholder set in Phase 25.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `checkpoint_path` | `str \| Path` | Path to the Sakana checkpoint (.bin file). | required |

Returns:

| Type | Description |
|---|---|
| `Any` | The aggregator_config object from the checkpoint's HypernetConfig. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If the checkpoint's aggregator_config is None (predates this field). |

Source code in libs/model-training/src/model_training/sakana_d2l.py
extract_activations ¶
extract_activations(
text: str,
base_model_name: str,
layer_indices: list[int],
device: str = "cpu",
max_length: int = 512,
) -> tuple[Any, Any]
Extract per-layer hidden state activations from the base model.
Backward-compatible wrapper around extract_activations_with_model(). Loads model and tokenizer, delegates extraction, then cleans up.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | Input text to process. | required |
| `base_model_name` | `str` | HuggingFace model ID for the base model. | required |
| `layer_indices` | `list[int]` | Which layers to extract activations from. | required |
| `device` | `str` | Device for computation. | `'cpu'` |
| `max_length` | `int` | Max token sequence length. | `512` |

Returns:

| Type | Description |
|---|---|
| `tuple[Any, Any]` | Tuple of (features, attention_mask) ready for HyperLoRA. features shape: (1, num_layers, seq_len, hidden_dim); attention_mask shape: (1, seq_len). |

Source code in libs/model-training/src/model_training/sakana_d2l.py
generate_adapter_from_sakana ¶
generate_adapter_from_sakana(
text: str,
output_dir: str,
checkpoint_path: str | Path | None = None,
variant: str = DEFAULT_VARIANT,
base_model_name: str | None = None,
device: str = "cpu",
max_length: int = 512,
scaling_factor: float = 0.16,
) -> str
End-to-end: text → base model activations → HyperLoRA → PEFT adapter.
This is the main entry point. It:

1. Loads the Sakana pretrained perceiver (downloading if needed)
2. Runs text through the base model to get per-layer activations
3. Feeds activations through the perceiver to generate LoRA weights
4. Saves weights in PEFT-compatible format
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | Input text (trajectory, document, context) to encode. | required |
| `output_dir` | `str` | Directory to save the PEFT adapter files. | required |
| `checkpoint_path` | `str \| Path \| None` | Path to a local checkpoint, or None to download. | `None` |
| `variant` | `str` | HF checkpoint variant if downloading. | `DEFAULT_VARIANT` |
| `base_model_name` | `str \| None` | Override base model. If None, uses the one from the checkpoint. | `None` |
| `device` | `str` | Device for computation. | `'cpu'` |
| `max_length` | `int` | Maximum token sequence length for activation extraction. | `512` |
| `scaling_factor` | `float` | Adapter scaling multiplier (0-1, default from config). | `0.16` |

Returns:

| Type | Description |
|---|---|
| `str` | Path to the saved adapter directory. |

Source code in libs/model-training/src/model_training/sakana_d2l.py
trainer ¶
QLoRA training orchestrator.
All GPU-dependent imports (datasets, transformers, trl, torch) are deferred inside function bodies to ensure CPU-only importability (INFRA-05).
Module-level imports: stdlib only.
Functions¶
train_qlora ¶
train_qlora(
session_id: str,
adapter_id: str,
output_dir: str,
*,
base_model_id: str | None = None,
task_type: str = "code-gen",
rank: int = 64,
alpha: int = 128,
epochs: int = 3,
learning_rate: float = 0.0002,
) -> str
Train a QLoRA adapter from a recorded coding trajectory.
Orchestrates the full training pipeline: load trajectory, format as SFT messages, build dataset, load model with NF4 quantization, train with SFT, and save the adapter to output_dir.
All GPU imports are deferred to this function body; the module is safe to import in CPU-only environments.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `session_id` | `str` | Trajectory session ID to train from. | required |
| `adapter_id` | `str` | Unique identifier for the resulting adapter. | required |
| `output_dir` | `str` | Directory to save the trained adapter weights. | required |
| `base_model_id` | `str \| None` | HuggingFace model ID. Defaults to the RUNE_BASE_MODEL env var or "Qwen/Qwen2.5-Coder-7B-Instruct". | `None` |
| `task_type` | `str` | Task category (e.g. 'code-gen', 'bug-fix'). | `'code-gen'` |
| `rank` | `int` | LoRA rank. | `64` |
| `alpha` | `int` | LoRA alpha scaling factor. | `128` |
| `epochs` | `int` | Number of training epochs. | `3` |
| `learning_rate` | `float` | Optimizer learning rate. | `0.0002` |

Returns:

| Type | Description |
|---|---|
| `str` | The output_dir path where the adapter was saved. |

Raises:

| Type | Description |
|---|---|
| `FileNotFoundError` | If the trajectory file does not exist. |
| `ValueError` | If the trajectory is not successful or has no SFT messages. |

Source code in libs/model-training/src/model_training/trainer.py
train_and_register ¶
train_and_register(
session_id: str,
adapter_id: str,
*,
base_model_id: str | None = None,
task_type: str = "code-gen",
rank: int = 64,
alpha: int = 128,
epochs: int = 3,
learning_rate: float = 0.0002,
database_url: str | None = None,
) -> str
Train a QLoRA adapter and register it in the AdapterRegistry.
Combines train_qlora() with AdapterRegistry.store() to produce a fully registered adapter ready for vLLM serving.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| session_id | str | Trajectory session ID to train from. | required |
| adapter_id | str | Unique identifier for the resulting adapter. | required |
| base_model_id | str \| None | HuggingFace model ID. Defaults to the RUNE_BASE_MODEL env var. | None |
| task_type | str | Task category (e.g. 'code-gen', 'bug-fix'). | 'code-gen' |
| rank | int | LoRA rank. | 64 |
| alpha | int | LoRA alpha scaling factor. | 128 |
| epochs | int | Number of training epochs. | 3 |
| learning_rate | float | Optimizer learning rate. | 0.0002 |
| database_url | str \| None | SQLAlchemy database URL. Defaults to the RUNE_DATABASE_URL env var or "sqlite:///{home}/.rune/rune.db". | None |
Returns:
| Type | Description |
|---|---|
| str | adapter_id of the registered adapter. |
Raises:
| Type | Description |
|---|---|
| FileNotFoundError | If the trajectory file does not exist. |
| ValueError | If the trajectory is not successful or has no SFT messages. |
Source code in libs/model-training/src/model_training/trainer.py
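The train-then-store composition described above can be sketched with stubs. Only the flow (train a QLoRA adapter, store it in the registry, return the adapter_id) is taken from the docs; the stub bodies and the `adapters/{adapter_id}` output path are illustrative assumptions, not the real implementations.

```python
def train_qlora_stub(session_id: str, adapter_id: str, output_dir: str, **hyperparams) -> str:
    # Stands in for train_qlora(): the real version fine-tunes a QLoRA
    # adapter from the trajectory and returns the output_dir path.
    return output_dir


class AdapterRegistryStub:
    """Stand-in for AdapterRegistry; tracks adapter_id -> weights path."""

    def __init__(self) -> None:
        self.adapters: dict[str, str] = {}

    def store(self, adapter_id: str, path: str) -> None:
        self.adapters[adapter_id] = path


def train_and_register_sketch(session_id: str, adapter_id: str, registry: AdapterRegistryStub) -> str:
    # Train, then register the resulting weights for vLLM serving.
    path = train_qlora_stub(session_id, adapter_id, output_dir=f"adapters/{adapter_id}")
    registry.store(adapter_id, path)
    return adapter_id  # matches the documented return value
```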
trajectory ¶
Trajectory recording and formatting for coding session distillation.
Provides functions to persist, load, and convert coding session trajectories into SFT-compatible chat format for LoRA fine-tuning pipelines.
Functions¶
record_trajectory ¶
```python
record_trajectory(
    session_id: str,
    steps: list[dict[str, Any]],
    outcome: Optional[str] = None,
    *,
    task_description: str = "",
    task_type: str = "",
    adapter_ids: list[str] | None = None,
) -> dict[str, Any]
```
Persist a coding session trajectory to disk for future distillation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| session_id | str | Unique identifier for the coding session. | required |
| steps | list[dict[str, Any]] | List of step dicts, each containing attempt results. | required |
| outcome | Optional[str] | Final session result ('success', 'exhausted', or None). | None |
| task_description | str | Natural language description of the coding task. | '' |
| task_type | str | Category of task (e.g. 'function', 'class', 'refactor'). | '' |
| adapter_ids | list[str] \| None | LoRA adapter IDs used during the session. | None |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | A dict with 'session_id' and 'file_path' keys. |
Source code in libs/model-training/src/model_training/trajectory.py
load_trajectory ¶
```python
load_trajectory(trajectory_id: str) -> dict[str, Any]
```
Load a stored trajectory by session ID.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| trajectory_id | str | The session ID used as the filename (without .json). | required |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | A dict containing the full trajectory data including steps and metadata. |
Raises:
| Type | Description |
|---|---|
| FileNotFoundError | If no trajectory file exists for the given ID. |
Source code in libs/model-training/src/model_training/trajectory.py
format_for_sft ¶
```python
format_for_sft(trajectory: dict[str, Any]) -> list[dict[str, str]]
```
Convert a trajectory into SFT-compatible chat format.
Only successful trajectories (outcome == 'success') produce output. Extracts the final step where tests_passed is True as the assistant message.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| trajectory | dict[str, Any] | A trajectory dict as returned by load_trajectory. | required |
Returns:
| Type | Description |
|---|---|
| list[dict[str, str]] | A list of 3 message dicts ([system, user, assistant]) for successful trajectories, or an empty list if the trajectory did not succeed. |
Source code in libs/model-training/src/model_training/trajectory.py
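An illustrative reimplementation of the documented contract: only `outcome == 'success'` yields messages, and the assistant turn comes from the final step whose tests passed. The step field names (`code`, `tests_passed`) and the system prompt text are assumptions about the trajectory schema, not the library's actual values.

```python
from typing import Any


def format_for_sft_sketch(trajectory: dict[str, Any]) -> list[dict[str, str]]:
    # Unsuccessful trajectories contribute no training data.
    if trajectory.get("outcome") != "success":
        return []
    passing = [s for s in trajectory.get("steps", []) if s.get("tests_passed")]
    if not passing:
        return []
    # Three-message chat sample: [system, user, assistant].
    return [
        {"role": "system", "content": "You are a coding assistant."},
        {"role": "user", "content": trajectory.get("task_description", "")},
        {"role": "assistant", "content": passing[-1].get("code", "")},
    ]
```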