API Reference for evaluation

evaluation

Evaluation library for adapter benchmarking and fitness scoring.

Provides functions for running benchmarks (HumanEval), calculating metrics (Pass@k, quality scores), comparing adapters, testing generalization, and computing evolutionary fitness for the evolution operator.

Functions

calculate_pass_at_k

calculate_pass_at_k(
    n_samples: int, n_correct: int, k: int = 1
) -> float

Calculate the Pass@k metric for code generation evaluation.

Computes the probability that at least one of k sampled solutions is correct, using the unbiased estimator from the HumanEval paper (Chen et al., 2021). This avoids the high-variance naive estimator.

Formula: 1 - prod((n-c-i)/(n-i) for i in range(k))

Parameters:

Name Type Description Default
n_samples int

Total number of samples generated per problem.

required
n_correct int

Number of correct samples out of n_samples.

required
k int

Number of attempts allowed for the Pass@k metric. Defaults to 1.

1

Returns:

Type Description
float

Pass@k probability as a float between 0.0 and 1.0.

Raises:

Type Description
ValueError

If n_correct > n_samples.

Example

>>> score = calculate_pass_at_k(n_samples=100, n_correct=85, k=1)
>>> score
0.85

Source code in libs/evaluation/src/evaluation/metrics.py
def calculate_pass_at_k(n_samples: int, n_correct: int, k: int = 1) -> float:
    """Calculate the Pass@k metric for code generation evaluation.

    Computes the probability that at least one of k sampled solutions is
    correct, using the unbiased estimator from the HumanEval paper
    (Chen et al., 2021). This avoids the high-variance naive estimator.

    Formula: 1 - prod((n-c-i)/(n-i) for i in range(k))

    Args:
        n_samples: Total number of samples generated per problem.
        n_correct: Number of correct samples out of n_samples.
        k: Number of attempts allowed for the Pass@k metric. Defaults to 1.

    Returns:
        Pass@k probability as a float between 0.0 and 1.0.

    Raises:
        ValueError: If n_correct > n_samples.

    Example:
        >>> score = calculate_pass_at_k(n_samples=100, n_correct=85, k=1)
        >>> score
        0.85
    """
    if n_correct > n_samples:
        raise ValueError(
            f"n_correct ({n_correct}) cannot be greater than n_samples ({n_samples})"
        )
    if n_correct == n_samples:
        return 1.0
    if n_samples - n_correct < k:
        return 1.0
    # Unbiased estimator: 1 - prod((n-c-i)/(n-i) for i in range(k))
    product = math.prod((n_samples - n_correct - i) / (n_samples - i) for i in range(k))
    return 1.0 - product
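
To see how the estimator rewards extra attempts, here is a standalone sketch of the same formula using only the standard library (independent of the package; the numbers are illustrative):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: 1 - prod((n-c-i)/(n-i) for i in range(k))."""
    if n - c < k:
        # Fewer than k incorrect samples: any k draws must include a correct one.
        return 1.0
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))

# 20 samples, 5 correct: one attempt passes 25% of the time,
# five attempts push the probability past 80%.
print(pass_at_k(20, 5, 1))            # 0.25
print(round(pass_at_k(20, 5, 5), 4))  # 0.8063
```

As k approaches the number of incorrect samples the estimate saturates at 1.0, which matches the early-return guards in the source above.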

compare_adapters

compare_adapters(
    adapter_ids: list[str], benchmark: str = "humaneval"
) -> dict[str, Any]

Compare multiple adapters head-to-head on a benchmark.

Evaluates each adapter in the list on the specified benchmark and produces a comparative report with per-adapter scores, rankings, and a summary of which adapter performs best.

Parameters:

Name Type Description Default
adapter_ids list[str]

List of adapter UUIDs to compare. Must contain at least two adapter IDs.

required
benchmark str

Benchmark name to use for comparison. Currently supports "humaneval". Defaults to "humaneval".

'humaneval'

Returns:

Type Description
dict[str, Any]

Dictionary with comparison results including:

- "scores": dict mapping adapter_id to its benchmark score
- "rankings": list of adapter_ids sorted best-to-worst
- "best_adapter": str, UUID of the top-performing adapter
- "summary": str, human-readable comparison summary

Raises:

Type Description
ValueError

If fewer than two adapter IDs are provided.

Example

>>> results = compare_adapters(["adapter-001", "adapter-002"])
>>> results["best_adapter"] in ["adapter-001", "adapter-002"]
True
>>> results["best_adapter"] in results["rankings"]
True

Source code in libs/evaluation/src/evaluation/metrics.py
def compare_adapters(
    adapter_ids: list[str],
    benchmark: str = "humaneval",
) -> dict[str, Any]:
    """Compare multiple adapters head-to-head on a benchmark.

    Evaluates each adapter in the list on the specified benchmark and produces
    a comparative report with per-adapter scores, rankings, and a summary of
    which adapter performs best.

    Args:
        adapter_ids: List of adapter UUIDs to compare. Must contain at least
            two adapter IDs.
        benchmark: Benchmark name to use for comparison. Currently supports
            "humaneval". Defaults to "humaneval".

    Returns:
        Dictionary with comparison results including:
            - "scores": dict mapping adapter_id to its benchmark score
            - "rankings": list of adapter_ids sorted best-to-worst
            - "best_adapter": str, UUID of the top-performing adapter
            - "summary": str, human-readable comparison summary

    Raises:
        ValueError: If fewer than two adapter IDs are provided.

    Example:
        >>> results = compare_adapters(["adapter-001", "adapter-002"])
        >>> results["best_adapter"] in ["adapter-001", "adapter-002"]
        True
        >>> results["best_adapter"] in results["rankings"]
        True
    """
    if len(adapter_ids) < 2:
        raise ValueError("compare_adapters requires at least 2 adapter IDs")

    # Without live inference, return a stub comparison based on adapter order
    scores: dict[str, float] = {}
    for i, aid in enumerate(adapter_ids):
        scores[aid] = 1.0 / (i + 1)  # Placeholder scoring

    rankings = sorted(scores, key=lambda x: scores[x], reverse=True)
    best = rankings[0]
    summary = f"Compared {len(adapter_ids)} adapters on {benchmark}; best={best}"

    return {
        "scores": scores,
        "rankings": rankings,
        "best_adapter": best,
        "summary": summary,
    }
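
The ranking step in the stub above is a plain reverse sort of the score mapping; a minimal standalone sketch with made-up scores:

```python
# Hypothetical scores keyed by adapter ID (placeholder values).
scores = {"adapter-001": 0.72, "adapter-002": 0.85, "adapter-003": 0.61}

# Sort adapter IDs best-to-worst by score, as compare_adapters does.
rankings = sorted(scores, key=lambda aid: scores[aid], reverse=True)
best_adapter = rankings[0]

print(rankings)      # ['adapter-002', 'adapter-001', 'adapter-003']
print(best_adapter)  # adapter-002
```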

evaluate_fitness

evaluate_fitness(
    adapter_id: str,
    pass_rate: float,
    diversity_score: float = 0.0,
) -> float

Calculate evolutionary fitness score for the evolution operator.

Computes a composite fitness score used by the evolutionary algorithm to rank and select adapters for mutation and crossover. Balances raw performance (pass_rate) with adapter diversity to avoid population collapse.

Parameters:

Name Type Description Default
adapter_id str

UUID of the adapter to evaluate.

required
pass_rate float

Fraction of benchmark tasks passed, in range 0.0 to 1.0.

required
diversity_score float

Adapter uniqueness metric representing how different this adapter is from others in the current population. Range 0.0 to 1.0; higher values indicate more unique adapters. Defaults to 0.0.

0.0

Returns:

Type Description
float

Fitness score as a float between 0.0 and 1.0. Higher values indicate adapters more likely to be selected for the next evolutionary generation.

Example

>>> fitness = evaluate_fitness(
...     "adapter-001", pass_rate=0.85, diversity_score=0.3,
... )
>>> round(fitness, 3)
0.685
>>> low_diversity = evaluate_fitness(
...     "adapter-001", pass_rate=0.85, diversity_score=0.0,
... )
>>> low_diversity < fitness
True

Source code in libs/evaluation/src/evaluation/metrics.py
def evaluate_fitness(
    adapter_id: str,
    pass_rate: float,
    diversity_score: float = 0.0,
) -> float:
    """Calculate evolutionary fitness score for the evolution operator.

    Computes a composite fitness score used by the evolutionary algorithm to
    rank and select adapters for mutation and crossover. Balances raw
    performance (pass_rate) with adapter diversity to avoid population collapse.

    Args:
        adapter_id: UUID of the adapter to evaluate.
        pass_rate: Fraction of benchmark tasks passed, in range 0.0 to 1.0.
        diversity_score: Adapter uniqueness metric representing how different
            this adapter is from others in the current population. Range 0.0
            to 1.0; higher values indicate more unique adapters. Defaults to 0.0.

    Returns:
        Fitness score as a float between 0.0 and 1.0. Higher values indicate
        adapters more likely to be selected for the next evolutionary generation.

    Example:
        >>> fitness = evaluate_fitness(
        ...     "adapter-001", pass_rate=0.85, diversity_score=0.3,
        ... )
        >>> round(fitness, 3)
        0.685
        >>> low_diversity = evaluate_fitness(
        ...     "adapter-001", pass_rate=0.85, diversity_score=0.0,
        ... )
        >>> low_diversity < fitness
        True
    """
    fitness = 0.7 * pass_rate + 0.3 * diversity_score
    return fitness
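
The 0.7/0.3 weighting above means diversity can flip a ranking; a quick sketch of that trade-off (the adapter numbers below are made up):

```python
def fitness(pass_rate: float, diversity_score: float) -> float:
    # Same composite rule as evaluate_fitness above.
    return 0.7 * pass_rate + 0.3 * diversity_score

strong_but_redundant = fitness(0.85, 0.0)  # high pass rate, no diversity
weaker_but_unique = fitness(0.80, 0.50)    # slightly lower pass rate, unique

print(round(strong_but_redundant, 3))  # 0.595
print(round(weaker_but_unique, 3))     # 0.71
assert weaker_but_unique > strong_but_redundant
```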

run_humaneval_subset

run_humaneval_subset(
    adapter_id: Optional[str],
    subset_size: int = 20,
    model: str = "Qwen/Qwen2.5-Coder-7B",
    completions: Optional[dict[str, str]] = None,
) -> dict[str, Any]

Run a HumanEval benchmark subset to evaluate an adapter.

Loads tasks from the bundled 20-task HumanEval subset JSON and evaluates them using the provided completions. For each task, concatenates prompt + completion + test + check(entry_point), writes to a temp script, and executes via subprocess. Exit code 0 = passed.

Parameters:

Name Type Description Default
adapter_id Optional[str]

UUID of the adapter to test (None = baseline, no adapter). Currently informational; inference wiring happens at a higher level.

required
subset_size int

Ignored — always uses the fixed 20-task bundled subset.

20
model str

Base model name (informational only, not used for inference here).

'Qwen/Qwen2.5-Coder-7B'
completions Optional[dict[str, str]]

Dict mapping task_id -> completion string. Only tasks with an entry in this dict are evaluated. If None, a NotImplementedError is raised; if empty, returns empty results.

None

Returns:

Type Description
dict[str, Any]

Dictionary with benchmark results including:

- "pass_count": int, number of tasks passed
- "fail_count": int, number of tasks failed
- "pass_rate": float, fraction of tasks passed (0.0 to 1.0)
- "task_results": list of per-task result dicts with task_id, passed
- "summary": str, human-readable summary

Raises:

Type Description
NotImplementedError

If completions are not provided (inference not wired).

Example

>>> completions = {"HumanEval/0": "    return []"}
>>> results = run_humaneval_subset(adapter_id=None, completions=completions)
>>> results["pass_count"]
0

Source code in libs/evaluation/src/evaluation/metrics.py
def run_humaneval_subset(
    adapter_id: Optional[str],
    subset_size: int = 20,
    model: str = "Qwen/Qwen2.5-Coder-7B",
    completions: Optional[dict[str, str]] = None,
) -> dict[str, Any]:
    """Run a HumanEval benchmark subset to evaluate an adapter.

    Loads tasks from the bundled 20-task HumanEval subset JSON and evaluates
    them using the provided completions. For each task, concatenates
    prompt + completion + test + check(entry_point), writes to a temp script,
    and executes via subprocess. Exit code 0 = passed.

    Args:
        adapter_id: UUID of the adapter to test (None = baseline, no adapter).
            Currently informational; inference wiring happens at a higher level.
        subset_size: Ignored — always uses the fixed 20-task bundled subset.
        model: Base model name (informational only, not used for inference here).
        completions: Dict mapping task_id -> completion string. Only tasks with
            an entry in this dict are evaluated. If None, a NotImplementedError
            is raised; if empty, returns empty results.

    Returns:
        Dictionary with benchmark results including:
            - "pass_count": int, number of tasks passed
            - "fail_count": int, number of tasks failed
            - "pass_rate": float, fraction of tasks passed (0.0 to 1.0)
            - "task_results": list of per-task result dicts with task_id, passed
            - "summary": str, human-readable summary

    Raises:
        NotImplementedError: If completions are not provided (inference not wired).

    Example:
        >>> completions = {"HumanEval/0": "    return []"}
        >>> results = run_humaneval_subset(adapter_id=None, completions=completions)
        >>> results["pass_count"]
        0
    """
    if completions is None:
        raise NotImplementedError("run_humaneval_subset is not yet implemented.")

    # Load bundled task data
    subset_path = _DATA_DIR / "humaneval_subset.json"
    with subset_path.open() as f:
        all_tasks: list[dict[str, str]] = json.load(f)

    # Build lookup by task_id
    task_map = {t["task_id"]: t for t in all_tasks}

    task_results: list[dict[str, Any]] = []

    with tempfile.TemporaryDirectory() as tmpdir:
        for task_id, completion in completions.items():
            task = task_map.get(task_id)
            if task is None:
                continue

            # Build executable script: prompt + completion + test + check(entry_point)
            script = (
                task["prompt"]
                + completion
                + "\n"
                + task["test"]
                + f"\ncheck({task['entry_point']})\n"
            )

            script_path = Path(tmpdir) / f"{task_id.replace('/', '_')}.py"
            script_path.write_text(script)

            passed = safe_subprocess_run(script_path, cwd=tmpdir)
            task_results.append({"task_id": task_id, "passed": passed})

    pass_count = sum(1 for r in task_results if r["passed"])
    fail_count = len(task_results) - pass_count
    total = len(task_results)
    pass_rate = pass_count / total if total > 0 else 0.0

    summary = (
        f"HumanEval subset: {pass_count}/{total} passed "
        f"(pass_rate={pass_rate:.2%}, adapter_id={adapter_id})"
    )
    print(summary)

    return {
        "pass_count": pass_count,
        "fail_count": fail_count,
        "pass_rate": pass_rate,
        "task_results": task_results,
        "summary": summary,
    }
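
The per-task execution pattern above (prompt + completion + test + check(entry_point), run in a subprocess, exit code 0 = pass) can be sketched end-to-end without the bundled task data. The task below is a made-up stand-in, and plain subprocess.run stands in for the library's safe_subprocess_run:

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical task in the same record shape as the bundled subset.
task = {
    "task_id": "Example/0",
    "prompt": "def add(a, b):\n",
    "test": "def check(candidate):\n    assert candidate(1, 2) == 3\n",
    "entry_point": "add",
}
completion = "    return a + b\n"

# Assemble the executable script exactly as run_humaneval_subset does.
script = (
    task["prompt"] + completion + "\n" + task["test"] + f"\ncheck({task['entry_point']})\n"
)

with tempfile.TemporaryDirectory() as tmpdir:
    path = Path(tmpdir) / f"{task['task_id'].replace('/', '_')}.py"
    path.write_text(script)
    result = subprocess.run([sys.executable, str(path)], cwd=tmpdir, capture_output=True)

passed = result.returncode == 0  # exit code 0 means the task's checks all ran clean
print(passed)  # True
```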

run_kill_switch_gate

run_kill_switch_gate(
    baseline_pass1: float,
    adapter_pass1: float,
    threshold: float = 0.05,
) -> dict[str, object]

Compare baseline vs adapter Pass@1 scores and return a PASS/FAIL verdict.

Computes the relative improvement of the adapter over the baseline and returns PASS if the improvement meets the threshold, FAIL otherwise.

Parameters:

Name Type Description Default
baseline_pass1 float

Pass@1 score for the baseline model (no adapter).

required
adapter_pass1 float

Pass@1 score for the adapter model.

required
threshold float

Minimum required relative improvement (default 0.05 = 5%).

0.05

Returns:

Type Description
dict[str, object]

Dictionary with:

- "baseline_pass1": float
- "adapter_pass1": float
- "relative_delta": float, relative improvement over baseline
- "verdict": str, "PASS" or "FAIL"

Example

>>> result = run_kill_switch_gate(0.50, 0.55)
>>> result["verdict"]
'PASS'

Source code in libs/evaluation/src/evaluation/metrics.py
def run_kill_switch_gate(
    baseline_pass1: float,
    adapter_pass1: float,
    threshold: float = 0.05,
) -> dict[str, object]:
    """Compare baseline vs adapter Pass@1 scores and return a PASS/FAIL verdict.

    Computes the relative improvement of the adapter over the baseline and
    returns PASS if the improvement meets the threshold, FAIL otherwise.

    Args:
        baseline_pass1: Pass@1 score for the baseline model (no adapter).
        adapter_pass1: Pass@1 score for the adapter model.
        threshold: Minimum required relative improvement (default 0.05 = 5%).

    Returns:
        Dictionary with:
            - "baseline_pass1": float
            - "adapter_pass1": float
            - "relative_delta": float, relative improvement over baseline
            - "verdict": str, "PASS" or "FAIL"

    Example:
        >>> result = run_kill_switch_gate(0.50, 0.55)
        >>> result["verdict"]
        'PASS'
    """
    # 1e-9 prevents division-by-zero when baseline is 0.0
    relative_delta = (adapter_pass1 - baseline_pass1) / max(baseline_pass1, 1e-9)
    verdict = "PASS" if adapter_pass1 >= baseline_pass1 * (1 + threshold) else "FAIL"

    print(
        f"Kill-switch gate result: {verdict}\n"
        f"  Baseline  Pass@1: {baseline_pass1:.4f}\n"
        f"  Adapter   Pass@1: {adapter_pass1:.4f}\n"
        f"  Relative delta:   {relative_delta:+.2%} (threshold: {threshold:+.0%})"
    )

    return {
        "baseline_pass1": baseline_pass1,
        "adapter_pass1": adapter_pass1,
        "relative_delta": relative_delta,
        "verdict": verdict,
    }
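
The PASS bar is relative, not absolute: with the default 5% threshold, a 0.50 baseline requires the adapter to reach 0.525. A quick check of that arithmetic:

```python
baseline_pass1, threshold = 0.50, 0.05

# PASS requires adapter_pass1 >= baseline_pass1 * (1 + threshold).
required = baseline_pass1 * (1 + threshold)

print(required)          # 0.525
print(0.55 >= required)  # True  -> PASS
print(0.52 >= required)  # False -> FAIL (improved, but under the bar)
```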

score_adapter_quality

score_adapter_quality(
    adapter_id: str,
    pass_rate: float,
    generalization_delta: float | None = None,
) -> float

Compute an overall quality score for an adapter.

Aggregates the adapter's benchmark pass rate with its generalization performance (if available) into a single scalar quality score. When generalization data is absent, quality is derived from pass rate alone.

Parameters:

Name Type Description Default
adapter_id str

UUID of the adapter to score.

required
pass_rate float

Fraction of benchmark tasks passed, in range 0.0 to 1.0.

required
generalization_delta float | None

Optional difference between in-distribution and out-of-distribution (OOD) performance. Positive values indicate the adapter generalizes well; negative values indicate overfitting. Defaults to None (not measured).

None

Returns:

Type Description
float

Quality score as a float between 0.0 and 1.0. Higher values indicate a better overall adapter quality.

Example

>>> score = score_adapter_quality("adapter-001", pass_rate=0.85)
>>> score
0.85
>>> score_with_gen = score_adapter_quality(
...     "adapter-001", 0.85, generalization_delta=0.1,
... )
>>> score_with_gen > score
True

Source code in libs/evaluation/src/evaluation/metrics.py
def score_adapter_quality(
    adapter_id: str,
    pass_rate: float,
    generalization_delta: float | None = None,
) -> float:
    """Compute an overall quality score for an adapter.

    Aggregates the adapter's benchmark pass rate with its generalization
    performance (if available) into a single scalar quality score. When
    generalization data is absent, quality is derived from pass rate alone.

    Args:
        adapter_id: UUID of the adapter to score.
        pass_rate: Fraction of benchmark tasks passed, in range 0.0 to 1.0.
        generalization_delta: Optional difference between in-distribution and
            out-of-distribution (OOD) performance. Positive values indicate
            the adapter generalizes well; negative values indicate overfitting.
            Defaults to None (not measured).

    Returns:
        Quality score as a float between 0.0 and 1.0. Higher values indicate
        a better overall adapter quality.

    Example:
        >>> score = score_adapter_quality("adapter-001", pass_rate=0.85)
        >>> score
        0.85
        >>> score_with_gen = score_adapter_quality(
        ...     "adapter-001", 0.85, generalization_delta=0.1,
        ... )
        >>> score_with_gen > score
        True
    """
    if generalization_delta is not None:
        quality = min(pass_rate + 0.1 * max(generalization_delta, 0), 1.0)
    else:
        quality = pass_rate
    return quality

test_generalization

test_generalization(
    adapter_id: str,
    in_distribution_tasks: list[str] | None = None,
    ood_tasks: list[str] | None = None,
) -> dict[str, Any]

Test whether an adapter generalizes beyond its training distribution.

Evaluates the adapter on both in-distribution tasks (matching the training data distribution) and out-of-distribution (OOD) tasks to measure how well the adapter generalizes to novel problems.

Parameters:

Name Type Description Default
adapter_id str

UUID of the adapter to evaluate.

required
in_distribution_tasks list[str] | None

Optional list of task IDs matching the adapter's training distribution. If None, uses a default in-distribution set.

None
ood_tasks list[str] | None

Optional list of out-of-distribution task IDs to test generalization on. If None, uses a default OOD task set.

None

Returns:

Type Description
dict[str, Any]

Dictionary with generalization results including:

- "in_distribution_score": float, performance on training-distribution tasks
- "ood_score": float, performance on out-of-distribution tasks
- "generalization_delta": float, difference (in_distribution - ood)
- "generalizes": bool, True if generalization_delta is within threshold

Example

>>> results = test_generalization("adapter-001")
>>> round(results["generalization_delta"], 2)
0.2
>>> results["generalizes"]
True

Source code in libs/evaluation/src/evaluation/metrics.py
def test_generalization(
    adapter_id: str,
    in_distribution_tasks: list[str] | None = None,
    ood_tasks: list[str] | None = None,
) -> dict[str, Any]:
    """Test whether an adapter generalizes beyond its training distribution.

    Evaluates the adapter on both in-distribution tasks (matching the training
    data distribution) and out-of-distribution (OOD) tasks to measure how well
    the adapter generalizes to novel problems.

    Args:
        adapter_id: UUID of the adapter to evaluate.
        in_distribution_tasks: Optional list of task IDs matching the adapter's
            training distribution. If None, uses a default in-distribution set.
        ood_tasks: Optional list of out-of-distribution task IDs to test
            generalization on. If None, uses a default OOD task set.

    Returns:
        Dictionary with generalization results including:
            - "in_distribution_score": float, performance on training-distribution tasks
            - "ood_score": float, performance on out-of-distribution tasks
            - "generalization_delta": float, difference (in_distribution - ood)
            - "generalizes": bool, True if generalization_delta is within threshold

    Example:
        >>> results = test_generalization("adapter-001")
        >>> round(results["generalization_delta"], 2)
        0.2
        >>> results["generalizes"]
        True
    """
    in_dist_score = 0.8  # Placeholder until live inference wired
    ood_score = 0.6
    gen_delta = round(in_dist_score - ood_score, 4)  # round away float noise (0.8 - 0.6 != 0.2 exactly)
    generalizes = abs(gen_delta) <= 0.2

    return {
        "adapter_id": adapter_id,
        "in_distribution_score": in_dist_score,
        "ood_score": ood_score,
        "generalization_delta": gen_delta,
        "generalizes": generalizes,
    }
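
The placeholder verdict above reduces to a single absolute-gap rule; here it is isolated, with the 0.2 tolerance used by the source (the scores below are made up):

```python
THRESHOLD = 0.2  # maximum tolerated |in_distribution - ood| gap

def generalizes(in_dist_score: float, ood_score: float) -> bool:
    # Adapter "generalizes" if OOD performance stays close to in-distribution.
    return abs(in_dist_score - ood_score) <= THRESHOLD

print(generalizes(0.80, 0.65))  # True  (0.15 gap, within tolerance)
print(generalizes(0.90, 0.50))  # False (0.40 gap, overfit)
```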

compute_generalization_delta

compute_generalization_delta(
    in_dist_rate: float, ood_rate: float
) -> float

Compute the generalization delta between in-distribution and OOD performance.

A positive delta means the adapter does better on OOD tasks relative to in-distribution performance. A negative delta indicates overfitting.

Parameters:

Name Type Description Default
in_dist_rate float

Pass rate on in-distribution tasks (0.0 to 1.0).

required
ood_rate float

Pass rate on OOD tasks (0.0 to 1.0).

required

Returns:

Type Description
float

Generalization delta as ood_rate - in_dist_rate.

Source code in libs/evaluation/src/evaluation/ood_benchmark.py
def compute_generalization_delta(
    in_dist_rate: float,
    ood_rate: float,
) -> float:
    """Compute the generalization delta between in-distribution and OOD performance.

    A positive delta means the adapter does better on OOD tasks relative to
    in-distribution performance. A negative delta indicates overfitting.

    Args:
        in_dist_rate: Pass rate on in-distribution tasks (0.0 to 1.0).
        ood_rate: Pass rate on OOD tasks (0.0 to 1.0).

    Returns:
        Generalization delta as ood_rate - in_dist_rate.
    """
    return ood_rate - in_dist_rate

run_ood_benchmark

run_ood_benchmark(
    adapter_id: str | None,
    completions: dict[str, str],
    benchmark_name: str = "ood_python",
) -> dict[str, Any]

Run an out-of-distribution benchmark on provided completions.

Evaluates completions against OOD tasks from the bundled task set. Each task's prompt + completion is executed in a subprocess with its test harness.

Parameters:

Name Type Description Default
adapter_id str | None

UUID of the adapter being tested (informational).

required
completions dict[str, str]

Dict mapping task_id to completion string.

required
benchmark_name str

Name of the OOD benchmark set.

'ood_python'

Returns:

Type Description
dict[str, Any]

Dictionary with ood_pass_rate and per-task results.

Source code in libs/evaluation/src/evaluation/ood_benchmark.py
def run_ood_benchmark(
    adapter_id: str | None,
    completions: dict[str, str],
    benchmark_name: str = "ood_python",
) -> dict[str, Any]:
    """Run an out-of-distribution benchmark on provided completions.

    Evaluates completions against OOD tasks from the bundled task set.
    Each task's prompt + completion is executed in a subprocess with its
    test harness.

    Args:
        adapter_id: UUID of the adapter being tested (informational).
        completions: Dict mapping task_id to completion string.
        benchmark_name: Name of the OOD benchmark set.

    Returns:
        Dictionary with ood_pass_rate and per-task results.
    """
    ood_path = _OOD_DATA_DIR / "ood_tasks.json"
    with ood_path.open() as f:
        all_tasks: list[dict[str, str]] = json.load(f)

    task_map = {t["task_id"]: t for t in all_tasks}
    task_results: list[dict[str, Any]] = []

    with tempfile.TemporaryDirectory() as tmpdir:
        for task_id, completion in completions.items():
            task = task_map.get(task_id)
            if task is None:
                continue

            script = task["prompt"] + completion + "\n" + task["test"] + "\n"
            script_path = Path(tmpdir) / f"{task_id.replace('/', '_')}.py"
            script_path.write_text(script)

            passed = safe_subprocess_run(script_path, cwd=tmpdir)

            task_results.append({"task_id": task_id, "passed": passed})

    total = len(task_results)
    pass_count = sum(1 for r in task_results if r["passed"])
    ood_pass_rate = pass_count / total if total > 0 else 0.0

    return {
        "adapter_id": adapter_id,
        "benchmark_name": benchmark_name,
        "ood_pass_rate": ood_pass_rate,
        "pass_count": pass_count,
        "total": total,
        "task_results": task_results,
    }

Modules

metrics

Evaluation metrics for adapter benchmarking and fitness scoring.

Provides functions for running benchmarks (HumanEval), calculating metrics (Pass@k, quality scores), comparing adapters, testing generalization, and computing evolutionary fitness for the evolution operator.

Functions
calculate_pass_at_k
calculate_pass_at_k(
    n_samples: int, n_correct: int, k: int = 1
) -> float

Calculate the Pass@k metric for code generation evaluation.

Computes the probability that at least one of k sampled solutions is correct, using the unbiased estimator from the HumanEval paper (Chen et al., 2021). This avoids the high-variance naive estimator.

Formula: 1 - prod((n-c-i)/(n-i) for i in range(k))

Parameters:

Name Type Description Default
n_samples int

Total number of samples generated per problem.

required
n_correct int

Number of correct samples out of n_samples.

required
k int

Number of attempts allowed for the Pass@k metric. Defaults to 1.

1

Returns:

Type Description
float

Pass@k probability as a float between 0.0 and 1.0.

Raises:

Type Description
ValueError

If n_correct > n_samples.

Example

score = calculate_pass_at_k(n_samples=100, n_correct=85, k=1) score 0.85

Source code in libs/evaluation/src/evaluation/metrics.py
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
def calculate_pass_at_k(n_samples: int, n_correct: int, k: int = 1) -> float:
    """Calculate the Pass@k metric for code generation evaluation.

    Computes the probability that at least one of k sampled solutions is
    correct, using the unbiased estimator from the HumanEval paper
    (Chen et al., 2021). This avoids the high-variance naive estimator.

    Formula: 1 - prod((n-c-i)/(n-i) for i in range(k))

    Args:
        n_samples: Total number of samples generated per problem.
        n_correct: Number of correct samples out of n_samples.
        k: Number of attempts allowed for the Pass@k metric. Defaults to 1.

    Returns:
        Pass@k probability as a float between 0.0 and 1.0.

    Raises:
        ValueError: If n_correct > n_samples.

    Example:
        >>> score = calculate_pass_at_k(n_samples=100, n_correct=85, k=1)
        >>> score
        0.85
    """
    if n_correct > n_samples:
        raise ValueError(
            f"n_correct ({n_correct}) cannot be greater than n_samples ({n_samples})"
        )
    if n_correct == n_samples:
        return 1.0
    if n_samples - n_correct < k:
        return 1.0
    # Unbiased estimator: 1 - prod((n-c-i)/(n-i) for i in range(k))
    product = math.prod((n_samples - n_correct - i) / (n_samples - i) for i in range(k))
    return 1.0 - product
run_kill_switch_gate
run_kill_switch_gate(
    baseline_pass1: float,
    adapter_pass1: float,
    threshold: float = 0.05,
) -> dict[str, object]

Compare baseline vs adapter Pass@1 scores and return a PASS/FAIL verdict.

Computes the relative improvement of the adapter over the baseline and returns PASS if the improvement meets the threshold, FAIL otherwise.

Parameters:

Name Type Description Default
baseline_pass1 float

Pass@1 score for the baseline model (no adapter).

required
adapter_pass1 float

Pass@1 score for the adapter model.

required
threshold float

Minimum required relative improvement (default 0.05 = 5%).

0.05

Returns:

Type Description
dict[str, object]

Dictionary with: - "baseline_pass1": float - "adapter_pass1": float - "relative_delta": float, relative improvement over baseline - "verdict": str, "PASS" or "FAIL"

Example

result = run_kill_switch_gate(0.50, 0.55) result["verdict"] 'PASS'

Source code in libs/evaluation/src/evaluation/metrics.py
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
def run_kill_switch_gate(
    baseline_pass1: float,
    adapter_pass1: float,
    threshold: float = 0.05,
) -> dict[str, object]:
    """Compare baseline vs adapter Pass@1 scores and return a PASS/FAIL verdict.

    Computes the relative improvement of the adapter over the baseline and
    returns PASS if the improvement meets the threshold, FAIL otherwise.

    Args:
        baseline_pass1: Pass@1 score for the baseline model (no adapter).
        adapter_pass1: Pass@1 score for the adapter model.
        threshold: Minimum required relative improvement (default 0.05 = 5%).

    Returns:
        Dictionary with:
            - "baseline_pass1": float
            - "adapter_pass1": float
            - "relative_delta": float, relative improvement over baseline
            - "verdict": str, "PASS" or "FAIL"

    Example:
        >>> result = run_kill_switch_gate(0.50, 0.55)
        >>> result["verdict"]
        'PASS'
    """
    # 1e-9 prevents division-by-zero when baseline is 0.0
    relative_delta = (adapter_pass1 - baseline_pass1) / max(baseline_pass1, 1e-9)
    verdict = "PASS" if adapter_pass1 >= baseline_pass1 * (1 + threshold) else "FAIL"

    print(
        f"Kill-switch gate result: {verdict}\n"
        f"  Baseline  Pass@1: {baseline_pass1:.4f}\n"
        f"  Adapter   Pass@1: {adapter_pass1:.4f}\n"
        f"  Relative delta:   {relative_delta:+.2%} (threshold: {threshold:+.0%})"
    )

    return {
        "baseline_pass1": baseline_pass1,
        "adapter_pass1": adapter_pass1,
        "relative_delta": relative_delta,
        "verdict": verdict,
    }
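A borderline case makes the threshold semantics concrete; the snippet below mirrors the gate's decision rule in a local helper (`kill_switch_verdict` is illustrative, not part of the library):

```python
def kill_switch_verdict(baseline: float, adapter: float, threshold: float = 0.05) -> str:
    """PASS iff the adapter beats the baseline by at least the relative threshold."""
    return "PASS" if adapter >= baseline * (1 + threshold) else "FAIL"

# Default threshold is 5% relative: the bar for a 0.50 baseline is 0.525.
print(kill_switch_verdict(0.50, 0.55))   # +10% relative -> PASS
print(kill_switch_verdict(0.50, 0.52))   # +4% relative  -> FAIL
print(kill_switch_verdict(0.50, 0.525))  # exactly at the bar -> PASS
```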
run_humaneval_subset
run_humaneval_subset(
    adapter_id: Optional[str],
    subset_size: int = 20,
    model: str = "Qwen/Qwen2.5-Coder-7B",
    completions: Optional[dict[str, str]] = None,
) -> dict[str, Any]

Run a HumanEval benchmark subset to evaluate an adapter.

Loads tasks from the bundled 20-task HumanEval subset JSON and evaluates them using the provided completions. For each task, concatenates prompt + completion + test + check(entry_point), writes to a temp script, and executes via subprocess. Exit code 0 = passed.

Parameters:

Name Type Description Default
adapter_id Optional[str]

UUID of the adapter to test (None = baseline, no adapter). Currently informational; inference wiring happens at a higher level.

required
subset_size int

Ignored — always uses the fixed 20-task bundled subset.

20
model str

Base model name (informational only, not used for inference here).

'Qwen/Qwen2.5-Coder-7B'
completions Optional[dict[str, str]]

Dict mapping task_id -> completion string. Only tasks with an entry in this dict are evaluated. If None, raises NotImplementedError (inference is not wired here); an empty dict yields empty results.

None

Returns:

Type Description
dict[str, Any]

Dictionary with benchmark results including:

- "pass_count": int, number of tasks passed
- "fail_count": int, number of tasks failed
- "pass_rate": float, fraction of tasks passed (0.0 to 1.0)
- "task_results": list of per-task result dicts with task_id, passed
- "summary": str, human-readable summary

Raises:

Type Description
NotImplementedError

If completions are not provided (inference not wired).

Example

>>> completions = {"HumanEval/0": "    return []"}
>>> results = run_humaneval_subset(adapter_id=None, completions=completions)
>>> results["pass_count"]
0

Source code in libs/evaluation/src/evaluation/metrics.py
def run_humaneval_subset(
    adapter_id: Optional[str],
    subset_size: int = 20,
    model: str = "Qwen/Qwen2.5-Coder-7B",
    completions: Optional[dict[str, str]] = None,
) -> dict[str, Any]:
    """Run a HumanEval benchmark subset to evaluate an adapter.

    Loads tasks from the bundled 20-task HumanEval subset JSON and evaluates
    them using the provided completions. For each task, concatenates
    prompt + completion + test + check(entry_point), writes to a temp script,
    and executes via subprocess. Exit code 0 = passed.

    Args:
        adapter_id: UUID of the adapter to test (None = baseline, no adapter).
            Currently informational; inference wiring happens at a higher level.
        subset_size: Ignored — always uses the fixed 20-task bundled subset.
        model: Base model name (informational only, not used for inference here).
        completions: Dict mapping task_id -> completion string. Only tasks with
            an entry in this dict are evaluated. If None, raises
            NotImplementedError; an empty dict yields empty results.

    Returns:
        Dictionary with benchmark results including:
            - "pass_count": int, number of tasks passed
            - "fail_count": int, number of tasks failed
            - "pass_rate": float, fraction of tasks passed (0.0 to 1.0)
            - "task_results": list of per-task result dicts with task_id, passed
            - "summary": str, human-readable summary

    Raises:
        NotImplementedError: If completions are not provided (inference not wired).

    Example:
        >>> completions = {"HumanEval/0": "    return []"}
        >>> results = run_humaneval_subset(adapter_id=None, completions=completions)
        >>> results["pass_count"]
        0
    """
    if completions is None:
        raise NotImplementedError(
            "Inference is not wired here; pass pre-generated completions."
        )

    # Load bundled task data
    subset_path = _DATA_DIR / "humaneval_subset.json"
    with subset_path.open() as f:
        all_tasks: list[dict[str, str]] = json.load(f)

    # Build lookup by task_id
    task_map = {t["task_id"]: t for t in all_tasks}

    task_results: list[dict[str, Any]] = []

    with tempfile.TemporaryDirectory() as tmpdir:
        for task_id, completion in completions.items():
            task = task_map.get(task_id)
            if task is None:
                continue

            # Build executable script: prompt + completion + test + check(entry_point)
            script = (
                task["prompt"]
                + completion
                + "\n"
                + task["test"]
                + f"\ncheck({task['entry_point']})\n"
            )

            script_path = Path(tmpdir) / f"{task_id.replace('/', '_')}.py"
            script_path.write_text(script)

            passed = safe_subprocess_run(script_path, cwd=tmpdir)
            task_results.append({"task_id": task_id, "passed": passed})

    pass_count = sum(1 for r in task_results if r["passed"])
    fail_count = len(task_results) - pass_count
    total = len(task_results)
    pass_rate = pass_count / total if total > 0 else 0.0

    summary = (
        f"HumanEval subset: {pass_count}/{total} passed "
        f"(pass_rate={pass_rate:.2%}, adapter_id={adapter_id})"
    )
    print(summary)

    return {
        "pass_count": pass_count,
        "fail_count": fail_count,
        "pass_rate": pass_rate,
        "task_results": task_results,
        "summary": summary,
    }
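The per-task harness can be reproduced end to end with a toy task. Everything below is illustrative (the task record is made up, not from the bundled subset), but the concatenation and subprocess check follow the same pattern as the runner:

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# A made-up task in the HumanEval record shape.
task = {
    "task_id": "Toy/0",
    "prompt": "def add(a, b):\n",
    "test": "def check(fn):\n    assert fn(1, 2) == 3\n",
    "entry_point": "add",
}
completion = "    return a + b\n"

# Same concatenation as the runner: prompt + completion + test + check(entry_point)
script = (
    task["prompt"]
    + completion
    + "\n"
    + task["test"]
    + f"\ncheck({task['entry_point']})\n"
)

with tempfile.TemporaryDirectory() as tmpdir:
    script_path = Path(tmpdir) / f"{task['task_id'].replace('/', '_')}.py"
    script_path.write_text(script)
    proc = subprocess.run([sys.executable, str(script_path)], capture_output=True, cwd=tmpdir)
    passed = proc.returncode == 0

print(passed)  # exit code 0 means the completion passed
```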
score_adapter_quality
score_adapter_quality(
    adapter_id: str,
    pass_rate: float,
    generalization_delta: float | None = None,
) -> float

Compute an overall quality score for an adapter.

Aggregates the adapter's benchmark pass rate with its generalization performance (if available) into a single scalar quality score. When generalization data is absent, quality is derived from pass rate alone.

Parameters:

Name Type Description Default
adapter_id str

UUID of the adapter to score.

required
pass_rate float

Fraction of benchmark tasks passed, in range 0.0 to 1.0.

required
generalization_delta float | None

Optional generalization delta (ood_rate minus in_dist_rate, as returned by compute_generalization_delta). Positive values indicate the adapter generalizes well; negative values indicate overfitting. Defaults to None (not measured).

None

Returns:

Type Description
float

Quality score as a float between 0.0 and 1.0. Higher values indicate
a better overall adapter quality.

Example

>>> score = score_adapter_quality("adapter-001", pass_rate=0.85)
>>> score
0.85
>>> score_with_gen = score_adapter_quality(
...     "adapter-001", 0.85, generalization_delta=0.1,
... )
>>> score_with_gen > score
True

Source code in libs/evaluation/src/evaluation/metrics.py
def score_adapter_quality(
    adapter_id: str,
    pass_rate: float,
    generalization_delta: float | None = None,
) -> float:
    """Compute an overall quality score for an adapter.

    Aggregates the adapter's benchmark pass rate with its generalization
    performance (if available) into a single scalar quality score. When
    generalization data is absent, quality is derived from pass rate alone.

    Args:
        adapter_id: UUID of the adapter to score.
        pass_rate: Fraction of benchmark tasks passed, in range 0.0 to 1.0.
        generalization_delta: Optional generalization delta (ood_rate minus
            in_dist_rate, as returned by compute_generalization_delta).
            Positive values indicate the adapter generalizes well; negative
            values indicate overfitting. Defaults to None (not measured).

    Returns:
        Quality score as a float between 0.0 and 1.0. Higher values indicate
        a better overall adapter quality.

    Example:
        >>> score = score_adapter_quality("adapter-001", pass_rate=0.85)
        >>> score
        0.85
        >>> score_with_gen = score_adapter_quality(
        ...     "adapter-001", 0.85, generalization_delta=0.1,
        ... )
        >>> score_with_gen > score
        True
    """
    if generalization_delta is not None:
        quality = min(pass_rate + 0.1 * max(generalization_delta, 0), 1.0)
    else:
        quality = pass_rate
    return quality
compare_adapters
compare_adapters(
    adapter_ids: list[str], benchmark: str = "humaneval"
) -> dict[str, Any]

Compare multiple adapters head-to-head on a benchmark.

Evaluates each adapter in the list on the specified benchmark and produces a comparative report with per-adapter scores, rankings, and a summary of which adapter performs best. Until live inference is wired, scores are placeholders derived from list order.

Parameters:

Name Type Description Default
adapter_ids list[str]

List of adapter UUIDs to compare. Must contain at least two adapter IDs.

required
benchmark str

Benchmark name to use for comparison. Currently supports "humaneval". Defaults to "humaneval".

'humaneval'

Returns:

Type Description
dict[str, Any]

Dictionary with comparison results including:

- "scores": dict mapping adapter_id to its benchmark score
- "rankings": list of adapter_ids sorted best-to-worst
- "best_adapter": str, UUID of the top-performing adapter
- "summary": str, human-readable comparison summary

Raises:

Type Description
ValueError

If fewer than two adapter IDs are provided.

Example

>>> results = compare_adapters(["adapter-001", "adapter-002"])
>>> results["best_adapter"] in ["adapter-001", "adapter-002"]
True
>>> results["best_adapter"] in results["rankings"]
True

Source code in libs/evaluation/src/evaluation/metrics.py
def compare_adapters(
    adapter_ids: list[str],
    benchmark: str = "humaneval",
) -> dict[str, Any]:
    """Compare multiple adapters head-to-head on a benchmark.

    Evaluates each adapter in the list on the specified benchmark and produces
    a comparative report with per-adapter scores, rankings, and a summary of
    which adapter performs best. Until live inference is wired, scores are
    placeholders derived from list order.

    Args:
        adapter_ids: List of adapter UUIDs to compare. Must contain at least
            two adapter IDs.
        benchmark: Benchmark name to use for comparison. Currently supports
            "humaneval". Defaults to "humaneval".

    Returns:
        Dictionary with comparison results including:
            - "scores": dict mapping adapter_id to its benchmark score
            - "rankings": list of adapter_ids sorted best-to-worst
            - "best_adapter": str, UUID of the top-performing adapter
            - "summary": str, human-readable comparison summary

    Raises:
        ValueError: If fewer than two adapter IDs are provided.

    Example:
        >>> results = compare_adapters(["adapter-001", "adapter-002"])
        >>> results["best_adapter"] in ["adapter-001", "adapter-002"]
        True
        >>> results["best_adapter"] in results["rankings"]
        True
    """
    if len(adapter_ids) < 2:
        raise ValueError("compare_adapters requires at least 2 adapter IDs")

    # Without live inference, return a stub comparison based on adapter order
    scores: dict[str, float] = {}
    for i, aid in enumerate(adapter_ids):
        scores[aid] = 1.0 / (i + 1)  # Placeholder scoring

    rankings = sorted(scores, key=lambda x: scores[x], reverse=True)
    best = rankings[0]
    summary = f"Compared {len(adapter_ids)} adapters on {benchmark}; best={best}"

    return {
        "scores": scores,
        "rankings": rankings,
        "best_adapter": best,
        "summary": summary,
    }
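Once real per-adapter scores are available, the report reduces to a sort; the sketch below (with made-up scores) mirrors how the "rankings" and "best_adapter" fields relate to "scores":

```python
# Hypothetical benchmark scores keyed by adapter UUID.
scores = {"adapter-001": 0.72, "adapter-002": 0.81, "adapter-003": 0.65}

# Best-to-worst ranking, as in the "rankings" field of the return value.
rankings = sorted(scores, key=scores.get, reverse=True)
best_adapter = rankings[0]

print(rankings)      # ['adapter-002', 'adapter-001', 'adapter-003']
print(best_adapter)  # adapter-002
```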
test_generalization
test_generalization(
    adapter_id: str,
    in_distribution_tasks: list[str] | None = None,
    ood_tasks: list[str] | None = None,
) -> dict[str, Any]

Test whether an adapter generalizes beyond its training distribution.

Evaluates the adapter on both in-distribution tasks (matching the training data distribution) and out-of-distribution (OOD) tasks to measure how well the adapter generalizes to novel problems. Until live inference is wired, the scores are fixed placeholders.

Parameters:

Name Type Description Default
adapter_id str

UUID of the adapter to evaluate.

required
in_distribution_tasks list[str] | None

Optional list of task IDs matching the adapter's training distribution. If None, uses a default in-distribution set.

None
ood_tasks list[str] | None

Optional list of out-of-distribution task IDs to test generalization on. If None, uses a default OOD task set.

None

Returns:

Type Description
dict[str, Any]

Dictionary with generalization results including:

- "in_distribution_score": float, performance on training-distribution tasks
- "ood_score": float, performance on out-of-distribution tasks
- "generalization_delta": float, difference (in_distribution - ood)
- "generalizes": bool, True if generalization_delta is within threshold

Example

>>> results = test_generalization("adapter-001")
>>> round(results["generalization_delta"], 2)
0.2
>>> results["generalizes"]
True

Source code in libs/evaluation/src/evaluation/metrics.py
def test_generalization(
    adapter_id: str,
    in_distribution_tasks: list[str] | None = None,
    ood_tasks: list[str] | None = None,
) -> dict[str, Any]:
    """Test whether an adapter generalizes beyond its training distribution.

    Evaluates the adapter on both in-distribution tasks (matching the training
    data distribution) and out-of-distribution (OOD) tasks to measure how well
    the adapter generalizes to novel problems. Until live inference is wired,
    the scores are fixed placeholders.

    Args:
        adapter_id: UUID of the adapter to evaluate.
        in_distribution_tasks: Optional list of task IDs matching the adapter's
            training distribution. If None, uses a default in-distribution set.
        ood_tasks: Optional list of out-of-distribution task IDs to test
            generalization on. If None, uses a default OOD task set.

    Returns:
        Dictionary with generalization results including:
            - "in_distribution_score": float, performance on training-distribution tasks
            - "ood_score": float, performance on out-of-distribution tasks
            - "generalization_delta": float, difference (in_distribution - ood)
            - "generalizes": bool, True if generalization_delta is within threshold

    Example:
        >>> results = test_generalization("adapter-001")
        >>> round(results["generalization_delta"], 2)
        0.2
        >>> results["generalizes"]
        True
    """
    in_dist_score = 0.8  # Placeholder until live inference wired
    ood_score = 0.6
    # Round to avoid float noise (0.8 - 0.6 == 0.20000000000000007), which
    # would otherwise fail the <= 0.2 threshold check below.
    gen_delta = round(in_dist_score - ood_score, 6)
    generalizes = abs(gen_delta) <= 0.2

    return {
        "adapter_id": adapter_id,
        "in_distribution_score": in_dist_score,
        "ood_score": ood_score,
        "generalization_delta": gen_delta,
        "generalizes": generalizes,
    }
evaluate_fitness
evaluate_fitness(
    adapter_id: str,
    pass_rate: float,
    diversity_score: float = 0.0,
) -> float

Calculate evolutionary fitness score for the evolution operator.

Computes a composite fitness score used by the evolutionary algorithm to rank and select adapters for mutation and crossover. Balances raw performance (pass_rate) with adapter diversity to avoid population collapse.

Parameters:

Name Type Description Default
adapter_id str

UUID of the adapter to evaluate.

required
pass_rate float

Fraction of benchmark tasks passed, in range 0.0 to 1.0.

required
diversity_score float

Adapter uniqueness metric representing how different this adapter is from others in the current population. Range 0.0 to 1.0; higher values indicate more unique adapters. Defaults to 0.0.

0.0

Returns:

Type Description
float

Fitness score as a float between 0.0 and 1.0. Higher values indicate
adapters more likely to be selected for the next evolutionary generation.

Example

>>> fitness = evaluate_fitness(
...     "adapter-001", pass_rate=0.85, diversity_score=0.3,
... )
>>> round(fitness, 3)
0.685
>>> low_diversity = evaluate_fitness(
...     "adapter-001", pass_rate=0.85, diversity_score=0.0,
... )
>>> low_diversity < fitness
True

Source code in libs/evaluation/src/evaluation/metrics.py
def evaluate_fitness(
    adapter_id: str,
    pass_rate: float,
    diversity_score: float = 0.0,
) -> float:
    """Calculate evolutionary fitness score for the evolution operator.

    Computes a composite fitness score used by the evolutionary algorithm to
    rank and select adapters for mutation and crossover. Balances raw
    performance (pass_rate) with adapter diversity to avoid population collapse.

    Args:
        adapter_id: UUID of the adapter to evaluate.
        pass_rate: Fraction of benchmark tasks passed, in range 0.0 to 1.0.
        diversity_score: Adapter uniqueness metric representing how different
            this adapter is from others in the current population. Range 0.0
            to 1.0; higher values indicate more unique adapters. Defaults to 0.0.

    Returns:
        Fitness score as a float between 0.0 and 1.0. Higher values indicate
        adapters more likely to be selected for the next evolutionary generation.

    Example:
        >>> fitness = evaluate_fitness(
        ...     "adapter-001", pass_rate=0.85, diversity_score=0.3,
        ... )
        >>> round(fitness, 3)
        0.685
        >>> low_diversity = evaluate_fitness(
        ...     "adapter-001", pass_rate=0.85, diversity_score=0.0,
        ... )
        >>> low_diversity < fitness
        True
    """
    fitness = 0.7 * pass_rate + 0.3 * diversity_score
    return fitness
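The 70/30 weighting means diversity can promote a slightly weaker adapter over a near-clone, which is the point of the term. A quick check (local `fitness` mirror, illustrative values):

```python
def fitness(pass_rate: float, diversity_score: float = 0.0) -> float:
    """Mirror of the composite rule: 70% performance, 30% diversity."""
    return 0.7 * pass_rate + 0.3 * diversity_score

clone = fitness(0.85, 0.0)   # strong pass rate, but redundant in the population
novel = fitness(0.80, 0.40)  # slightly weaker, but distinct

# The distinct adapter wins selection despite the lower pass rate.
print(novel > clone)  # True
```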

ood_benchmark

Out-of-distribution benchmark for adapter generalization testing.

Provides functions for evaluating adapter performance on tasks outside the training distribution, measuring generalization capability.

Functions
run_ood_benchmark
run_ood_benchmark(
    adapter_id: str | None,
    completions: dict[str, str],
    benchmark_name: str = "ood_python",
) -> dict[str, Any]

Run an out-of-distribution benchmark on provided completions.

Evaluates completions against OOD tasks from the bundled task set. Each task's prompt + completion is executed in a subprocess with its test harness.

Parameters:

Name Type Description Default
adapter_id str | None

UUID of the adapter being tested (informational).

required
completions dict[str, str]

Dict mapping task_id to completion string.

required
benchmark_name str

Name of the OOD benchmark set.

'ood_python'

Returns:

Type Description
dict[str, Any]

Dictionary with ood_pass_rate and per-task results.

Source code in libs/evaluation/src/evaluation/ood_benchmark.py
def run_ood_benchmark(
    adapter_id: str | None,
    completions: dict[str, str],
    benchmark_name: str = "ood_python",
) -> dict[str, Any]:
    """Run an out-of-distribution benchmark on provided completions.

    Evaluates completions against OOD tasks from the bundled task set.
    Each task's prompt + completion is executed in a subprocess with its
    test harness.

    Args:
        adapter_id: UUID of the adapter being tested (informational).
        completions: Dict mapping task_id to completion string.
        benchmark_name: Name of the OOD benchmark set.

    Returns:
        Dictionary with ood_pass_rate and per-task results.
    """
    ood_path = _OOD_DATA_DIR / "ood_tasks.json"
    with ood_path.open() as f:
        all_tasks: list[dict[str, str]] = json.load(f)

    task_map = {t["task_id"]: t for t in all_tasks}
    task_results: list[dict[str, Any]] = []

    with tempfile.TemporaryDirectory() as tmpdir:
        for task_id, completion in completions.items():
            task = task_map.get(task_id)
            if task is None:
                continue

            script = task["prompt"] + completion + "\n" + task["test"] + "\n"
            script_path = Path(tmpdir) / f"{task_id.replace('/', '_')}.py"
            script_path.write_text(script)

            passed = safe_subprocess_run(script_path, cwd=tmpdir)

            task_results.append({"task_id": task_id, "passed": passed})

    total = len(task_results)
    pass_count = sum(1 for r in task_results if r["passed"])
    ood_pass_rate = pass_count / total if total > 0 else 0.0

    return {
        "adapter_id": adapter_id,
        "benchmark_name": benchmark_name,
        "ood_pass_rate": ood_pass_rate,
        "pass_count": pass_count,
        "total": total,
        "task_results": task_results,
    }
compute_generalization_delta
compute_generalization_delta(
    in_dist_rate: float, ood_rate: float
) -> float

Compute the generalization delta between in-distribution and OOD performance.

A positive delta means the adapter does better on OOD tasks relative to in-distribution performance. A negative delta indicates overfitting.

Parameters:

Name Type Description Default
in_dist_rate float

Pass rate on in-distribution tasks (0.0 to 1.0).

required
ood_rate float

Pass rate on OOD tasks (0.0 to 1.0).

required

Returns:

Type Description
float

Generalization delta as ood_rate - in_dist_rate.

Source code in libs/evaluation/src/evaluation/ood_benchmark.py
def compute_generalization_delta(
    in_dist_rate: float,
    ood_rate: float,
) -> float:
    """Compute the generalization delta between in-distribution and OOD performance.

    A positive delta means the adapter does better on OOD tasks relative to
    in-distribution performance. A negative delta indicates overfitting.

    Args:
        in_dist_rate: Pass rate on in-distribution tasks (0.0 to 1.0).
        ood_rate: Pass rate on OOD tasks (0.0 to 1.0).

    Returns:
        Generalization delta as ood_rate - in_dist_rate.
    """
    return ood_rate - in_dist_rate
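The usual pattern is to compare an OOD pass rate against the in-distribution one. A minimal sketch with made-up rates (`generalization_delta` is a local mirror of the function above; the 0.2 tolerance is illustrative, not a library constant):

```python
def generalization_delta(in_dist_rate: float, ood_rate: float) -> float:
    """Mirror of compute_generalization_delta: positive = better on OOD."""
    return ood_rate - in_dist_rate

delta = generalization_delta(in_dist_rate=0.80, ood_rate=0.65)

# A negative delta suggests overfitting to the training distribution;
# an illustrative tolerance of 0.2 still accepts this adapter.
overfits = delta < 0
within_tolerance = abs(delta) <= 0.2
print(f"delta={delta:+.2f} overfits={overfits} within_tolerance={within_tolerance}")
```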

utils

Shared subprocess execution utilities for evaluation benchmarks.

Functions
safe_subprocess_run
safe_subprocess_run(
    script_path: Path, cwd: str, timeout: int = 30
) -> bool

Run a Python script in a subprocess and return whether it passed.

Handles subprocess.TimeoutExpired by returning False. Used by both HumanEval and OOD benchmark runners to avoid duplicated try/except + subprocess.run boilerplate.

Parameters:

Name Type Description Default
script_path Path

Path to the Python script to execute.

required
cwd str

Working directory for the subprocess.

required
timeout int

Maximum execution time in seconds before the process is killed.

30

Returns:

Type Description
bool

True if the process exits with return code 0, False otherwise
(including timeout).

Example

>>> from pathlib import Path
>>> passed = safe_subprocess_run(Path("/tmp/test.py"), cwd="/tmp")
>>> isinstance(passed, bool)
True

Source code in libs/evaluation/src/evaluation/utils.py
def safe_subprocess_run(
    script_path: Path,
    cwd: str,
    timeout: int = 30,
) -> bool:
    """Run a Python script in a subprocess and return whether it passed.

    Handles ``subprocess.TimeoutExpired`` by returning ``False``.
    Used by both HumanEval and OOD benchmark runners to avoid duplicated
    try/except + subprocess.run boilerplate.

    Args:
        script_path: Path to the Python script to execute.
        cwd: Working directory for the subprocess.
        timeout: Maximum execution time in seconds before the process is killed.

    Returns:
        ``True`` if the process exits with return code 0, ``False`` otherwise
        (including timeout).

    Example:
        >>> from pathlib import Path
        >>> passed = safe_subprocess_run(Path("/tmp/test.py"), cwd="/tmp")
        >>> isinstance(passed, bool)
        True
    """
    try:
        proc = subprocess.run(
            [sys.executable, str(script_path)],
            capture_output=True,
            text=True,
            timeout=timeout,
            cwd=cwd,
        )
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False
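A self-contained run of the helper's contract; this sketch inlines an equivalent `run_script` (the pass/fail scripts are made up for the demo):

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def run_script(script_path: Path, cwd: str, timeout: int = 30) -> bool:
    """Equivalent of safe_subprocess_run: True iff exit code 0 within timeout."""
    try:
        proc = subprocess.run(
            [sys.executable, str(script_path)],
            capture_output=True,
            text=True,
            timeout=timeout,
            cwd=cwd,
        )
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False

with tempfile.TemporaryDirectory() as tmpdir:
    ok = Path(tmpdir) / "ok.py"
    ok.write_text("print('fine')\n")
    bad = Path(tmpdir) / "bad.py"
    bad.write_text("raise SystemExit(1)\n")

    ok_passed = run_script(ok, cwd=tmpdir)
    bad_passed = run_script(bad, cwd=tmpdir)

print(ok_passed, bad_passed)  # True False
```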