API Reference for evaluation¶
evaluation ¶
Evaluation library for adapter benchmarking and fitness scoring.
Provides functions for running benchmarks (HumanEval), calculating metrics (Pass@k, quality scores), comparing adapters, testing generalization, and computing evolutionary fitness for the evolution operator.
Functions¶
calculate_pass_at_k ¶
```python
calculate_pass_at_k(
    n_samples: int, n_correct: int, k: int = 1
) -> float
```

Calculate the Pass@k metric for code generation evaluation.

Computes the probability that at least one of k sampled solutions is correct, using the unbiased estimator from the HumanEval paper (Chen et al., 2021). This avoids the high-variance naive estimator.

Formula: `1 - prod((n - c - i) / (n - i) for i in range(k))`

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `n_samples` | `int` | Total number of samples generated per problem. | *required* |
| `n_correct` | `int` | Number of correct samples out of `n_samples`. | *required* |
| `k` | `int` | Number of attempts allowed for the Pass@k metric. | `1` |

Returns:

| Type | Description |
|---|---|
| `float` | Pass@k probability as a float between 0.0 and 1.0. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If `n_correct > n_samples`. |

Example

```python
>>> score = calculate_pass_at_k(n_samples=100, n_correct=85, k=1)
>>> score
0.85
```

Source code in `libs/evaluation/src/evaluation/metrics.py`, lines 21-56
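As a cross-check of the formula above, the unbiased estimator can be sketched in a few lines. This is a minimal reference implementation for illustration, not the library's source:

```python
def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k: 1 - C(n-c, k) / C(n, k), per Chen et al. (2021)."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k
        # samples is guaranteed to include a correct one.
        return 1.0
    prod = 1.0
    for i in range(k):
        prod *= (n - c - i) / (n - i)
    return 1.0 - prod

# Pass@1 reduces to the raw accuracy c/n:
assert abs(pass_at_k(100, 85, 1) - 0.85) < 1e-12
```

The early return also guards the product against producing negative factors when `n - c < k`.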
compare_adapters ¶
```python
compare_adapters(
    adapter_ids: list[str], benchmark: str = "humaneval"
) -> dict[str, Any]
```

Compare multiple adapters head-to-head on a benchmark.

Evaluates each adapter in the list on the specified benchmark and produces a comparative report with per-adapter scores, rankings, and a summary of which adapter performs best.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `adapter_ids` | `list[str]` | List of adapter UUIDs to compare. Must contain at least two adapter IDs. | *required* |
| `benchmark` | `str` | Benchmark name to use for comparison. Currently supports "humaneval". | `'humaneval'` |

Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | Dictionary with comparison results including: "scores" (dict mapping adapter_id to its benchmark score), "rankings" (list of adapter_ids sorted best-to-worst), "best_adapter" (str, UUID of the top-performing adapter), and "summary" (str, human-readable comparison summary). |

Raises:

| Type | Description |
|---|---|
| `NotImplementedError` | `compare_adapters` is not yet implemented. |

Example

```python
>>> results = compare_adapters(["adapter-001", "adapter-002"])
>>> results["best_adapter"] in ["adapter-001", "adapter-002"]
True
>>> results["best_adapter"] in results["rankings"]
True
```

Source code in `libs/evaluation/src/evaluation/metrics.py`, lines 241-291
evaluate_fitness ¶
```python
evaluate_fitness(
    adapter_id: str,
    pass_rate: float,
    diversity_score: float = 0.0,
) -> float
```

Calculate evolutionary fitness score for the evolution operator.

Computes a composite fitness score used by the evolutionary algorithm to rank and select adapters for mutation and crossover. Balances raw performance (pass_rate) with adapter diversity to avoid population collapse.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `adapter_id` | `str` | UUID of the adapter to evaluate. | *required* |
| `pass_rate` | `float` | Fraction of benchmark tasks passed, in range 0.0 to 1.0. | *required* |
| `diversity_score` | `float` | Adapter uniqueness metric representing how different this adapter is from others in the current population. Range 0.0 to 1.0; higher values indicate more unique adapters. | `0.0` |

Returns:

| Type | Description |
|---|---|
| `float` | Fitness score as a float between 0.0 and 1.0. Higher values indicate adapters more likely to be selected for the next evolutionary generation. |

Raises:

| Type | Description |
|---|---|
| `NotImplementedError` | `evaluate_fitness` is not yet implemented. |

Example

```python
>>> fitness = evaluate_fitness(
...     "adapter-001", pass_rate=0.85, diversity_score=0.3,
... )
>>> fitness
0.795
>>> low_diversity = evaluate_fitness(
...     "adapter-001", pass_rate=0.85, diversity_score=0.0,
... )
>>> low_diversity < fitness
True
```

Source code in `libs/evaluation/src/evaluation/metrics.py`, lines 343-381
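The function currently raises NotImplementedError, but the docstring example (0.795 from pass_rate=0.85 and diversity_score=0.3) is consistent with a simple weighted blend. A hypothetical sketch follows; the 0.9/0.1 weights are an assumption inferred from that example, not the project's actual formula:

```python
def fitness_sketch(pass_rate: float, diversity_score: float = 0.0) -> float:
    # Hypothetical weighting: raw performance dominates, diversity nudges
    # selection away from population collapse. Weights are assumed, not
    # taken from the library source.
    return 0.9 * pass_rate + 0.1 * diversity_score

assert abs(fitness_sketch(0.85, 0.3) - 0.795) < 1e-9
```

Any convex combination of the two inputs keeps the result in [0.0, 1.0], matching the documented return range.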
run_humaneval_subset ¶
```python
run_humaneval_subset(
    adapter_id: Optional[str],
    subset_size: int = 20,
    model: str = "Qwen/Qwen2.5-Coder-7B",
    completions: Optional[dict[str, str]] = None,
) -> dict[str, Any]
```

Run a HumanEval benchmark subset to evaluate an adapter.

Loads tasks from the bundled 20-task HumanEval subset JSON and evaluates them using the provided completions. For each task, concatenates prompt + completion + test + check(entry_point), writes the result to a temporary script, and executes it via subprocess. Exit code 0 means the task passed.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `adapter_id` | `Optional[str]` | UUID of the adapter to test (None = baseline, no adapter). Currently informational; inference wiring happens at a higher level. | *required* |
| `subset_size` | `int` | Ignored; the fixed 20-task bundled subset is always used. | `20` |
| `model` | `str` | Base model name (informational only, not used for inference here). | `'Qwen/Qwen2.5-Coder-7B'` |
| `completions` | `Optional[dict[str, str]]` | Dict mapping task_id to completion string. Only tasks with an entry in this dict are evaluated. If None or empty, returns empty results. | `None` |

Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | Dictionary with benchmark results including: "pass_count" (int, number of tasks passed), "fail_count" (int, number of tasks failed), "pass_rate" (float, fraction of tasks passed, 0.0 to 1.0), "task_results" (list of per-task result dicts with task_id and passed), and "summary" (str, human-readable summary). |

Raises:

| Type | Description |
|---|---|
| `NotImplementedError` | If completions are not provided (inference not wired). |

Example

```python
>>> completions = {"HumanEval/0": "    return []"}
>>> results = run_humaneval_subset(adapter_id=None, completions=completions)
>>> results["pass_count"]
0
```

Source code in `libs/evaluation/src/evaluation/metrics.py`, lines 105-195
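The per-task check described above can be sketched as follows. This is an illustrative harness, not the library's code; the `prompt`/`test`/`entry_point` task keys are assumed to follow the HumanEval JSON format:

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def run_task(task: dict, completion: str, timeout: int = 30) -> bool:
    # Assemble prompt + completion + test + check(entry_point), as the
    # docstring describes, and execute the script in isolation.
    script = (
        task["prompt"] + completion + "\n"
        + task["test"] + "\n"
        + f"check({task['entry_point']})\n"
    )
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "task.py"
        path.write_text(script)
        try:
            proc = subprocess.run(
                [sys.executable, str(path)],
                cwd=tmp, timeout=timeout, capture_output=True,
            )
            return proc.returncode == 0  # exit code 0 = passed
        except subprocess.TimeoutExpired:
            return False
```

Running the test file in a subprocess keeps a crashing or hanging completion from taking down the evaluator process itself.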
run_kill_switch_gate ¶
```python
run_kill_switch_gate(
    baseline_pass1: float,
    adapter_pass1: float,
    threshold: float = 0.05,
) -> dict[str, object]
```

Compare baseline vs adapter Pass@1 scores and return a PASS/FAIL verdict.

Computes the relative improvement of the adapter over the baseline and returns PASS if the improvement meets the threshold, FAIL otherwise.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `baseline_pass1` | `float` | Pass@1 score for the baseline model (no adapter). | *required* |
| `adapter_pass1` | `float` | Pass@1 score for the adapter model. | *required* |
| `threshold` | `float` | Minimum required relative improvement (0.05 = 5%). | `0.05` |

Returns:

| Type | Description |
|---|---|
| `dict[str, object]` | Dictionary with: "baseline_pass1" (float), "adapter_pass1" (float), "relative_delta" (float, relative improvement over baseline), and "verdict" (str, "PASS" or "FAIL"). |

Example

```python
>>> result = run_kill_switch_gate(0.50, 0.55)
>>> result["verdict"]
'PASS'
```

Source code in `libs/evaluation/src/evaluation/metrics.py`, lines 59-102
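The gate reduces to a relative-delta comparison. A minimal sketch, with field names following the Returns table above (the zero-baseline guard is an added assumption; the real function may behave differently there):

```python
def kill_switch_gate(
    baseline_pass1: float, adapter_pass1: float, threshold: float = 0.05
) -> dict[str, object]:
    # Relative improvement over baseline, e.g. 0.50 -> 0.55 is +10%.
    if baseline_pass1 > 0:
        relative_delta = (adapter_pass1 - baseline_pass1) / baseline_pass1
    else:
        relative_delta = 0.0  # assumed guard against division by zero
    return {
        "baseline_pass1": baseline_pass1,
        "adapter_pass1": adapter_pass1,
        "relative_delta": relative_delta,
        "verdict": "PASS" if relative_delta >= threshold else "FAIL",
    }

assert kill_switch_gate(0.50, 0.55)["verdict"] == "PASS"
```

Note that the threshold applies to *relative* improvement: a move from 0.50 to 0.51 is only +2% relative and would fail the default 5% gate.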
score_adapter_quality ¶
```python
score_adapter_quality(
    adapter_id: str,
    pass_rate: float,
    generalization_delta: float | None = None,
) -> float
```

Compute an overall quality score for an adapter.

Aggregates the adapter's benchmark pass rate with its generalization performance (if available) into a single scalar quality score. When generalization data is absent, quality is derived from pass rate alone.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `adapter_id` | `str` | UUID of the adapter to score. | *required* |
| `pass_rate` | `float` | Fraction of benchmark tasks passed, in range 0.0 to 1.0. | *required* |
| `generalization_delta` | `float \| None` | Optional difference between in-distribution and out-of-distribution (OOD) performance. Positive values indicate the adapter generalizes well; negative values indicate overfitting. None means not measured. | `None` |

Returns:

| Type | Description |
|---|---|
| `float` | Quality score as a float between 0.0 and 1.0. Higher values indicate better overall adapter quality. |

Raises:

| Type | Description |
|---|---|
| `NotImplementedError` | `score_adapter_quality` is not yet implemented. |

Example

```python
>>> score = score_adapter_quality("adapter-001", pass_rate=0.85)
>>> score
0.85
>>> score_with_gen = score_adapter_quality(
...     "adapter-001", 0.85, generalization_delta=0.1,
... )
>>> score_with_gen > score
True
```

Source code in `libs/evaluation/src/evaluation/metrics.py`, lines 198-238
test_generalization ¶
```python
test_generalization(
    adapter_id: str,
    in_distribution_tasks: list[str] | None = None,
    ood_tasks: list[str] | None = None,
) -> dict[str, Any]
```

Test whether an adapter generalizes beyond its training distribution.

Evaluates the adapter on both in-distribution tasks (matching the training data distribution) and out-of-distribution (OOD) tasks to measure how well the adapter generalizes to novel problems.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `adapter_id` | `str` | UUID of the adapter to evaluate. | *required* |
| `in_distribution_tasks` | `list[str] \| None` | Optional list of task IDs matching the adapter's training distribution. If None, uses a default in-distribution set. | `None` |
| `ood_tasks` | `list[str] \| None` | Optional list of out-of-distribution task IDs to test generalization on. If None, uses a default OOD task set. | `None` |

Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | Dictionary with generalization results including: "in_distribution_score" (float, performance on training-distribution tasks), "ood_score" (float, performance on out-of-distribution tasks), "generalization_delta" (float, the difference in_distribution - ood), and "generalizes" (bool, True if generalization_delta is within threshold). |

Raises:

| Type | Description |
|---|---|
| `NotImplementedError` | `test_generalization` is not yet implemented. |

Example

```python
>>> results = test_generalization("adapter-001")
>>> results["generalization_delta"]
0.05
>>> results["generalizes"]
True
```

Source code in `libs/evaluation/src/evaluation/metrics.py`, lines 294-340
compute_generalization_delta ¶
```python
compute_generalization_delta(
    in_dist_rate: float, ood_rate: float
) -> float
```

Compute the generalization delta between in-distribution and OOD performance.

A positive delta means the adapter does better on OOD tasks relative to in-distribution performance. A negative delta indicates overfitting.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `in_dist_rate` | `float` | Pass rate on in-distribution tasks (0.0 to 1.0). | *required* |
| `ood_rate` | `float` | Pass rate on OOD tasks (0.0 to 1.0). | *required* |

Returns:

| Type | Description |
|---|---|
| `float` | Generalization delta, computed as `ood_rate - in_dist_rate`. |

Source code in `libs/evaluation/src/evaluation/ood_benchmark.py`, lines 73-89
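The computation is a single subtraction with the sign convention above; for example:

```python
def generalization_delta(in_dist_rate: float, ood_rate: float) -> float:
    # Negative delta: worse on OOD tasks than in-distribution,
    # i.e. likely overfitting to the training distribution.
    return ood_rate - in_dist_rate

assert abs(generalization_delta(0.80, 0.70) + 0.10) < 1e-9  # overfitting
```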
run_ood_benchmark ¶
```python
run_ood_benchmark(
    adapter_id: str | None,
    completions: dict[str, str],
    benchmark_name: str = "ood_python",
) -> dict[str, Any]
```

Run an out-of-distribution benchmark on provided completions.

Evaluates completions against OOD tasks from the bundled task set. Each task's prompt + completion is executed in a subprocess with its test harness.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `adapter_id` | `str \| None` | UUID of the adapter being tested (informational). | *required* |
| `completions` | `dict[str, str]` | Dict mapping task_id to completion string. | *required* |
| `benchmark_name` | `str` | Name of the OOD benchmark set. | `'ood_python'` |

Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | Dictionary with ood_pass_rate and per-task results. |

Source code in `libs/evaluation/src/evaluation/ood_benchmark.py`, lines 19-70
Modules¶
metrics ¶
Evaluation metrics for adapter benchmarking and fitness scoring.
Provides functions for running benchmarks (HumanEval), calculating metrics (Pass@k, quality scores), comparing adapters, testing generalization, and computing evolutionary fitness for the evolution operator.
ood_benchmark ¶
Out-of-distribution benchmark for adapter generalization testing.
Provides functions for evaluating adapter performance on tasks outside the training distribution, measuring generalization capability.
utils ¶
Shared subprocess execution utilities for evaluation benchmarks.
Functions¶
safe_subprocess_run ¶
```python
safe_subprocess_run(
    script_path: Path, cwd: str, timeout: int = 30
) -> bool
```

Run a Python script in a subprocess and return whether it passed.

Handles subprocess.TimeoutExpired by returning False. Used by both the HumanEval and OOD benchmark runners to avoid duplicated try/except + subprocess.run boilerplate.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `script_path` | `Path` | Path to the Python script to execute. | *required* |
| `cwd` | `str` | Working directory for the subprocess. | *required* |
| `timeout` | `int` | Maximum execution time in seconds before the process is killed. | `30` |

Returns:

| Type | Description |
|---|---|
| `bool` | True if the script exited with code 0, False otherwise (including timeout). |

Example

```python
>>> from pathlib import Path
>>> passed = safe_subprocess_run(Path("/tmp/test.py"), cwd="/tmp")
>>> isinstance(passed, bool)
True
```

Source code in `libs/evaluation/src/evaluation/utils.py`, lines 10-46
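A sketch of the helper as documented (exit code 0 means pass; a timeout counts as failure rather than raising). This is illustrative, not the library source; using `sys.executable` as the interpreter is an assumption:

```python
import subprocess
import sys
from pathlib import Path

def safe_subprocess_run(script_path: Path, cwd: str, timeout: int = 30) -> bool:
    try:
        proc = subprocess.run(
            [sys.executable, str(script_path)],
            cwd=cwd, timeout=timeout, capture_output=True,
        )
    except subprocess.TimeoutExpired:
        # Hung scripts are treated as failures, not errors.
        return False
    return proc.returncode == 0
```

Centralizing the try/except here means both benchmark runners share one definition of "passed", so a timeout is counted identically everywhere.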