API Reference for inference

inference

Inference provider library for LLM generation and LoRA adapter management.

Provides a provider-agnostic interface (InferenceProvider) with vLLM, Ollama, llama.cpp, and Transformers backends, a factory for backend selection by configuration, and structured generation results.

Provider classes (OllamaProvider, VLLMProvider) are lazily imported to avoid hard failures when the openai package is not installed (e.g. in CI).
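
A minimal end-to-end sketch, assuming a vLLM server is already running at the default VLLM_BASE_URL and that the openai package is installed; the model name is illustrative:

import asyncio

from inference.factory import get_provider


async def main() -> None:
    # Backend selected by argument (or via the INFERENCE_PROVIDER env var).
    provider = get_provider("vllm")
    result = await provider.generate(
        "def fibonacci(n: int) -> int:",
        model="Qwen/Qwen2.5-Coder-7B",  # illustrative model name
        max_tokens=256,
    )
    print(result.text, result.finish_reason)


asyncio.run(main())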

Classes

UnsupportedOperationError

Bases: Exception

Raised when a provider does not support the requested operation.

Used primarily by OllamaProvider to signal that LoRA adapter operations are not available for Ollama-based inference.

Example

raise UnsupportedOperationError("OllamaProvider does not support adapters.")

GenerationResult dataclass

GenerationResult(
    text: str,
    model: str,
    adapter_id: str | None,
    token_count: int,
    finish_reason: str,
)

Structured result returned by InferenceProvider.generate().

Attributes:

Name Type Description
text str

The generated text output from the model.

model str

The model identifier used for generation.

adapter_id str | None

The LoRA adapter applied during generation, or None if no adapter was used.

token_count int

Total number of tokens consumed (prompt + completion).

finish_reason str

Reason generation stopped (e.g. "stop", "length").

Example

result = GenerationResult(
    text="def hello(): pass",
    model="Qwen/Qwen2.5-Coder-7B",
    adapter_id=None,
    token_count=10,
    finish_reason="stop",
)

InferenceProvider

Bases: ABC

Abstract base class for inference providers.

Defines a provider-agnostic API for text generation and LoRA adapter lifecycle management. All methods are async because every provider communicates over HTTP.

Concrete implementations
  • VLLMProvider: Full LoRA support via vLLM's dynamic loading API.
  • OllamaProvider: Base-model inference only; adapter ops raise UnsupportedOperationError.
  • LlamaCppProvider: Local GGUF inference via llama-cpp-python; LoRA adapters applied at model load time.
  • TransformersProvider: Local HuggingFace inference with PEFT LoRA adapters.
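
Callers that may run against either backend can treat adapter support as optional; a small sketch using UnsupportedOperationError from inference.exceptions:

from inference.exceptions import UnsupportedOperationError
from inference.provider import InferenceProvider


async def try_load_adapter(
    provider: InferenceProvider, adapter_id: str, adapter_path: str
) -> bool:
    """Attempt to load an adapter, falling back to the base model if unsupported."""
    try:
        await provider.load_adapter(adapter_id, adapter_path)
    except UnsupportedOperationError:
        # e.g. OllamaProvider: adapter ops are unavailable; use the base model.
        return False
    return True
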
Functions
generate abstractmethod async
generate(
    prompt: str,
    model: str,
    adapter_id: str | None = None,
    max_tokens: int = 4096,
    system_prompt: str | None = None,
    temperature: float | None = None,
    top_p: float | None = None,
    repetition_penalty: float | None = None,
) -> GenerationResult

Generate text from a prompt.

Parameters:

Name Type Description Default
prompt str

The user-facing input prompt.

required
model str

The model identifier to use for generation.

required
adapter_id str | None

Optional LoRA adapter to apply during generation. If None, uses the base model directly.

None
max_tokens int

Maximum number of tokens to generate.

4096
system_prompt str | None

Optional system-level instruction. Providers that support chat templates will format this as a system message.

None
temperature float | None

Sampling temperature override.

None
top_p float | None

Nucleus sampling threshold override.

None
repetition_penalty float | None

Repetition penalty override.

None

Returns:

Type Description
GenerationResult

A GenerationResult containing the generated text and metadata.

Example

result = await provider.generate("def hello", model="Qwen2.5-Coder-7B") print(result.text)

Source code in libs/inference/src/inference/provider.py
@abstractmethod
async def generate(
    self,
    prompt: str,
    model: str,
    adapter_id: str | None = None,
    max_tokens: int = 4096,
    system_prompt: str | None = None,
    temperature: float | None = None,
    top_p: float | None = None,
    repetition_penalty: float | None = None,
) -> GenerationResult:
    """Generate text from a prompt.

    Args:
        prompt: The user-facing input prompt.
        model: The model identifier to use for generation.
        adapter_id: Optional LoRA adapter to apply during generation.
            If None, uses the base model directly.
        max_tokens: Maximum number of tokens to generate.
        system_prompt: Optional system-level instruction. Providers that
            support chat templates will format this as a system message.

        temperature: Sampling temperature override.
        top_p: Nucleus sampling threshold override.
        repetition_penalty: Repetition penalty override.

    Returns:
        A GenerationResult containing the generated text and metadata.

    Example:
        >>> result = await provider.generate("def hello", model="Qwen2.5-Coder-7B")
        >>> print(result.text)
    """
    ...
load_adapter abstractmethod async
load_adapter(adapter_id: str, adapter_path: str) -> None

Load a LoRA adapter into the inference server.

Parameters:

Name Type Description Default
adapter_id str

Unique name for the adapter (used as lora_name in vLLM).

required
adapter_path str

Filesystem path to the adapter weights directory.

required

Raises:

Type Description
UnsupportedOperationError

If the provider does not support adapters.

HTTPStatusError

If the server returns an error response.

Example

await provider.load_adapter("adapter-001", "/models/adapter-001")

Source code in libs/inference/src/inference/provider.py
@abstractmethod
async def load_adapter(self, adapter_id: str, adapter_path: str) -> None:
    """Load a LoRA adapter into the inference server.

    Args:
        adapter_id: Unique name for the adapter (used as lora_name in vLLM).
        adapter_path: Filesystem path to the adapter weights directory.

    Raises:
        UnsupportedOperationError: If the provider does not support adapters.
        httpx.HTTPStatusError: If the server returns an error response.

    Example:
        >>> await provider.load_adapter("adapter-001", "/models/adapter-001")
    """
    ...
unload_adapter abstractmethod async
unload_adapter(adapter_id: str) -> None

Unload a previously loaded LoRA adapter.

Parameters:

Name Type Description Default
adapter_id str

The adapter name to remove from the server.

required

Raises:

Type Description
UnsupportedOperationError

If the provider does not support adapters.

HTTPStatusError

If the server returns an error response.

Example

await provider.unload_adapter("adapter-001")

Source code in libs/inference/src/inference/provider.py
@abstractmethod
async def unload_adapter(self, adapter_id: str) -> None:
    """Unload a previously loaded LoRA adapter.

    Args:
        adapter_id: The adapter name to remove from the server.

    Raises:
        UnsupportedOperationError: If the provider does not support adapters.
        httpx.HTTPStatusError: If the server returns an error response.

    Example:
        >>> await provider.unload_adapter("adapter-001")
    """
    ...
list_adapters abstractmethod async
list_adapters() -> list[str]

List all currently loaded LoRA adapters.

Returns:

Type Description
list[str]

Sorted list of adapter IDs currently available for inference. Returns an empty list if no adapters are loaded or the provider does not support adapters.

Example

adapters = await provider.list_adapters()
print(adapters)  # ["adapter-001", "adapter-002"]

Source code in libs/inference/src/inference/provider.py
@abstractmethod
async def list_adapters(self) -> list[str]:
    """List all currently loaded LoRA adapters.

    Returns:
        Sorted list of adapter IDs currently available for inference.
        Returns an empty list if no adapters are loaded or the provider
        does not support adapters.

    Example:
        >>> adapters = await provider.list_adapters()
        >>> print(adapters)  # ["adapter-001", "adapter-002"]
    """
    ...

LlamaCppProvider

LlamaCppProvider(
    model_path: str | None = None,
    n_ctx: int = 4096,
    n_gpu_layers: int = -1,
)

Bases: InferenceProvider

InferenceProvider backed by llama-cpp-python with native LoRA support.

Unlike OllamaProvider, this loads GGUF models directly and can apply LoRA adapters at load time. Unlike VLLMProvider, no server is needed.

The model is loaded lazily on first generate() call. When an adapter is loaded, the model is reloaded with the LoRA path applied.
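
A sketch of the adapter round trip inside an async context; the GGUF paths are illustrative:

provider = LlamaCppProvider(model_path="/models/qwen2.5-coder-1.5b.gguf")

# Register the adapter; the model is (re)loaded with it on the next generate().
await provider.load_adapter("adapter-001", "/models/adapter-001.gguf")

result = await provider.generate(
    "def hello",
    model="ignored",            # the model comes from model_path
    adapter_id="adapter-001",
)
print(result.adapter_id)        # "adapter-001"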

Parameters:

Name Type Description Default
model_path str | None

Path to the GGUF model file.

None
n_ctx int

Context window size. Default: 4096.

4096
n_gpu_layers int

Layers to offload to GPU (-1 = all). Default: -1.

-1
Example

provider = LlamaCppProvider(model_path="/models/qwen2.5-coder-1.5b.gguf")
result = await provider.generate("def hello", model="ignored")

Initialize LlamaCppProvider with model configuration.

Parameters:

Name Type Description Default
model_path str | None

Path to the GGUF model file.

None
n_ctx int

Context window size. Default: 4096.

4096
n_gpu_layers int

Layers to offload to GPU (-1 = all). Default: -1.

-1
Source code in libs/inference/src/inference/llamacpp_provider.py
def __init__(
    self,
    model_path: str | None = None,
    n_ctx: int = 4096,
    n_gpu_layers: int = -1,
) -> None:
    """Initialize LlamaCppProvider with model configuration.

    Args:
        model_path: Path to the GGUF model file.
        n_ctx: Context window size. Default: 4096.
        n_gpu_layers: Layers to offload to GPU (-1 = all). Default: -1.
    """
    self._model_path = model_path or ""
    self._n_ctx = n_ctx
    self._n_gpu_layers = n_gpu_layers
    self._llm: Any = None
    self._current_lora: str | None = None
    self._loaded_adapters: dict[str, str] = {}  # id -> path
Functions
generate async
generate(
    prompt: str,
    model: str,
    adapter_id: str | None = None,
    max_tokens: int = 4096,
    system_prompt: str | None = None,
    temperature: float | None = None,
    top_p: float | None = None,
    repetition_penalty: float | None = None,
) -> GenerationResult

Generate text using llama-cpp-python with optional LoRA adapter.

Parameters:

Name Type Description Default
prompt str

The user-facing input prompt.

required
model str

Ignored (model is set at construction via model_path).

required
adapter_id str | None

LoRA adapter ID to apply. Must be loaded via load_adapter() before use.

None
max_tokens int

Maximum tokens to generate.

4096
system_prompt str | None

Optional system-level instruction. Prepended to the prompt for llama.cpp (no native chat template support).

None
temperature float | None

Sampling temperature override.

None
top_p float | None

Nucleus sampling threshold override.

None
repetition_penalty float | None

Repetition penalty override.

None

Returns:

Type Description
GenerationResult

GenerationResult with generated text and metadata.

Raises:

Type Description
ValueError

If adapter_id is provided but has not been loaded.

Source code in libs/inference/src/inference/llamacpp_provider.py
async def generate(
    self,
    prompt: str,
    model: str,
    adapter_id: str | None = None,
    max_tokens: int = 4096,
    system_prompt: str | None = None,
    temperature: float | None = None,
    top_p: float | None = None,
    repetition_penalty: float | None = None,
) -> GenerationResult:
    """Generate text using llama-cpp-python with optional LoRA adapter.

    Args:
        prompt: The user-facing input prompt.
        model: Ignored (model is set at construction via model_path).
        adapter_id: LoRA adapter ID to apply. Must be loaded via
            load_adapter() before use.
        max_tokens: Maximum tokens to generate.
        system_prompt: Optional system-level instruction. Prepended to
            the prompt for llama.cpp (no native chat template support).
        temperature: Sampling temperature override.
        top_p: Nucleus sampling threshold override.
        repetition_penalty: Repetition penalty override.

    Returns:
        GenerationResult with generated text and metadata.

    Raises:
        ValueError: If adapter_id is provided but has not been loaded.
    """
    lora_path: str | None = None
    if adapter_id:
        if adapter_id not in self._loaded_adapters:
            raise ValueError(
                f"Adapter '{adapter_id}' has not been loaded. "
                "Call load_adapter() first."
            )
        lora_path = self._loaded_adapters[adapter_id]

    self._load_model_if_needed(lora_path=lora_path)

    full_prompt = f"{system_prompt}\n\n{prompt}" if system_prompt else prompt
    response = self._llm(  # type: ignore[union-attr]
        full_prompt,
        max_tokens=max_tokens,
        stop=_STOP_SEQUENCES,
    )

    text = response["choices"][0]["text"]
    token_count = response["usage"]["total_tokens"]
    finish_reason = response["choices"][0].get("finish_reason", "stop")

    return GenerationResult(
        text=text,
        model=Path(self._model_path).stem,
        adapter_id=adapter_id,
        token_count=token_count,
        finish_reason=finish_reason,
    )
load_adapter async
load_adapter(adapter_id: str, adapter_path: str) -> None

Register a LoRA adapter for use during generation.

The adapter is applied on next generate() call by reloading the model with the LoRA path. llama-cpp-python applies LoRA at model load time.

Parameters:

Name Type Description Default
adapter_id str

Unique name for the adapter.

required
adapter_path str

Filesystem path to the LoRA adapter weights (GGUF format).

required
Source code in libs/inference/src/inference/llamacpp_provider.py
async def load_adapter(self, adapter_id: str, adapter_path: str) -> None:
    """Register a LoRA adapter for use during generation.

    The adapter is applied on next generate() call by reloading the model
    with the LoRA path. llama-cpp-python applies LoRA at model load time.

    Args:
        adapter_id: Unique name for the adapter.
        adapter_path: Filesystem path to the LoRA adapter weights (GGUF format).
    """
    self._loaded_adapters[adapter_id] = adapter_path
    logger.info("Registered adapter %s -> %s", adapter_id, adapter_path)
unload_adapter async
unload_adapter(adapter_id: str) -> None

Remove a registered LoRA adapter.

If the currently active adapter is unloaded, the model will reload without it on the next generate() call.

Parameters:

Name Type Description Default
adapter_id str

The adapter name to remove.

required
Source code in libs/inference/src/inference/llamacpp_provider.py
async def unload_adapter(self, adapter_id: str) -> None:
    """Remove a registered LoRA adapter.

    If the currently active adapter is unloaded, the model will reload
    without it on the next generate() call.

    Args:
        adapter_id: The adapter name to remove.
    """
    if adapter_id in self._loaded_adapters:
        was_active = self._current_lora == self._loaded_adapters[adapter_id]
        del self._loaded_adapters[adapter_id]
        if was_active:
            self._current_lora = None  # Force reload without adapter
        logger.info("Unloaded adapter %s", adapter_id)
list_adapters async
list_adapters() -> list[str]

List all registered LoRA adapter IDs.

Returns:

Type Description
list[str]

Sorted list of registered adapter IDs.

Source code in libs/inference/src/inference/llamacpp_provider.py
async def list_adapters(self) -> list[str]:
    """List all registered LoRA adapter IDs.

    Returns:
        Sorted list of registered adapter IDs.
    """
    return sorted(self._loaded_adapters.keys())

OllamaProvider

OllamaProvider(base_url: str | None = None)

Bases: InferenceProvider

InferenceProvider backed by an Ollama server.

Uses Ollama's OpenAI-compatible API (/v1/chat/completions) for generation, keeping the HTTP layer symmetrical with VLLMProvider. Adapter operations are not supported — calling them raises UnsupportedOperationError.

Note

Ollama requires a non-empty api_key but ignores its value. The string "ollama" is used by convention.

Attributes:

Name Type Description
_client

AsyncOpenAI client pointing at the Ollama server.

Example

provider = OllamaProvider(base_url="http://localhost:11434/v1")
result = await provider.generate("def hello", model="qwen2.5-coder:7b")

Initialize OllamaProvider with an AsyncOpenAI client.

Parameters:

Name Type Description Default
base_url str | None

Override URL for the Ollama server. Defaults to OLLAMA_BASE_URL env var or http://localhost:11434/v1.

None
Source code in libs/inference/src/inference/ollama_provider.py
def __init__(self, base_url: str | None = None) -> None:
    """Initialize OllamaProvider with an AsyncOpenAI client.

    Args:
        base_url: Override URL for the Ollama server. Defaults to
            OLLAMA_BASE_URL env var or http://localhost:11434/v1.
    """
    self._base_url = base_url or OLLAMA_BASE_URL
    # Ollama requires a non-empty api_key but ignores its value.
    self._client = AsyncOpenAI(
        base_url=self._base_url,
        api_key="ollama",
    )
Functions
generate async
generate(
    prompt: str,
    model: str,
    adapter_id: str | None = None,
    max_tokens: int = 4096,
    system_prompt: str | None = None,
    temperature: float | None = None,
    top_p: float | None = None,
    repetition_penalty: float | None = None,
) -> GenerationResult

Generate text from a prompt using the base Ollama model.

If adapter_id is provided, a warning is logged and it is ignored — Ollama does not support LoRA adapters. The base model is always used.

Parameters:

Name Type Description Default
prompt str

The user-facing input prompt.

required
model str

The Ollama model identifier (e.g. "qwen2.5-coder:7b").

required
adapter_id str | None

Ignored. If provided, a warning is logged.

None
max_tokens int

Maximum number of tokens to generate.

4096
system_prompt str | None

Optional system-level instruction.

None
temperature float | None

Sampling temperature override.

None
top_p float | None

Nucleus sampling threshold override.

None
repetition_penalty float | None

Repetition penalty override.

None

Returns:

Type Description
GenerationResult

GenerationResult with adapter_id=None (Ollama has no adapter concept).

Example

result = await provider.generate("def fib", model="qwen2.5-coder:7b") print(result.text)

Source code in libs/inference/src/inference/ollama_provider.py
async def generate(
    self,
    prompt: str,
    model: str,
    adapter_id: str | None = None,
    max_tokens: int = 4096,
    system_prompt: str | None = None,
    temperature: float | None = None,
    top_p: float | None = None,
    repetition_penalty: float | None = None,
) -> GenerationResult:
    """Generate text from a prompt using the base Ollama model.

    If adapter_id is provided, a warning is logged and it is ignored —
    Ollama does not support LoRA adapters. The base model is always used.

    Args:
        prompt: The user-facing input prompt.
        model: The Ollama model identifier (e.g. "qwen2.5-coder:7b").
        adapter_id: Ignored. If provided, a warning is logged.
        max_tokens: Maximum number of tokens to generate.
        system_prompt: Optional system-level instruction.
        temperature: Sampling temperature override.
        top_p: Nucleus sampling threshold override.
        repetition_penalty: Repetition penalty override.

    Returns:
        GenerationResult with adapter_id=None (Ollama has no adapter concept).

    Example:
        >>> result = await provider.generate("def fib", model="qwen2.5-coder:7b")
        >>> print(result.text)
    """
    if adapter_id is not None:
        logger.warning(
            "OllamaProvider ignoring adapter_id=%s; "
            "Ollama does not support LoRA adapters.",
            adapter_id,
        )

    logger.debug("generate: model=%s max_tokens=%d", model, max_tokens)

    messages: list[dict[str, str]] = [{"role": "user", "content": prompt}]
    if system_prompt:
        messages.insert(0, {"role": "system", "content": system_prompt})

    response = await self._client.chat.completions.create(
        model=model,
        messages=messages,  # type: ignore[arg-type]
        max_tokens=max_tokens,
    )

    choice = response.choices[0]
    return GenerationResult(
        text=choice.message.content or "",
        model=response.model,
        adapter_id=None,
        token_count=response.usage.total_tokens if response.usage else 0,
        finish_reason=choice.finish_reason or "stop",
    )
load_adapter async
load_adapter(adapter_id: str, adapter_path: str) -> None

Not supported by Ollama. Always raises UnsupportedOperationError.

Parameters:

Name Type Description Default
adapter_id str

Unused.

required
adapter_path str

Unused.

required

Raises:

Type Description
UnsupportedOperationError

Always — Ollama does not support LoRA adapter loading. Use VLLMProvider for adapter operations.

Example

await provider.load_adapter("adapter-001", "/models/adapter-001")
# Raises UnsupportedOperationError

Source code in libs/inference/src/inference/ollama_provider.py
async def load_adapter(self, adapter_id: str, adapter_path: str) -> None:
    """Not supported by Ollama. Always raises UnsupportedOperationError.

    Args:
        adapter_id: Unused.
        adapter_path: Unused.

    Raises:
        UnsupportedOperationError: Always — Ollama does not support
            LoRA adapter loading. Use VLLMProvider for adapter operations.

    Example:
        >>> await provider.load_adapter("adapter-001", "/models/adapter-001")
        # Raises UnsupportedOperationError
    """
    raise UnsupportedOperationError(
        "OllamaProvider does not support LoRA adapter loading. "
        "Use VLLMProvider for adapter operations."
    )
unload_adapter async
unload_adapter(adapter_id: str) -> None

Not supported by Ollama. Always raises UnsupportedOperationError.

Parameters:

Name Type Description Default
adapter_id str

Unused.

required

Raises:

Type Description
UnsupportedOperationError

Always — Ollama does not support LoRA adapter unloading.

Example

await provider.unload_adapter("adapter-001")
# Raises UnsupportedOperationError

Source code in libs/inference/src/inference/ollama_provider.py
async def unload_adapter(self, adapter_id: str) -> None:
    """Not supported by Ollama. Always raises UnsupportedOperationError.

    Args:
        adapter_id: Unused.

    Raises:
        UnsupportedOperationError: Always — Ollama does not support
            LoRA adapter unloading.

    Example:
        >>> await provider.unload_adapter("adapter-001")
        # Raises UnsupportedOperationError
    """
    raise UnsupportedOperationError(
        "OllamaProvider does not support LoRA adapter unloading."
    )
list_adapters async
list_adapters() -> list[str]

Return an empty list — Ollama has no adapter concept.

Returns:

Type Description
list[str]

Always returns an empty list.

Example

adapters = await provider.list_adapters()
print(adapters)  # []

Source code in libs/inference/src/inference/ollama_provider.py
async def list_adapters(self) -> list[str]:
    """Return an empty list — Ollama has no adapter concept.

    Returns:
        Always returns an empty list.

    Example:
        >>> adapters = await provider.list_adapters()
        >>> print(adapters)  # []
    """
    return []

TransformersProvider

TransformersProvider(
    model_name: str = "",
    device: str = "cpu",
    torch_dtype: str = "auto",
)

Bases: InferenceProvider

InferenceProvider backed by HuggingFace transformers with PEFT LoRA.

Loads models locally via AutoModelForCausalLM. Adapters are applied via PEFT's PeftModel, which natively reads the safetensors format output by the hypernetwork.
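
A sketch of serving a hypernetwork-produced PEFT adapter inside an async context; the model name and adapter path are illustrative:

provider = TransformersProvider(model_name="Qwen/Qwen2.5-Coder-0.5B", device="cpu")

# The directory must contain adapter_model.safetensors and adapter_config.json.
await provider.load_adapter("adapter-001", "/adapters/adapter-001")

result = await provider.generate(
    "def hello",
    model="ignored",            # the model is fixed at construction
    adapter_id="adapter-001",
    max_tokens=128,
)
print(result.adapter_id, result.finish_reason)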

Parameters:

Name Type Description Default
model_name str

HuggingFace model ID or local path.

''
device str

Device to load model onto ('cpu', 'mps', 'cuda').

'cpu'
torch_dtype str

Model dtype ('auto', 'float16', 'bfloat16').

'auto'
Example

provider = TransformersProvider(model_name="Qwen/Qwen2.5-Coder-0.5B")
result = await provider.generate("def hello", model="ignored")

Initialize TransformersProvider.

Parameters:

Name Type Description Default
model_name str

HuggingFace model ID or local path.

''
device str

Device to load model onto.

'cpu'
torch_dtype str

Model dtype string.

'auto'
Source code in libs/inference/src/inference/transformers_provider.py
def __init__(
    self,
    model_name: str = "",
    device: str = "cpu",
    torch_dtype: str = "auto",
) -> None:
    """Initialize TransformersProvider.

    Args:
        model_name: HuggingFace model ID or local path.
        device: Device to load model onto.
        torch_dtype: Model dtype string.
    """
    self._model_name = model_name
    self._device = device
    self._torch_dtype = torch_dtype
    self._model: Any = None
    self._tokenizer: Any = None
    self._base_model: Any = None
    self._loaded_adapters: dict[str, str] = {}  # id -> path
    self._active_adapter: str | None = None
    self._is_peft_wrapped: bool = False
Functions
generate async
generate(
    prompt: str,
    model: str,
    adapter_id: str | None = None,
    max_tokens: int = 4096,
    system_prompt: str | None = None,
    temperature: float | None = None,
    top_p: float | None = None,
    repetition_penalty: float | None = None,
) -> GenerationResult

Generate text using transformers with optional PEFT adapter.

Parameters:

Name Type Description Default
prompt str

The user-facing input prompt.

required
model str

Ignored (model is set at construction).

required
adapter_id str | None

LoRA adapter ID to activate. Must be loaded via load_adapter() before use.

None
max_tokens int

Maximum tokens to generate.

4096
system_prompt str | None

Optional system-level instruction prepended via the tokenizer's chat template when available.

None
temperature float | None

Sampling temperature (default from pipeline config).

None
top_p float | None

Nucleus sampling threshold (default from pipeline config).

None
repetition_penalty float | None

Repetition penalty (default 1.0 = off).

None

Returns:

Type Description
GenerationResult

GenerationResult with generated text and metadata.

Raises:

Type Description
ValueError

If adapter_id is provided but has not been loaded.

Source code in libs/inference/src/inference/transformers_provider.py
async def generate(
    self,
    prompt: str,
    model: str,
    adapter_id: str | None = None,
    max_tokens: int = 4096,
    system_prompt: str | None = None,
    temperature: float | None = None,
    top_p: float | None = None,
    repetition_penalty: float | None = None,
) -> GenerationResult:
    """Generate text using transformers with optional PEFT adapter.

    Args:
        prompt: The user-facing input prompt.
        model: Ignored (model is set at construction).
        adapter_id: LoRA adapter ID to activate. Must be loaded via
            load_adapter() before use.
        max_tokens: Maximum tokens to generate.
        system_prompt: Optional system-level instruction prepended via
            the tokenizer's chat template when available.
        temperature: Sampling temperature (default from pipeline config).
        top_p: Nucleus sampling threshold (default from pipeline config).
        repetition_penalty: Repetition penalty (default 1.0 = off).

    Returns:
        GenerationResult with generated text and metadata.

    Raises:
        ValueError: If adapter_id is provided but has not been loaded.
    """
    import torch  # noqa: PLC0415

    self._load_model_if_needed()

    # Apply defaults from pipeline config
    if temperature is None:
        temperature = float(os.environ.get("RUNE_TEMPERATURE", "0.25"))
    if top_p is None:
        top_p = float(os.environ.get("RUNE_TOP_P", "0.9"))
    if repetition_penalty is None:
        repetition_penalty = float(
            os.environ.get("RUNE_REPETITION_PENALTY", "1.04")
        )

    # Validate adapter before switching
    if adapter_id and adapter_id not in self._loaded_adapters:
        raise ValueError(
            f"Adapter '{adapter_id}' has not been loaded. "
            "Call load_adapter() first."
        )

    # Switch adapter if needed
    if adapter_id and adapter_id != self._active_adapter:
        self._activate_adapter(adapter_id)
    elif not adapter_id and self._active_adapter:
        self._deactivate_adapter()

    # Build chat-formatted prompt via tokenizer's chat template
    formatted = self._format_prompt(prompt, system_prompt)
    inputs = self._tokenizer(
        formatted, return_tensors="pt", truncation=True, max_length=8192
    )
    inputs = {k: v.to(self._device) for k, v in inputs.items()}
    input_len = inputs["input_ids"].shape[1]

    gen_kwargs: dict[str, object] = {
        "max_new_tokens": max_tokens,
        "do_sample": temperature > 0,
        "temperature": max(temperature, 0.01),
        "top_p": top_p,
        "pad_token_id": self._tokenizer.pad_token_id,
    }
    if repetition_penalty > 1.0:
        gen_kwargs["repetition_penalty"] = repetition_penalty

    with torch.no_grad():
        outputs = self._model.generate(**inputs, **gen_kwargs)

    new_tokens = outputs[0][input_len:]
    text = self._tokenizer.decode(new_tokens, skip_special_tokens=True)
    total_tokens = outputs.shape[1]
    new_token_count = len(new_tokens)

    # Detect truncation: generated exactly max_tokens means cut off
    finish_reason = "length" if new_token_count >= max_tokens else "stop"

    return GenerationResult(
        text=text,
        model=self._model_name,
        adapter_id=self._active_adapter,
        token_count=total_tokens,
        finish_reason=finish_reason,
    )
load_adapter async
load_adapter(adapter_id: str, adapter_path: str) -> None

Register a PEFT adapter directory for use during generation.

The adapter directory must contain adapter_model.safetensors and adapter_config.json as output by save_hypernetwork_adapter().
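
The expected on-disk layout of such a directory (the directory name is illustrative):

/adapters/adapter-001/
├── adapter_config.json
└── adapter_model.safetensors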

Parameters:

Name Type Description Default
adapter_id str

Unique name for the adapter.

required
adapter_path str

Path to the PEFT adapter directory.

required
Source code in libs/inference/src/inference/transformers_provider.py
async def load_adapter(self, adapter_id: str, adapter_path: str) -> None:
    """Register a PEFT adapter directory for use during generation.

    The adapter directory must contain adapter_model.safetensors and
    adapter_config.json as output by save_hypernetwork_adapter().

    Args:
        adapter_id: Unique name for the adapter.
        adapter_path: Path to the PEFT adapter directory.
    """
    self._loaded_adapters[adapter_id] = adapter_path
    logger.info("Registered adapter %s -> %s", adapter_id, adapter_path)
unload_adapter async
unload_adapter(adapter_id: str) -> None

Remove a registered adapter, freeing GPU memory.

Parameters:

Name Type Description Default
adapter_id str

The adapter name to remove.

required
Source code in libs/inference/src/inference/transformers_provider.py
async def unload_adapter(self, adapter_id: str) -> None:
    """Remove a registered adapter, freeing GPU memory.

    Args:
        adapter_id: The adapter name to remove.
    """
    if adapter_id in self._loaded_adapters:
        if self._active_adapter == adapter_id:
            self._deactivate_adapter()
        # Delete from PeftModel to free GPU memory
        if self._is_peft_wrapped and adapter_id in self._model.peft_config:
            self._model.delete_adapter(adapter_id)
        del self._loaded_adapters[adapter_id]
        # If no adapters remain, revert to base model
        if not self._loaded_adapters and self._is_peft_wrapped:
            self._model = self._base_model
            self._is_peft_wrapped = False
            logger.info("All adapters removed, reverted to base model")
        else:
            logger.info("Unloaded adapter %s", adapter_id)
list_adapters async
list_adapters() -> list[str]

List all registered adapter IDs.

Returns:

Type Description
list[str]

Sorted list of registered adapter IDs.

Source code in libs/inference/src/inference/transformers_provider.py
async def list_adapters(self) -> list[str]:
    """List all registered adapter IDs.

    Returns:
        Sorted list of registered adapter IDs.
    """
    return sorted(self._loaded_adapters.keys())

VLLMProvider

VLLMProvider(base_url: str | None = None)

Bases: InferenceProvider

InferenceProvider backed by a vLLM server with LoRA hot-loading support.

Communicates with vLLM via two channels
  • AsyncOpenAI SDK for generation (OpenAI-compatible endpoint).
  • httpx for LoRA adapter management (vLLM proprietary endpoints).

Adapter tracking is maintained in an internal set to work around vLLM bug #11761 (list_lora_adapters unreliable after concurrent loads).
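
A sketch of the full adapter lifecycle against a running vLLM server; the adapter name and path are illustrative:

provider = VLLMProvider(base_url="http://localhost:8100/v1")

# POST /v1/load_lora_adapter, then track the name locally.
await provider.load_adapter("adapter-001", "/models/adapter-001")

# The adapter name is passed as the model so vLLM routes to the LoRA.
result = await provider.generate(
    "def fib",
    model="Qwen2.5-Coder-7B",
    adapter_id="adapter-001",
)

print(await provider.list_adapters())   # ["adapter-001"]

await provider.unload_adapter("adapter-001")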

Attributes:

Name Type Description
_client

AsyncOpenAI client pointing at the vLLM server.

_base_url

Base URL string for constructing adapter management URLs.

_loaded_adapters set[str]

Set of currently tracked adapter IDs.

Example

provider = VLLMProvider(base_url="http://localhost:8100/v1")
result = await provider.generate("def hello", model="Qwen2.5-Coder-7B")

Initialize VLLMProvider with an AsyncOpenAI client.

Parameters:

Name Type Description Default
base_url str | None

Override URL for the vLLM server. Defaults to VLLM_BASE_URL env var or http://localhost:8100/v1.

None
Source code in libs/inference/src/inference/vllm_provider.py
def __init__(self, base_url: str | None = None) -> None:
    """Initialize VLLMProvider with an AsyncOpenAI client.

    Args:
        base_url: Override URL for the vLLM server. Defaults to
            VLLM_BASE_URL env var or http://localhost:8100/v1.
    """
    self._base_url = base_url or VLLM_BASE_URL
    self._client = AsyncOpenAI(
        base_url=self._base_url,
        api_key="not-needed-for-local-vllm",
    )
    self._loaded_adapters: set[str] = set()
Functions
generate async
generate(
    prompt: str,
    model: str,
    adapter_id: str | None = None,
    max_tokens: int = 4096,
    system_prompt: str | None = None,
    temperature: float | None = None,
    top_p: float | None = None,
    repetition_penalty: float | None = None,
) -> GenerationResult

Generate text from a prompt, optionally using a loaded LoRA adapter.

When adapter_id is provided, it is passed as the model parameter to the OpenAI API — this is how vLLM identifies and routes to loaded LoRA adapters (the adapter is referenced by its lora_name).

Parameters:

Name Type Description Default
prompt str

The user-facing input prompt.

required
model str

Base model identifier. Used as-is when no adapter is given.

required
adapter_id str | None

Name of a loaded LoRA adapter to apply. When set, this value replaces model in the API call.

None
max_tokens int

Maximum number of tokens to generate.

4096
system_prompt str | None

Optional system-level instruction.

None
temperature float | None

Sampling temperature override.

None
top_p float | None

Nucleus sampling threshold override.

None
repetition_penalty float | None

Repetition penalty override.

None

Returns:

Type Description
GenerationResult

GenerationResult with the generated text and metadata.

Example

result = await provider.generate("def fib", model="Qwen2.5-Coder-7B") print(result.text)

Source code in libs/inference/src/inference/vllm_provider.py
async def generate(
    self,
    prompt: str,
    model: str,
    adapter_id: str | None = None,
    max_tokens: int = 4096,
    system_prompt: str | None = None,
    temperature: float | None = None,
    top_p: float | None = None,
    repetition_penalty: float | None = None,
) -> GenerationResult:
    """Generate text from a prompt, optionally using a loaded LoRA adapter.

    When adapter_id is provided, it is passed as the model parameter to
    the OpenAI API — this is how vLLM identifies and routes to loaded
    LoRA adapters (the adapter is referenced by its lora_name).

    Args:
        prompt: The user-facing input prompt.
        model: Base model identifier. Used as-is when no adapter is given.
        adapter_id: Name of a loaded LoRA adapter to apply. When set,
            this value replaces model in the API call.
        max_tokens: Maximum number of tokens to generate.
        system_prompt: Optional system-level instruction.
        temperature: Sampling temperature override.
        top_p: Nucleus sampling threshold override.
        repetition_penalty: Repetition penalty override.

    Returns:
        GenerationResult with the generated text and metadata.

    Example:
        >>> result = await provider.generate("def fib", model="Qwen2.5-Coder-7B")
        >>> print(result.text)
    """
    effective_model = adapter_id if adapter_id is not None else model
    logger.debug(
        "generate: model=%s adapter_id=%s max_tokens=%d",
        effective_model,
        adapter_id,
        max_tokens,
    )

    messages: list[dict[str, str]] = [{"role": "user", "content": prompt}]
    if system_prompt:
        messages.insert(0, {"role": "system", "content": system_prompt})

    response = await self._client.chat.completions.create(
        model=effective_model,
        messages=messages,  # type: ignore[arg-type]
        max_tokens=max_tokens,
    )

    choice = response.choices[0]
    return GenerationResult(
        text=choice.message.content or "",
        model=response.model,
        adapter_id=adapter_id,
        token_count=response.usage.total_tokens if response.usage else 0,
        finish_reason=choice.finish_reason or "stop",
    )
load_adapter async
load_adapter(adapter_id: str, adapter_path: str) -> None

Load a LoRA adapter into the vLLM server.

Posts to vLLM's /v1/load_lora_adapter endpoint and adds the adapter to the internal tracking set on success.

Parameters:

Name Type Description Default
adapter_id str

Unique name for the adapter (used as lora_name).

required
adapter_path str

Filesystem path to the adapter weights directory.

required

Raises:

Type Description
HTTPStatusError

If the vLLM server returns an error response.

Example

await provider.load_adapter("adapter-001", "/models/adapter-001")

Source code in libs/inference/src/inference/vllm_provider.py
async def load_adapter(self, adapter_id: str, adapter_path: str) -> None:
    """Load a LoRA adapter into the vLLM server.

    Posts to vLLM's /v1/load_lora_adapter endpoint and adds the adapter
    to the internal tracking set on success.

    Args:
        adapter_id: Unique name for the adapter (used as lora_name).
        adapter_path: Filesystem path to the adapter weights directory.

    Raises:
        httpx.HTTPStatusError: If the vLLM server returns an error response.

    Example:
        >>> await provider.load_adapter("adapter-001", "/models/adapter-001")
    """
    url = f"{self._base_url.rstrip('/')}/load_lora_adapter"
    logger.debug("load_adapter: POST %s lora_name=%s", url, adapter_id)

    async with httpx.AsyncClient() as client:
        response = await client.post(
            url,
            json={"lora_name": adapter_id, "lora_path": adapter_path},
        )
        response.raise_for_status()

    self._loaded_adapters.add(adapter_id)
    logger.info("Adapter loaded: %s", adapter_id)
unload_adapter async
unload_adapter(adapter_id: str) -> None

Unload a LoRA adapter from the vLLM server.

Posts to vLLM's /v1/unload_lora_adapter endpoint and removes the adapter from the internal tracking set.

Parameters:

Name Type Description Default
adapter_id str

Name of the adapter to unload.

required

Raises:

Type Description
HTTPStatusError

If the vLLM server returns an error response.

Example

await provider.unload_adapter("adapter-001")

Source code in libs/inference/src/inference/vllm_provider.py
async def unload_adapter(self, adapter_id: str) -> None:
    """Unload a LoRA adapter from the vLLM server.

    Posts to vLLM's /v1/unload_lora_adapter endpoint and removes the
    adapter from the internal tracking set.

    Args:
        adapter_id: Name of the adapter to unload.

    Raises:
        httpx.HTTPStatusError: If the vLLM server returns an error response.

    Example:
        >>> await provider.unload_adapter("adapter-001")
    """
    url = f"{self._base_url.rstrip('/')}/unload_lora_adapter"
    logger.debug("unload_adapter: POST %s lora_name=%s", url, adapter_id)

    async with httpx.AsyncClient() as client:
        response = await client.post(
            url,
            json={"lora_name": adapter_id},
        )
        response.raise_for_status()

    self._loaded_adapters.discard(adapter_id)
    logger.info("Adapter unloaded: %s", adapter_id)
list_adapters async
list_adapters() -> list[str]

List all currently loaded LoRA adapters.

Returns the internal tracking set rather than querying vLLM to avoid the unreliable list endpoint (vLLM bug #11761).

Returns:

Type Description
list[str]

Sorted list of adapter IDs currently tracked as loaded.

Example

adapters = await provider.list_adapters()
print(adapters)  # ["adapter-001", "adapter-002"]

Source code in libs/inference/src/inference/vllm_provider.py
async def list_adapters(self) -> list[str]:
    """List all currently loaded LoRA adapters.

    Returns the internal tracking set rather than querying vLLM to avoid
    the unreliable list endpoint (vLLM bug #11761).

    Returns:
        Sorted list of adapter IDs currently tracked as loaded.

    Example:
        >>> adapters = await provider.list_adapters()
        >>> print(adapters)  # ["adapter-001", "adapter-002"]
    """
    return sorted(self._loaded_adapters)

Functions

get_provider

get_provider(
    provider_type: str | None = None,
    base_url: str | None = None,
) -> InferenceProvider

Return a cached InferenceProvider for the given backend.

Resolves the provider type from the argument or the INFERENCE_PROVIDER env var (default: "vllm"). Resolves the base URL (or model path) from the argument or the per-backend env var (VLLM_BASE_URL, OLLAMA_BASE_URL, LLAMACPP_MODEL_PATH, or TRANSFORMERS_MODEL_NAME). Instances are cached by the (provider_type, base_url) tuple so repeated calls with the same arguments return the identical object.
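
A sketch of the caching behaviour, assuming the openai package is installed; the URL is illustrative:

import os

from inference.factory import get_provider

os.environ["INFERENCE_PROVIDER"] = "ollama"
os.environ["OLLAMA_BASE_URL"] = "http://localhost:11434/v1"

a = get_provider()           # type and URL resolved from the environment
b = get_provider("ollama")   # same (provider_type, base_url) cache key
assert a is b                # repeated calls return the identical instance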

Parameters:

Name Type Description Default
provider_type str | None

One of "vllm" or "ollama". If None, falls back to the INFERENCE_PROVIDER environment variable (default "vllm").

None
base_url str | None

Override URL for the backend server. If None, the per-backend default env var is used.

None

Returns:

Type Description
InferenceProvider

A cached InferenceProvider instance for the requested backend.

Raises:

Type Description
ValueError

If provider_type is not "vllm", "ollama", or "llamacpp".

Example

provider = get_provider("vllm") isinstance(provider, VLLMProvider) True

Source code in libs/inference/src/inference/factory.py
def get_provider(
    provider_type: str | None = None,
    base_url: str | None = None,
) -> InferenceProvider:
    """Return a cached InferenceProvider for the given backend.

    Resolves the provider type from the argument or the INFERENCE_PROVIDER
    env var (default: "vllm"). Resolves the base URL from the argument or
    the per-backend env var (VLLM_BASE_URL / OLLAMA_BASE_URL). Instances
    are cached by the (provider_type, base_url) tuple so repeated calls
    with the same arguments return the identical object.

    Args:
        provider_type: One of "vllm" or "ollama". If None, falls back to
            the INFERENCE_PROVIDER environment variable (default "vllm").
        base_url: Override URL for the backend server. If None, the
            per-backend default env var is used.

    Returns:
        A cached InferenceProvider instance for the requested backend.

    Raises:
        ValueError: If provider_type is not "vllm", "ollama", or "llamacpp".

    Example:
        >>> provider = get_provider("vllm")
        >>> isinstance(provider, VLLMProvider)
        True
    """
    ptype = (
        provider_type
        or os.environ.get("INFERENCE_PROVIDER", _DEFAULT_INFERENCE_PROVIDER)
    ).lower()

    resolved_url: str
    if ptype == "vllm":
        resolved_url = base_url or os.environ.get(
            "VLLM_BASE_URL", _DEFAULT_VLLM_BASE_URL
        )
    elif ptype == "ollama":
        resolved_url = base_url or os.environ.get(
            "OLLAMA_BASE_URL", _DEFAULT_OLLAMA_BASE_URL
        )
    elif ptype == "llamacpp":
        resolved_url = base_url or os.environ.get(
            "LLAMACPP_MODEL_PATH", _DEFAULT_LLAMACPP_MODEL_PATH
        )
    elif ptype == "transformers":
        resolved_url = base_url or os.environ.get("TRANSFORMERS_MODEL_NAME", "")
    else:
        raise ValueError(
            f"Unknown provider type: '{ptype}'. "
            "Supported values: 'vllm', 'ollama', 'llamacpp', 'transformers'."
        )

    cache_key = (ptype, resolved_url)
    if cache_key not in _provider_cache:
        if ptype == "vllm":
            from inference.vllm_provider import VLLMProvider

            _provider_cache[cache_key] = VLLMProvider(base_url=resolved_url)
        elif ptype == "llamacpp":
            from inference.llamacpp_provider import LlamaCppProvider

            _provider_cache[cache_key] = LlamaCppProvider(model_path=resolved_url)
        elif ptype == "transformers":
            from shared.hardware import get_best_device

            from inference.transformers_provider import TransformersProvider

            device = os.environ.get("TRANSFORMERS_DEVICE", get_best_device())
            _provider_cache[cache_key] = TransformersProvider(
                model_name=resolved_url, device=device
            )
        else:
            from inference.ollama_provider import OllamaProvider

            _provider_cache[cache_key] = OllamaProvider(base_url=resolved_url)

    return _provider_cache[cache_key]

get_provider_for_step

get_provider_for_step(
    step_config: dict[str, str],
) -> InferenceProvider

Return a cached InferenceProvider configured from a step config dict.

Reads "provider" and optionally "base_url" from the step config and delegates to get_provider(). Designed for use by the agent loop where each pipeline step may specify its own provider and server URL.

Parameters:

Name Type Description Default
step_config dict[str, str]

Dict with optional keys: - "provider": Provider type ("vllm" or "ollama"). - "base_url": Override URL for the backend server.

required

Returns:

Type Description
InferenceProvider

A cached InferenceProvider instance for the step's backend.

Raises:

Type Description
ValueError

If the provider type in step_config is not supported.

Example

provider = get_provider_for_step({"provider": "ollama"})
isinstance(provider, OllamaProvider)  # True

Source code in libs/inference/src/inference/factory.py
def get_provider_for_step(step_config: dict[str, str]) -> InferenceProvider:
    """Return a cached InferenceProvider configured from a step config dict.

    Reads "provider" and optionally "base_url" from the step config and
    delegates to get_provider(). Designed for use by the agent loop where
    each pipeline step may specify its own provider and server URL.

    Args:
        step_config: Dict with optional keys:
            - "provider": Provider type ("vllm" or "ollama").
            - "base_url": Override URL for the backend server.

    Returns:
        A cached InferenceProvider instance for the step's backend.

    Raises:
        ValueError: If the provider type in step_config is not supported.

    Example:
        >>> provider = get_provider_for_step({"provider": "ollama"})
        >>> isinstance(provider, OllamaProvider)
        True
    """
    return get_provider(
        provider_type=step_config.get("provider"),
        base_url=step_config.get("base_url"),
    )

Modules

exceptions

Custom exceptions for the inference library.

factory

Provider factory with instance cache for the inference library.

Selects between VLLMProvider, OllamaProvider, LlamaCppProvider, and TransformersProvider based on configuration, caching instances by (provider_type, base_url) to avoid redundant construction.

Classes
Functions
get_provider
get_provider(
    provider_type: str | None = None,
    base_url: str | None = None,
) -> InferenceProvider

Return a cached InferenceProvider for the given backend.

Resolves the provider type from the argument or the INFERENCE_PROVIDER env var (default: "vllm"). Resolves the base URL from the argument or the per-backend env var (VLLM_BASE_URL / OLLAMA_BASE_URL). Instances are cached by the (provider_type, base_url) tuple so repeated calls with the same arguments return the identical object.

Parameters:

Name Type Description Default
provider_type str | None

One of "vllm" or "ollama". If None, falls back to the INFERENCE_PROVIDER environment variable (default "vllm").

None
base_url str | None

Override URL for the backend server. If None, the per-backend default env var is used.

None

Returns:

Type Description
InferenceProvider

A cached InferenceProvider instance for the requested backend.

Raises:

Type Description
ValueError

If provider_type is not "vllm", "ollama", or "llamacpp".

Example

provider = get_provider("vllm") isinstance(provider, VLLMProvider) True

Source code in libs/inference/src/inference/factory.py
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
def get_provider(
    provider_type: str | None = None,
    base_url: str | None = None,
) -> InferenceProvider:
    """Return a cached InferenceProvider for the given backend.

    Resolves the provider type from the argument or the INFERENCE_PROVIDER
    env var (default: "vllm"). Resolves the base URL from the argument or
    the per-backend env var (VLLM_BASE_URL / OLLAMA_BASE_URL). Instances
    are cached by the (provider_type, base_url) tuple so repeated calls
    with the same arguments return the identical object.

    Args:
        provider_type: One of "vllm" or "ollama". If None, falls back to
            the INFERENCE_PROVIDER environment variable (default "vllm").
        base_url: Override URL for the backend server. If None, the
            per-backend default env var is used.

    Returns:
        A cached InferenceProvider instance for the requested backend.

    Raises:
        ValueError: If provider_type is not "vllm", "ollama", or "llamacpp".

    Example:
        >>> provider = get_provider("vllm")
        >>> isinstance(provider, VLLMProvider)
        True
    """
    ptype = (
        provider_type
        or os.environ.get("INFERENCE_PROVIDER", _DEFAULT_INFERENCE_PROVIDER)
    ).lower()

    resolved_url: str
    if ptype == "vllm":
        resolved_url = base_url or os.environ.get(
            "VLLM_BASE_URL", _DEFAULT_VLLM_BASE_URL
        )
    elif ptype == "ollama":
        resolved_url = base_url or os.environ.get(
            "OLLAMA_BASE_URL", _DEFAULT_OLLAMA_BASE_URL
        )
    elif ptype == "llamacpp":
        resolved_url = base_url or os.environ.get(
            "LLAMACPP_MODEL_PATH", _DEFAULT_LLAMACPP_MODEL_PATH
        )
    elif ptype == "transformers":
        resolved_url = base_url or os.environ.get("TRANSFORMERS_MODEL_NAME", "")
    else:
        raise ValueError(
            f"Unknown provider type: '{ptype}'. "
            "Supported values: 'vllm', 'ollama', 'llamacpp', 'transformers'."
        )

    cache_key = (ptype, resolved_url)
    if cache_key not in _provider_cache:
        if ptype == "vllm":
            from inference.vllm_provider import VLLMProvider

            _provider_cache[cache_key] = VLLMProvider(base_url=resolved_url)
        elif ptype == "llamacpp":
            from inference.llamacpp_provider import LlamaCppProvider

            _provider_cache[cache_key] = LlamaCppProvider(model_path=resolved_url)
        elif ptype == "transformers":
            from shared.hardware import get_best_device

            from inference.transformers_provider import TransformersProvider

            device = os.environ.get("TRANSFORMERS_DEVICE", get_best_device())
            _provider_cache[cache_key] = TransformersProvider(
                model_name=resolved_url, device=device
            )
        else:
            from inference.ollama_provider import OllamaProvider

            _provider_cache[cache_key] = OllamaProvider(base_url=resolved_url)

    return _provider_cache[cache_key]
get_provider_for_step
get_provider_for_step(
    step_config: dict[str, str],
) -> InferenceProvider

Return a cached InferenceProvider configured from a step config dict.

Reads "provider" and optionally "base_url" from the step config and delegates to get_provider(). Designed for use by the agent loop where each pipeline step may specify its own provider and server URL.

Parameters:

Name Type Description Default
step_config dict[str, str]

Dict with optional keys: - "provider": Provider type ("vllm" or "ollama"). - "base_url": Override URL for the backend server.

required

Returns:

Type Description
InferenceProvider

A cached InferenceProvider instance for the step's backend.

Raises:

Type Description
ValueError

If the provider type in step_config is not supported.

Example

provider = get_provider_for_step({"provider": "ollama"}) isinstance(provider, OllamaProvider) True

Source code in libs/inference/src/inference/factory.py
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
def get_provider_for_step(step_config: dict[str, str]) -> InferenceProvider:
    """Return a cached InferenceProvider configured from a step config dict.

    Reads "provider" and optionally "base_url" from the step config and
    delegates to get_provider(). Designed for use by the agent loop where
    each pipeline step may specify its own provider and server URL.

    Args:
        step_config: Dict with optional keys:
            - "provider": Provider type ("vllm" or "ollama").
            - "base_url": Override URL for the backend server.

    Returns:
        A cached InferenceProvider instance for the step's backend.

    Raises:
        ValueError: If the provider type in step_config is not supported.

    Example:
        >>> provider = get_provider_for_step({"provider": "ollama"})
        >>> isinstance(provider, OllamaProvider)
        True
    """
    return get_provider(
        provider_type=step_config.get("provider"),
        base_url=step_config.get("base_url"),
    )
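
A sketch of how a pipeline loop might use this helper; the step configs below are hypothetical:

from inference.factory import get_provider_for_step

# Hypothetical per-step configuration; each step can target its own backend.
step_configs = [
    {"provider": "ollama"},
    {"provider": "vllm", "base_url": "http://localhost:8100/v1"},
]

for config in step_configs:
    provider = get_provider_for_step(config)
    print(type(provider).__name__)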

llamacpp_provider

LlamaCppProvider: InferenceProvider using llama-cpp-python with LoRA support.

Loads GGUF models via llama_cpp.Llama with optional LoRA adapter paths. Designed for Apple Silicon (Metal) local inference where adapter hot-loading is needed — Ollama cannot load LoRA adapters, and vLLM requires a server.

IMPORTANT: llama_cpp is imported inside method bodies per the INFRA-05 pattern so that this module remains importable in CPU-only CI without llama-cpp-python.
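
The lazy-import pattern referenced above looks roughly like the following simplified sketch; the helper name _completion is illustrative, not the module's exact code:

def _completion(self, prompt: str) -> str:
    # llama_cpp is imported inside the method, not at module level, so
    # importing inference.llamacpp_provider never requires llama-cpp-python.
    from llama_cpp import Llama  # noqa: PLC0415

    if self._llm is None:
        self._llm = Llama(model_path=self._model_path, n_ctx=self._n_ctx)
    return self._llm(prompt, max_tokens=64)["choices"][0]["text"]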

Classes
LlamaCppProvider
LlamaCppProvider(
    model_path: str | None = None,
    n_ctx: int = 4096,
    n_gpu_layers: int = -1,
)

Bases: InferenceProvider

InferenceProvider backed by llama-cpp-python with native LoRA support.

Unlike OllamaProvider, this loads GGUF models directly and can apply LoRA adapters at load time. Unlike VLLMProvider, no server is needed.

The model is loaded lazily on first generate() call. When an adapter is loaded, the model is reloaded with the LoRA path applied.

Parameters:

Name Type Description Default
model_path str | None

Path to the GGUF model file.

None
n_ctx int

Context window size. Default: 4096.

4096
n_gpu_layers int

Layers to offload to GPU (-1 = all). Default: -1.

-1
Example

provider = LlamaCppProvider(model_path="/models/qwen2.5-coder-1.5b.gguf")
result = await provider.generate("def hello", model="ignored")

Initialize LlamaCppProvider with model configuration.

Parameters:

Name Type Description Default
model_path str | None

Path to the GGUF model file.

None
n_ctx int

Context window size. Default: 4096.

4096
n_gpu_layers int

Layers to offload to GPU (-1 = all). Default: -1.

-1
Source code in libs/inference/src/inference/llamacpp_provider.py, lines 44-62
def __init__(
    self,
    model_path: str | None = None,
    n_ctx: int = 4096,
    n_gpu_layers: int = -1,
) -> None:
    """Initialize LlamaCppProvider with model configuration.

    Args:
        model_path: Path to the GGUF model file.
        n_ctx: Context window size. Default: 4096.
        n_gpu_layers: Layers to offload to GPU (-1 = all). Default: -1.
    """
    self._model_path = model_path or ""
    self._n_ctx = n_ctx
    self._n_gpu_layers = n_gpu_layers
    self._llm: Any = None
    self._current_lora: str | None = None
    self._loaded_adapters: dict[str, str] = {}  # id -> path
Functions
generate async
generate(
    prompt: str,
    model: str,
    adapter_id: str | None = None,
    max_tokens: int = 4096,
    system_prompt: str | None = None,
    temperature: float | None = None,
    top_p: float | None = None,
    repetition_penalty: float | None = None,
) -> GenerationResult

Generate text using llama-cpp-python with optional LoRA adapter.

Parameters:

Name Type Description Default
prompt str

The user-facing input prompt.

required
model str

Ignored (model is set at construction via model_path).

required
adapter_id str | None

LoRA adapter ID to apply. Must be loaded via load_adapter() before use.

None
max_tokens int

Maximum tokens to generate.

4096
system_prompt str | None

Optional system-level instruction. Prepended to the prompt for llama.cpp (no native chat template support).

None
temperature float | None

Sampling temperature override.

None
top_p float | None

Nucleus sampling threshold override.

None
repetition_penalty float | None

Repetition penalty override.

None

Returns:

Type Description
GenerationResult

GenerationResult with generated text and metadata.

Raises:

Type Description
ValueError

If adapter_id is provided but has not been loaded.

Source code in libs/inference/src/inference/llamacpp_provider.py, lines 100-159
async def generate(
    self,
    prompt: str,
    model: str,
    adapter_id: str | None = None,
    max_tokens: int = 4096,
    system_prompt: str | None = None,
    temperature: float | None = None,
    top_p: float | None = None,
    repetition_penalty: float | None = None,
) -> GenerationResult:
    """Generate text using llama-cpp-python with optional LoRA adapter.

    Args:
        prompt: The user-facing input prompt.
        model: Ignored (model is set at construction via model_path).
        adapter_id: LoRA adapter ID to apply. Must be loaded via
            load_adapter() before use.
        max_tokens: Maximum tokens to generate.
        system_prompt: Optional system-level instruction. Prepended to
            the prompt for llama.cpp (no native chat template support).
        temperature: Sampling temperature override.
        top_p: Nucleus sampling threshold override.
        repetition_penalty: Repetition penalty override.

    Returns:
        GenerationResult with generated text and metadata.

    Raises:
        ValueError: If adapter_id is provided but has not been loaded.
    """
    lora_path: str | None = None
    if adapter_id:
        if adapter_id not in self._loaded_adapters:
            raise ValueError(
                f"Adapter '{adapter_id}' has not been loaded. "
                "Call load_adapter() first."
            )
        lora_path = self._loaded_adapters[adapter_id]

    self._load_model_if_needed(lora_path=lora_path)

    full_prompt = f"{system_prompt}\n\n{prompt}" if system_prompt else prompt
    response = self._llm(  # type: ignore[union-attr]
        full_prompt,
        max_tokens=max_tokens,
        stop=_STOP_SEQUENCES,
    )

    text = response["choices"][0]["text"]
    token_count = response["usage"]["total_tokens"]
    finish_reason = response["choices"][0].get("finish_reason", "stop")

    return GenerationResult(
        text=text,
        model=Path(self._model_path).stem,
        adapter_id=adapter_id,
        token_count=token_count,
        finish_reason=finish_reason,
    )
load_adapter async
load_adapter(adapter_id: str, adapter_path: str) -> None

Register a LoRA adapter for use during generation.

The adapter is applied on the next generate() call by reloading the model with the LoRA path, since llama-cpp-python applies LoRA at model load time.

Parameters:

Name Type Description Default
adapter_id str

Unique name for the adapter.

required
adapter_path str

Filesystem path to the LoRA adapter weights (GGUF format).

required
Source code in libs/inference/src/inference/llamacpp_provider.py, lines 161-172
async def load_adapter(self, adapter_id: str, adapter_path: str) -> None:
    """Register a LoRA adapter for use during generation.

    The adapter is applied on next generate() call by reloading the model
    with the LoRA path. llama-cpp-python applies LoRA at model load time.

    Args:
        adapter_id: Unique name for the adapter.
        adapter_path: Filesystem path to the LoRA adapter weights (GGUF format).
    """
    self._loaded_adapters[adapter_id] = adapter_path
    logger.info("Registered adapter %s -> %s", adapter_id, adapter_path)
unload_adapter async
unload_adapter(adapter_id: str) -> None

Remove a registered LoRA adapter.

If the currently active adapter is unloaded, the model will reload without it on the next generate() call.

Parameters:

Name Type Description Default
adapter_id str

The adapter name to remove.

required
Source code in libs/inference/src/inference/llamacpp_provider.py, lines 174-188
async def unload_adapter(self, adapter_id: str) -> None:
    """Remove a registered LoRA adapter.

    If the currently active adapter is unloaded, the model will reload
    without it on the next generate() call.

    Args:
        adapter_id: The adapter name to remove.
    """
    if adapter_id in self._loaded_adapters:
        was_active = self._current_lora == self._loaded_adapters[adapter_id]
        del self._loaded_adapters[adapter_id]
        if was_active:
            self._current_lora = None  # Force reload without adapter
        logger.info("Unloaded adapter %s", adapter_id)
list_adapters async
list_adapters() -> list[str]

List all registered LoRA adapter IDs.

Returns:

Type Description
list[str]

Sorted list of registered adapter IDs.

Source code in libs/inference/src/inference/llamacpp_provider.py, lines 190-196
async def list_adapters(self) -> list[str]:
    """List all registered LoRA adapter IDs.

    Returns:
        Sorted list of registered adapter IDs.
    """
    return sorted(self._loaded_adapters.keys())
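
Putting the adapter lifecycle together, a typical flow might look like this; the model and adapter paths are illustrative:

import asyncio

from inference.llamacpp_provider import LlamaCppProvider

async def main() -> None:
    provider = LlamaCppProvider(model_path="/models/qwen2.5-coder-1.5b.gguf")

    # Register a GGUF LoRA adapter; the model reloads with it on the next generate().
    await provider.load_adapter("bugfix-lora", "/models/bugfix-lora.gguf")
    print(await provider.list_adapters())  # ["bugfix-lora"]

    result = await provider.generate(
        "def hello", model="ignored", adapter_id="bugfix-lora", max_tokens=128
    )
    print(result.finish_reason, result.token_count)

    await provider.unload_adapter("bugfix-lora")

asyncio.run(main())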

ollama_provider

OllamaProvider: InferenceProvider implementation backed by an Ollama server.

Uses Ollama's OpenAI-compatible endpoint for generation. Adapter operations raise UnsupportedOperationError since Ollama has no LoRA adapter concept.

Classes
OllamaProvider
OllamaProvider(base_url: str | None = None)

Bases: InferenceProvider

InferenceProvider backed by an Ollama server.

Uses Ollama's OpenAI-compatible API (/v1/chat/completions) for generation, keeping the HTTP layer symmetrical with VLLMProvider. Adapter operations are not supported — calling them raises UnsupportedOperationError.

Note

Ollama requires a non-empty api_key but ignores its value. The string "ollama" is used by convention.

Attributes:

Name Type Description
_client

AsyncOpenAI client pointing at the Ollama server.

Example

provider = OllamaProvider(base_url="http://localhost:11434/v1")
result = await provider.generate("def hello", model="qwen2.5-coder:7b")

Initialize OllamaProvider with an AsyncOpenAI client.

Parameters:

Name Type Description Default
base_url str | None

Override URL for the Ollama server. Defaults to OLLAMA_BASE_URL env var or http://localhost:11434/v1.

None
Source code in libs/inference/src/inference/ollama_provider.py, lines 39-51
def __init__(self, base_url: str | None = None) -> None:
    """Initialize OllamaProvider with an AsyncOpenAI client.

    Args:
        base_url: Override URL for the Ollama server. Defaults to
            OLLAMA_BASE_URL env var or http://localhost:11434/v1.
    """
    self._base_url = base_url or OLLAMA_BASE_URL
    # Ollama requires a non-empty api_key but ignores its value.
    self._client = AsyncOpenAI(
        base_url=self._base_url,
        api_key="ollama",
    )
Functions
generate async
generate(
    prompt: str,
    model: str,
    adapter_id: str | None = None,
    max_tokens: int = 4096,
    system_prompt: str | None = None,
    temperature: float | None = None,
    top_p: float | None = None,
    repetition_penalty: float | None = None,
) -> GenerationResult

Generate text from a prompt using the base Ollama model.

If adapter_id is provided, a warning is logged and it is ignored — Ollama does not support LoRA adapters. The base model is always used.

Parameters:

Name Type Description Default
prompt str

The user-facing input prompt.

required
model str

The Ollama model identifier (e.g. "qwen2.5-coder:7b").

required
adapter_id str | None

Ignored. If provided, a warning is logged.

None
max_tokens int

Maximum number of tokens to generate.

4096
system_prompt str | None

Optional system-level instruction.

None
temperature float | None

Sampling temperature override.

None
top_p float | None

Nucleus sampling threshold override.

None
repetition_penalty float | None

Repetition penalty override.

None

Returns:

Type Description
GenerationResult

GenerationResult with adapter_id=None (Ollama has no adapter concept).

Example

result = await provider.generate("def fib", model="qwen2.5-coder:7b") print(result.text)

Source code in libs/inference/src/inference/ollama_provider.py, lines 53-112
async def generate(
    self,
    prompt: str,
    model: str,
    adapter_id: str | None = None,
    max_tokens: int = 4096,
    system_prompt: str | None = None,
    temperature: float | None = None,
    top_p: float | None = None,
    repetition_penalty: float | None = None,
) -> GenerationResult:
    """Generate text from a prompt using the base Ollama model.

    If adapter_id is provided, a warning is logged and it is ignored —
    Ollama does not support LoRA adapters. The base model is always used.

    Args:
        prompt: The user-facing input prompt.
        model: The Ollama model identifier (e.g. "qwen2.5-coder:7b").
        adapter_id: Ignored. If provided, a warning is logged.
        max_tokens: Maximum number of tokens to generate.
        system_prompt: Optional system-level instruction.
        temperature: Sampling temperature override.
        top_p: Nucleus sampling threshold override.
        repetition_penalty: Repetition penalty override.

    Returns:
        GenerationResult with adapter_id=None (Ollama has no adapter concept).

    Example:
        >>> result = await provider.generate("def fib", model="qwen2.5-coder:7b")
        >>> print(result.text)
    """
    if adapter_id is not None:
        logger.warning(
            "OllamaProvider ignoring adapter_id=%s; "
            "Ollama does not support LoRA adapters.",
            adapter_id,
        )

    logger.debug("generate: model=%s max_tokens=%d", model, max_tokens)

    messages: list[dict[str, str]] = [{"role": "user", "content": prompt}]
    if system_prompt:
        messages.insert(0, {"role": "system", "content": system_prompt})

    response = await self._client.chat.completions.create(
        model=model,
        messages=messages,  # type: ignore[arg-type]
        max_tokens=max_tokens,
    )

    choice = response.choices[0]
    return GenerationResult(
        text=choice.message.content or "",
        model=response.model,
        adapter_id=None,
        token_count=response.usage.total_tokens if response.usage else 0,
        finish_reason=choice.finish_reason or "stop",
    )
load_adapter async
load_adapter(adapter_id: str, adapter_path: str) -> None

Not supported by Ollama. Always raises UnsupportedOperationError.

Parameters:

Name Type Description Default
adapter_id str

Unused.

required
adapter_path str

Unused.

required

Raises:

Type Description
UnsupportedOperationError

Always — Ollama does not support LoRA adapter loading. Use VLLMProvider for adapter operations.

Example

await provider.load_adapter("adapter-001", "/models/adapter-001")
# Raises UnsupportedOperationError
Source code in libs/inference/src/inference/ollama_provider.py, lines 114-132
async def load_adapter(self, adapter_id: str, adapter_path: str) -> None:
    """Not supported by Ollama. Always raises UnsupportedOperationError.

    Args:
        adapter_id: Unused.
        adapter_path: Unused.

    Raises:
        UnsupportedOperationError: Always — Ollama does not support
            LoRA adapter loading. Use VLLMProvider for adapter operations.

    Example:
        >>> await provider.load_adapter("adapter-001", "/models/adapter-001")
        # Raises UnsupportedOperationError
    """
    raise UnsupportedOperationError(
        "OllamaProvider does not support LoRA adapter loading. "
        "Use VLLMProvider for adapter operations."
    )
unload_adapter async
unload_adapter(adapter_id: str) -> None

Not supported by Ollama. Always raises UnsupportedOperationError.

Parameters:

Name Type Description Default
adapter_id str

Unused.

required

Raises:

Type Description
UnsupportedOperationError

Always — Ollama does not support LoRA adapter unloading.

Example

await provider.unload_adapter("adapter-001")
# Raises UnsupportedOperationError
Source code in libs/inference/src/inference/ollama_provider.py, lines 134-150
async def unload_adapter(self, adapter_id: str) -> None:
    """Not supported by Ollama. Always raises UnsupportedOperationError.

    Args:
        adapter_id: Unused.

    Raises:
        UnsupportedOperationError: Always — Ollama does not support
            LoRA adapter unloading.

    Example:
        >>> await provider.unload_adapter("adapter-001")
        # Raises UnsupportedOperationError
    """
    raise UnsupportedOperationError(
        "OllamaProvider does not support LoRA adapter unloading."
    )
list_adapters async
list_adapters() -> list[str]

Return an empty list — Ollama has no adapter concept.

Returns:

Type Description
list[str]

Always returns an empty list.

Example

adapters = await provider.list_adapters()
print(adapters)  # []

Source code in libs/inference/src/inference/ollama_provider.py, lines 152-162
async def list_adapters(self) -> list[str]:
    """Return an empty list — Ollama has no adapter concept.

    Returns:
        Always returns an empty list.

    Example:
        >>> adapters = await provider.list_adapters()
        >>> print(adapters)  # []
    """
    return []
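
Callers that may receive either backend can guard adapter operations instead of branching on the provider class; a minimal sketch, assuming both names are re-exported by the inference package:

from inference import InferenceProvider, UnsupportedOperationError

async def try_load_adapter(
    provider: InferenceProvider, adapter_id: str, adapter_path: str
) -> bool:
    """Return True if the adapter was loaded, False if the backend has no LoRA support."""
    try:
        await provider.load_adapter(adapter_id, adapter_path)
    except UnsupportedOperationError:
        return False
    return True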

provider

Abstract base class and shared types for inference providers.

Defines the provider-agnostic API that the agent loop consumes. Concrete implementations (VLLMProvider, OllamaProvider) fulfil this interface for their respective backends.

Classes
GenerationResult dataclass
GenerationResult(
    text: str,
    model: str,
    adapter_id: str | None,
    token_count: int,
    finish_reason: str,
)

Structured result returned by InferenceProvider.generate().

Attributes:

Name Type Description
text str

The generated text output from the model.

model str

The model identifier used for generation.

adapter_id str | None

The LoRA adapter applied during generation, or None if no adapter was used.

token_count int

Total number of tokens consumed (prompt + completion).

finish_reason str

Reason generation stopped (e.g. "stop", "length").

Example

result = GenerationResult(
    text="def hello(): pass",
    model="Qwen/Qwen2.5-Coder-7B",
    adapter_id=None,
    token_count=10,
    finish_reason="stop",
)

InferenceProvider

Bases: ABC

Abstract base class for inference providers.

Defines a provider-agnostic API for text generation and LoRA adapter lifecycle management. All methods are async because every provider communicates over HTTP.

Concrete implementations
  • VLLMProvider: Full LoRA support via vLLM's dynamic loading API.
  • OllamaProvider: Base-model inference only; adapter ops raise UnsupportedOperationError.
Functions
generate abstractmethod async
generate(
    prompt: str,
    model: str,
    adapter_id: str | None = None,
    max_tokens: int = 4096,
    system_prompt: str | None = None,
    temperature: float | None = None,
    top_p: float | None = None,
    repetition_penalty: float | None = None,
) -> GenerationResult

Generate text from a prompt.

Parameters:

Name Type Description Default
prompt str

The user-facing input prompt.

required
model str

The model identifier to use for generation.

required
adapter_id str | None

Optional LoRA adapter to apply during generation. If None, uses the base model directly.

None
max_tokens int

Maximum number of tokens to generate.

4096
system_prompt str | None

Optional system-level instruction. Providers that support chat templates will format this as a system message.

None
temperature float | None

Sampling temperature override.

None
top_p float | None

Nucleus sampling threshold override.

None
repetition_penalty float | None

Repetition penalty override.

None

Returns:

Type Description
GenerationResult

A GenerationResult containing the generated text and metadata.

Example

result = await provider.generate("def hello", model="Qwen2.5-Coder-7B") print(result.text)

Source code in libs/inference/src/inference/provider.py, lines 54-88
@abstractmethod
async def generate(
    self,
    prompt: str,
    model: str,
    adapter_id: str | None = None,
    max_tokens: int = 4096,
    system_prompt: str | None = None,
    temperature: float | None = None,
    top_p: float | None = None,
    repetition_penalty: float | None = None,
) -> GenerationResult:
    """Generate text from a prompt.

    Args:
        prompt: The user-facing input prompt.
        model: The model identifier to use for generation.
        adapter_id: Optional LoRA adapter to apply during generation.
            If None, uses the base model directly.
        max_tokens: Maximum number of tokens to generate.
        system_prompt: Optional system-level instruction. Providers that
            support chat templates will format this as a system message.

        temperature: Sampling temperature override.
        top_p: Nucleus sampling threshold override.
        repetition_penalty: Repetition penalty override.

    Returns:
        A GenerationResult containing the generated text and metadata.

    Example:
        >>> result = await provider.generate("def hello", model="Qwen2.5-Coder-7B")
        >>> print(result.text)
    """
    ...
load_adapter abstractmethod async
load_adapter(adapter_id: str, adapter_path: str) -> None

Load a LoRA adapter into the inference server.

Parameters:

Name Type Description Default
adapter_id str

Unique name for the adapter (used as lora_name in vLLM).

required
adapter_path str

Filesystem path to the adapter weights directory.

required

Raises:

Type Description
UnsupportedOperationError

If the provider does not support adapters.

HTTPStatusError

If the server returns an error response.

Example

await provider.load_adapter("adapter-001", "/models/adapter-001")

Source code in libs/inference/src/inference/provider.py, lines 90-105
@abstractmethod
async def load_adapter(self, adapter_id: str, adapter_path: str) -> None:
    """Load a LoRA adapter into the inference server.

    Args:
        adapter_id: Unique name for the adapter (used as lora_name in vLLM).
        adapter_path: Filesystem path to the adapter weights directory.

    Raises:
        UnsupportedOperationError: If the provider does not support adapters.
        httpx.HTTPStatusError: If the server returns an error response.

    Example:
        >>> await provider.load_adapter("adapter-001", "/models/adapter-001")
    """
    ...
unload_adapter abstractmethod async
unload_adapter(adapter_id: str) -> None

Unload a previously loaded LoRA adapter.

Parameters:

Name Type Description Default
adapter_id str

The adapter name to remove from the server.

required

Raises:

Type Description
UnsupportedOperationError

If the provider does not support adapters.

HTTPStatusError

If the server returns an error response.

Example

await provider.unload_adapter("adapter-001")

Source code in libs/inference/src/inference/provider.py, lines 107-121
@abstractmethod
async def unload_adapter(self, adapter_id: str) -> None:
    """Unload a previously loaded LoRA adapter.

    Args:
        adapter_id: The adapter name to remove from the server.

    Raises:
        UnsupportedOperationError: If the provider does not support adapters.
        httpx.HTTPStatusError: If the server returns an error response.

    Example:
        >>> await provider.unload_adapter("adapter-001")
    """
    ...
list_adapters abstractmethod async
list_adapters() -> list[str]

List all currently loaded LoRA adapters.

Returns:

Type Description
list[str]

Sorted list of adapter IDs currently available for inference. Returns an empty list if no adapters are loaded or the provider does not support adapters.

Example

adapters = await provider.list_adapters()
print(adapters)  # ["adapter-001", "adapter-002"]

Source code in libs/inference/src/inference/provider.py, lines 123-136
@abstractmethod
async def list_adapters(self) -> list[str]:
    """List all currently loaded LoRA adapters.

    Returns:
        Sorted list of adapter IDs currently available for inference.
        Returns an empty list if no adapters are loaded or the provider
        does not support adapters.

    Example:
        >>> adapters = await provider.list_adapters()
        >>> print(adapters)  # ["adapter-001", "adapter-002"]
    """
    ...
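
A new backend only needs to implement these four coroutines. The EchoProvider below is a purely illustrative skeleton (useful as a test double), not part of the library:

from inference.provider import GenerationResult, InferenceProvider

class EchoProvider(InferenceProvider):
    """Toy provider that echoes the prompt back."""

    async def generate(
        self,
        prompt: str,
        model: str,
        adapter_id: str | None = None,
        max_tokens: int = 4096,
        system_prompt: str | None = None,
        temperature: float | None = None,
        top_p: float | None = None,
        repetition_penalty: float | None = None,
    ) -> GenerationResult:
        # Return the prompt unchanged, with a rough whitespace token count.
        return GenerationResult(
            text=prompt,
            model=model,
            adapter_id=adapter_id,
            token_count=len(prompt.split()),
            finish_reason="stop",
        )

    async def load_adapter(self, adapter_id: str, adapter_path: str) -> None:
        pass  # No adapter state to manage.

    async def unload_adapter(self, adapter_id: str) -> None:
        pass

    async def list_adapters(self) -> list[str]:
        return []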

transformers_provider

TransformersProvider: InferenceProvider using HuggingFace transformers + PEFT.

Loads models via AutoModelForCausalLM and applies LoRA adapters via PEFT. This is the only provider that natively supports PEFT-format adapters (safetensors) as output by the hypernetwork.

IMPORTANT: transformers, torch, and peft are imported inside method bodies per the INFRA-05 pattern so that this module remains importable in CPU-only CI.

Classes
TransformersProvider
TransformersProvider(
    model_name: str = "",
    device: str = "cpu",
    torch_dtype: str = "auto",
)

Bases: InferenceProvider

InferenceProvider backed by HuggingFace transformers with PEFT LoRA.

Loads models locally via AutoModelForCausalLM. Adapters are applied via PEFT's PeftModel, which natively reads the safetensors format output by the hypernetwork.

Parameters:

Name Type Description Default
model_name str

HuggingFace model ID or local path.

''
device str

Device to load model onto ('cpu', 'mps', 'cuda').

'cpu'
torch_dtype str

Model dtype ('auto', 'float16', 'bfloat16').

'auto'
Example

provider = TransformersProvider(model_name="Qwen/Qwen2.5-Coder-0.5B")
result = await provider.generate("def hello", model="ignored")

Initialize TransformersProvider.

Parameters:

Name Type Description Default
model_name str

HuggingFace model ID or local path.

''
device str

Device to load model onto.

'cpu'
torch_dtype str

Model dtype string.

'auto'
Source code in libs/inference/src/inference/transformers_provider.py, lines 39-60
def __init__(
    self,
    model_name: str = "",
    device: str = "cpu",
    torch_dtype: str = "auto",
) -> None:
    """Initialize TransformersProvider.

    Args:
        model_name: HuggingFace model ID or local path.
        device: Device to load model onto.
        torch_dtype: Model dtype string.
    """
    self._model_name = model_name
    self._device = device
    self._torch_dtype = torch_dtype
    self._model: Any = None
    self._tokenizer: Any = None
    self._base_model: Any = None
    self._loaded_adapters: dict[str, str] = {}  # id -> path
    self._active_adapter: str | None = None
    self._is_peft_wrapped: bool = False
Functions
generate async
generate(
    prompt: str,
    model: str,
    adapter_id: str | None = None,
    max_tokens: int = 4096,
    system_prompt: str | None = None,
    temperature: float | None = None,
    top_p: float | None = None,
    repetition_penalty: float | None = None,
) -> GenerationResult

Generate text using transformers with optional PEFT adapter.

Parameters:

Name Type Description Default
prompt str

The user-facing input prompt.

required
model str

Ignored (model is set at construction).

required
adapter_id str | None

LoRA adapter ID to activate. Must be loaded via load_adapter() before use.

None
max_tokens int

Maximum tokens to generate.

4096
system_prompt str | None

Optional system-level instruction prepended via the tokenizer's chat template when available.

None
temperature float | None

Sampling temperature (default from pipeline config).

None
top_p float | None

Nucleus sampling threshold (default from pipeline config).

None
repetition_penalty float | None

Repetition penalty (default 1.0 = off).

None

Returns:

Type Description
GenerationResult

GenerationResult with generated text and metadata.

Raises:

Type Description
ValueError

If adapter_id is provided but has not been loaded.

Source code in libs/inference/src/inference/transformers_provider.py, lines 104-197
async def generate(
    self,
    prompt: str,
    model: str,
    adapter_id: str | None = None,
    max_tokens: int = 4096,
    system_prompt: str | None = None,
    temperature: float | None = None,
    top_p: float | None = None,
    repetition_penalty: float | None = None,
) -> GenerationResult:
    """Generate text using transformers with optional PEFT adapter.

    Args:
        prompt: The user-facing input prompt.
        model: Ignored (model is set at construction).
        adapter_id: LoRA adapter ID to activate. Must be loaded via
            load_adapter() before use.
        max_tokens: Maximum tokens to generate.
        system_prompt: Optional system-level instruction prepended via
            the tokenizer's chat template when available.
        temperature: Sampling temperature (default from pipeline config).
        top_p: Nucleus sampling threshold (default from pipeline config).
        repetition_penalty: Repetition penalty (default 1.0 = off).

    Returns:
        GenerationResult with generated text and metadata.

    Raises:
        ValueError: If adapter_id is provided but has not been loaded.
    """
    import torch  # noqa: PLC0415

    self._load_model_if_needed()

    # Apply defaults from pipeline config
    if temperature is None:
        temperature = float(os.environ.get("RUNE_TEMPERATURE", "0.25"))
    if top_p is None:
        top_p = float(os.environ.get("RUNE_TOP_P", "0.9"))
    if repetition_penalty is None:
        repetition_penalty = float(
            os.environ.get("RUNE_REPETITION_PENALTY", "1.04")
        )

    # Validate adapter before switching
    if adapter_id and adapter_id not in self._loaded_adapters:
        raise ValueError(
            f"Adapter '{adapter_id}' has not been loaded. "
            "Call load_adapter() first."
        )

    # Switch adapter if needed
    if adapter_id and adapter_id != self._active_adapter:
        self._activate_adapter(adapter_id)
    elif not adapter_id and self._active_adapter:
        self._deactivate_adapter()

    # Build chat-formatted prompt via tokenizer's chat template
    formatted = self._format_prompt(prompt, system_prompt)
    inputs = self._tokenizer(
        formatted, return_tensors="pt", truncation=True, max_length=8192
    )
    inputs = {k: v.to(self._device) for k, v in inputs.items()}
    input_len = inputs["input_ids"].shape[1]

    gen_kwargs: dict[str, object] = {
        "max_new_tokens": max_tokens,
        "do_sample": temperature > 0,
        "temperature": max(temperature, 0.01),
        "top_p": top_p,
        "pad_token_id": self._tokenizer.pad_token_id,
    }
    if repetition_penalty > 1.0:
        gen_kwargs["repetition_penalty"] = repetition_penalty

    with torch.no_grad():
        outputs = self._model.generate(**inputs, **gen_kwargs)

    new_tokens = outputs[0][input_len:]
    text = self._tokenizer.decode(new_tokens, skip_special_tokens=True)
    total_tokens = outputs.shape[1]
    new_token_count = len(new_tokens)

    # Detect truncation: generated exactly max_tokens means cut off
    finish_reason = "length" if new_token_count >= max_tokens else "stop"

    return GenerationResult(
        text=text,
        model=self._model_name,
        adapter_id=self._active_adapter,
        token_count=total_tokens,
        finish_reason=finish_reason,
    )
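
When temperature, top_p, or repetition_penalty are not passed, the values fall back to the RUNE_* environment variables shown in the code above; a small sketch of relying on those defaults (the model name is illustrative):

import asyncio
import os

from inference.transformers_provider import TransformersProvider

async def main() -> None:
    # Mirror the fallbacks read inside generate(); set these before calling it.
    os.environ["RUNE_TEMPERATURE"] = "0.0"   # greedy decoding (do_sample=False)
    os.environ["RUNE_TOP_P"] = "0.9"

    provider = TransformersProvider(model_name="Qwen/Qwen2.5-Coder-0.5B", device="cpu")
    result = await provider.generate("def fib(n):", model="ignored", max_tokens=64)
    # finish_reason is "length" when the max_tokens budget was exhausted.
    print(result.finish_reason, result.token_count)

asyncio.run(main())
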
load_adapter async
load_adapter(adapter_id: str, adapter_path: str) -> None

Register a PEFT adapter directory for use during generation.

The adapter directory must contain adapter_model.safetensors and adapter_config.json as output by save_hypernetwork_adapter().

Parameters:

Name Type Description Default
adapter_id str

Unique name for the adapter.

required
adapter_path str

Path to the PEFT adapter directory.

required
Source code in libs/inference/src/inference/transformers_provider.py, lines 267-278
async def load_adapter(self, adapter_id: str, adapter_path: str) -> None:
    """Register a PEFT adapter directory for use during generation.

    The adapter directory must contain adapter_model.safetensors and
    adapter_config.json as output by save_hypernetwork_adapter().

    Args:
        adapter_id: Unique name for the adapter.
        adapter_path: Path to the PEFT adapter directory.
    """
    self._loaded_adapters[adapter_id] = adapter_path
    logger.info("Registered adapter %s -> %s", adapter_id, adapter_path)
unload_adapter async
unload_adapter(adapter_id: str) -> None

Remove a registered adapter, freeing GPU memory.

Parameters:

Name Type Description Default
adapter_id str

The adapter name to remove.

required
Source code in libs/inference/src/inference/transformers_provider.py, lines 280-299
async def unload_adapter(self, adapter_id: str) -> None:
    """Remove a registered adapter, freeing GPU memory.

    Args:
        adapter_id: The adapter name to remove.
    """
    if adapter_id in self._loaded_adapters:
        if self._active_adapter == adapter_id:
            self._deactivate_adapter()
        # Delete from PeftModel to free GPU memory
        if self._is_peft_wrapped and adapter_id in self._model.peft_config:
            self._model.delete_adapter(adapter_id)
        del self._loaded_adapters[adapter_id]
        # If no adapters remain, revert to base model
        if not self._loaded_adapters and self._is_peft_wrapped:
            self._model = self._base_model
            self._is_peft_wrapped = False
            logger.info("All adapters removed, reverted to base model")
        else:
            logger.info("Unloaded adapter %s", adapter_id)
list_adapters async
list_adapters() -> list[str]

List all registered adapter IDs.

Returns:

Type Description
list[str]

Sorted list of registered adapter IDs.

Source code in libs/inference/src/inference/transformers_provider.py, lines 301-307
async def list_adapters(self) -> list[str]:
    """List all registered adapter IDs.

    Returns:
        Sorted list of registered adapter IDs.
    """
    return sorted(self._loaded_adapters.keys())
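
A PEFT adapter round trip with this provider might look like the following; the adapter directory is hypothetical but, as noted above, must contain adapter_model.safetensors and adapter_config.json:

import asyncio

from inference.transformers_provider import TransformersProvider

async def main() -> None:
    provider = TransformersProvider(model_name="Qwen/Qwen2.5-Coder-0.5B", device="cpu")

    # Register, use, and remove a PEFT adapter directory.
    await provider.load_adapter("task-001", "/adapters/task-001")
    result = await provider.generate(
        "def add(a, b):", model="ignored", adapter_id="task-001"
    )
    print(result.adapter_id)  # "task-001"

    await provider.unload_adapter("task-001")
    print(await provider.list_adapters())  # []

asyncio.run(main())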

vllm_provider

VLLMProvider: InferenceProvider implementation backed by a vLLM server.

Uses the OpenAI-compatible API for generation and vLLM's proprietary LoRA management endpoints for hot-loading adapters at runtime.

Classes
VLLMProvider
VLLMProvider(base_url: str | None = None)

Bases: InferenceProvider

InferenceProvider backed by a vLLM server with LoRA hot-loading support.

Communicates with vLLM via two channels
  • AsyncOpenAI SDK for generation (OpenAI-compatible endpoint).
  • httpx for LoRA adapter management (vLLM proprietary endpoints).

Adapter tracking is maintained in an internal set to work around vLLM bug #11761 (list_lora_adapters unreliable after concurrent loads).

Attributes:

Name Type Description
_client

AsyncOpenAI client pointing at the vLLM server.

_base_url

Base URL string for constructing adapter management URLs.

_loaded_adapters set[str]

Set of currently tracked adapter IDs.

Example

provider = VLLMProvider(base_url="http://localhost:8100/v1")
result = await provider.generate("def hello", model="Qwen2.5-Coder-7B")

Initialize VLLMProvider with an AsyncOpenAI client.

Parameters:

Name Type Description Default
base_url str | None

Override URL for the vLLM server. Defaults to VLLM_BASE_URL env var or http://localhost:8100/v1.

None
Source code in libs/inference/src/inference/vllm_provider.py, lines 40-52
def __init__(self, base_url: str | None = None) -> None:
    """Initialize VLLMProvider with an AsyncOpenAI client.

    Args:
        base_url: Override URL for the vLLM server. Defaults to
            VLLM_BASE_URL env var or http://localhost:8100/v1.
    """
    self._base_url = base_url or VLLM_BASE_URL
    self._client = AsyncOpenAI(
        base_url=self._base_url,
        api_key="not-needed-for-local-vllm",
    )
    self._loaded_adapters: set[str] = set()
Functions
generate async
generate(
    prompt: str,
    model: str,
    adapter_id: str | None = None,
    max_tokens: int = 4096,
    system_prompt: str | None = None,
    temperature: float | None = None,
    top_p: float | None = None,
    repetition_penalty: float | None = None,
) -> GenerationResult

Generate text from a prompt, optionally using a loaded LoRA adapter.

When adapter_id is provided, it is passed as the model parameter to the OpenAI API — this is how vLLM identifies and routes to loaded LoRA adapters (the adapter is referenced by its lora_name).

Parameters:

Name Type Description Default
prompt str

The user-facing input prompt.

required
model str

Base model identifier. Used as-is when no adapter is given.

required
adapter_id str | None

Name of a loaded LoRA adapter to apply. When set, this value replaces model in the API call.

None
max_tokens int

Maximum number of tokens to generate.

4096
system_prompt str | None

Optional system-level instruction.

None
temperature float | None

Sampling temperature override.

None
top_p float | None

Nucleus sampling threshold override.

None
repetition_penalty float | None

Repetition penalty override.

None

Returns:

Type Description
GenerationResult

GenerationResult with the generated text and metadata.

Example

result = await provider.generate("def fib", model="Qwen2.5-Coder-7B") print(result.text)

Source code in libs/inference/src/inference/vllm_provider.py, lines 54-114
async def generate(
    self,
    prompt: str,
    model: str,
    adapter_id: str | None = None,
    max_tokens: int = 4096,
    system_prompt: str | None = None,
    temperature: float | None = None,
    top_p: float | None = None,
    repetition_penalty: float | None = None,
) -> GenerationResult:
    """Generate text from a prompt, optionally using a loaded LoRA adapter.

    When adapter_id is provided, it is passed as the model parameter to
    the OpenAI API — this is how vLLM identifies and routes to loaded
    LoRA adapters (the adapter is referenced by its lora_name).

    Args:
        prompt: The user-facing input prompt.
        model: Base model identifier. Used as-is when no adapter is given.
        adapter_id: Name of a loaded LoRA adapter to apply. When set,
            this value replaces model in the API call.
        max_tokens: Maximum number of tokens to generate.
        system_prompt: Optional system-level instruction.
        temperature: Sampling temperature override.
        top_p: Nucleus sampling threshold override.
        repetition_penalty: Repetition penalty override.

    Returns:
        GenerationResult with the generated text and metadata.

    Example:
        >>> result = await provider.generate("def fib", model="Qwen2.5-Coder-7B")
        >>> print(result.text)
    """
    effective_model = adapter_id if adapter_id is not None else model
    logger.debug(
        "generate: model=%s adapter_id=%s max_tokens=%d",
        effective_model,
        adapter_id,
        max_tokens,
    )

    messages: list[dict[str, str]] = [{"role": "user", "content": prompt}]
    if system_prompt:
        messages.insert(0, {"role": "system", "content": system_prompt})

    response = await self._client.chat.completions.create(
        model=effective_model,
        messages=messages,  # type: ignore[arg-type]
        max_tokens=max_tokens,
    )

    choice = response.choices[0]
    return GenerationResult(
        text=choice.message.content or "",
        model=response.model,
        adapter_id=adapter_id,
        token_count=response.usage.total_tokens if response.usage else 0,
        finish_reason=choice.finish_reason or "stop",
    )
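
Because a loaded adapter is addressed by sending its name in place of the model, switching between the base model and an adapter is just a matter of the adapter_id argument; a brief sketch (the adapter must already be loaded, and the names are illustrative):

from inference.vllm_provider import VLLMProvider

async def compare(provider: VLLMProvider) -> None:
    # Routed to the base model.
    base = await provider.generate("def fib", model="Qwen2.5-Coder-7B")
    # Routed to a loaded LoRA adapter; "adapter-001" replaces the model field
    # in the underlying chat completion request.
    tuned = await provider.generate(
        "def fib", model="Qwen2.5-Coder-7B", adapter_id="adapter-001"
    )
    print(len(base.text), len(tuned.text))
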
load_adapter async
load_adapter(adapter_id: str, adapter_path: str) -> None

Load a LoRA adapter into the vLLM server.

Posts to vLLM's /v1/load_lora_adapter endpoint and adds the adapter to the internal tracking set on success.

Parameters:

Name Type Description Default
adapter_id str

Unique name for the adapter (used as lora_name).

required
adapter_path str

Filesystem path to the adapter weights directory.

required

Raises:

Type Description
HTTPStatusError

If the vLLM server returns an error response.

Example

await provider.load_adapter("adapter-001", "/models/adapter-001")

Source code in libs/inference/src/inference/vllm_provider.py, lines 116-143
async def load_adapter(self, adapter_id: str, adapter_path: str) -> None:
    """Load a LoRA adapter into the vLLM server.

    Posts to vLLM's /v1/load_lora_adapter endpoint and adds the adapter
    to the internal tracking set on success.

    Args:
        adapter_id: Unique name for the adapter (used as lora_name).
        adapter_path: Filesystem path to the adapter weights directory.

    Raises:
        httpx.HTTPStatusError: If the vLLM server returns an error response.

    Example:
        >>> await provider.load_adapter("adapter-001", "/models/adapter-001")
    """
    url = f"{self._base_url.rstrip('/')}/load_lora_adapter"
    logger.debug("load_adapter: POST %s lora_name=%s", url, adapter_id)

    async with httpx.AsyncClient() as client:
        response = await client.post(
            url,
            json={"lora_name": adapter_id, "lora_path": adapter_path},
        )
        response.raise_for_status()

    self._loaded_adapters.add(adapter_id)
    logger.info("Adapter loaded: %s", adapter_id)
unload_adapter async
unload_adapter(adapter_id: str) -> None

Unload a LoRA adapter from the vLLM server.

Posts to vLLM's /v1/unload_lora_adapter endpoint and removes the adapter from the internal tracking set.

Parameters:

Name Type Description Default
adapter_id str

Name of the adapter to unload.

required

Raises:

Type Description
HTTPStatusError

If the vLLM server returns an error response.

Example

await provider.unload_adapter("adapter-001")

Source code in libs/inference/src/inference/vllm_provider.py, lines 145-171
async def unload_adapter(self, adapter_id: str) -> None:
    """Unload a LoRA adapter from the vLLM server.

    Posts to vLLM's /v1/unload_lora_adapter endpoint and removes the
    adapter from the internal tracking set.

    Args:
        adapter_id: Name of the adapter to unload.

    Raises:
        httpx.HTTPStatusError: If the vLLM server returns an error response.

    Example:
        >>> await provider.unload_adapter("adapter-001")
    """
    url = f"{self._base_url.rstrip('/')}/unload_lora_adapter"
    logger.debug("unload_adapter: POST %s lora_name=%s", url, adapter_id)

    async with httpx.AsyncClient() as client:
        response = await client.post(
            url,
            json={"lora_name": adapter_id},
        )
        response.raise_for_status()

    self._loaded_adapters.discard(adapter_id)
    logger.info("Adapter unloaded: %s", adapter_id)
list_adapters async
list_adapters() -> list[str]

List all currently loaded LoRA adapters.

Returns the internal tracking set rather than querying vLLM to avoid the unreliable list endpoint (vLLM bug #11761).

Returns:

Type Description
list[str]

Sorted list of adapter IDs currently tracked as loaded.

Example

adapters = await provider.list_adapters()
print(adapters)  # ["adapter-001", "adapter-002"]

Source code in libs/inference/src/inference/vllm_provider.py, lines 173-186
async def list_adapters(self) -> list[str]:
    """List all currently loaded LoRA adapters.

    Returns the internal tracking set rather than querying vLLM to avoid
    the unreliable list endpoint (vLLM bug #11761).

    Returns:
        Sorted list of adapter IDs currently tracked as loaded.

    Example:
        >>> adapters = await provider.list_adapters()
        >>> print(adapters)  # ["adapter-001", "adapter-002"]
    """
    return sorted(self._loaded_adapters)
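
End to end, hot-loading an adapter into a running vLLM server and generating through it might look like this; the server URL and paths are illustrative:

import asyncio

from inference.vllm_provider import VLLMProvider

async def main() -> None:
    provider = VLLMProvider(base_url="http://localhost:8100/v1")

    # Hot-load the adapter, generate through it, then unload it.
    await provider.load_adapter("adapter-001", "/models/adapter-001")
    print(await provider.list_adapters())  # ["adapter-001"]

    result = await provider.generate(
        "def fib", model="Qwen2.5-Coder-7B", adapter_id="adapter-001"
    )
    print(result.text)

    await provider.unload_adapter("adapter-001")

asyncio.run(main())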