API Reference for inference¶
inference ¶
Inference provider library for LLM generation and LoRA adapter management.
Provides a provider-agnostic interface (InferenceProvider) with vLLM and Ollama backends, a factory for backend selection by configuration, and structured generation results.
Provider classes (OllamaProvider, VLLMProvider) are lazily imported to avoid
hard failures when the openai package is not installed (e.g. in CI).
Classes¶
UnsupportedOperationError ¶
Bases: Exception
Raised when a provider does not support the requested operation.
Used primarily by OllamaProvider to signal that LoRA adapter operations are not available for Ollama-based inference.
Example
raise UnsupportedOperationError("OllamaProvider does not support adapters.")
GenerationResult
dataclass
¶
GenerationResult(
text: str,
model: str,
adapter_id: str | None,
token_count: int,
finish_reason: str,
)
Structured result returned by InferenceProvider.generate().
Attributes:

| Name | Type | Description |
|---|---|---|
| text | str | The generated text output from the model. |
| model | str | The model identifier used for generation. |
| adapter_id | str \| None | The LoRA adapter applied during generation, or None if no adapter was used. |
| token_count | int | Total number of tokens consumed (prompt + completion). |
| finish_reason | str | Reason generation stopped (e.g. "stop", "length"). |
Example

>>> result = GenerationResult(
...     text="def hello(): pass",
...     model="Qwen/Qwen2.5-Coder-7B",
...     adapter_id=None,
...     token_count=10,
...     finish_reason="stop",
... )
InferenceProvider ¶
Bases: ABC
Abstract base class for inference providers.
Defines a provider-agnostic API for text generation and LoRA adapter lifecycle management. All methods are async because every provider communicates over HTTP.
Concrete implementations
- VLLMProvider: Full LoRA support via vLLM's dynamic loading API.
- OllamaProvider: Base-model inference only; adapter ops raise UnsupportedOperationError.
Functions¶
generate
abstractmethod
async
¶
generate(
prompt: str,
model: str,
adapter_id: str | None = None,
max_tokens: int = 4096,
system_prompt: str | None = None,
temperature: float | None = None,
top_p: float | None = None,
repetition_penalty: float | None = None,
) -> GenerationResult
Generate text from a prompt.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| prompt | str | The user-facing input prompt. | required |
| model | str | The model identifier to use for generation. | required |
| adapter_id | str \| None | Optional LoRA adapter to apply during generation. If None, uses the base model directly. | None |
| max_tokens | int | Maximum number of tokens to generate. | 4096 |
| system_prompt | str \| None | Optional system-level instruction. Providers that support chat templates will format this as a system message. | None |
| temperature | float \| None | Sampling temperature override. | None |
| top_p | float \| None | Nucleus sampling threshold override. | None |
| repetition_penalty | float \| None | Repetition penalty override. | None |

Returns:

| Type | Description |
|---|---|
| GenerationResult | A GenerationResult containing the generated text and metadata. |
Example

>>> result = await provider.generate("def hello", model="Qwen2.5-Coder-7B")
>>> print(result.text)
Source code in libs/inference/src/inference/provider.py
load_adapter
abstractmethod
async
¶
load_adapter(adapter_id: str, adapter_path: str) -> None
Load a LoRA adapter into the inference server.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| adapter_id | str | Unique name for the adapter (used as lora_name in vLLM). | required |
| adapter_path | str | Filesystem path to the adapter weights directory. | required |

Raises:

| Type | Description |
|---|---|
| UnsupportedOperationError | If the provider does not support adapters. |
| HTTPStatusError | If the server returns an error response. |
Example
await provider.load_adapter("adapter-001", "/models/adapter-001")
Source code in libs/inference/src/inference/provider.py
unload_adapter
abstractmethod
async
¶
unload_adapter(adapter_id: str) -> None
Unload a previously loaded LoRA adapter.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| adapter_id | str | The adapter name to remove from the server. | required |

Raises:

| Type | Description |
|---|---|
| UnsupportedOperationError | If the provider does not support adapters. |
| HTTPStatusError | If the server returns an error response. |
Example
await provider.unload_adapter("adapter-001")
Source code in libs/inference/src/inference/provider.py
list_adapters
abstractmethod
async
¶
list_adapters() -> list[str]
List all currently loaded LoRA adapters.
Returns:

| Type | Description |
|---|---|
| list[str] | Sorted list of adapter IDs currently available for inference. Returns an empty list if no adapters are loaded or the provider does not support adapters. |
Example

>>> adapters = await provider.list_adapters()
>>> print(adapters)  # ["adapter-001", "adapter-002"]
Source code in libs/inference/src/inference/provider.py
LlamaCppProvider ¶
LlamaCppProvider(
model_path: str | None = None,
n_ctx: int = 4096,
n_gpu_layers: int = -1,
)
Bases: InferenceProvider
InferenceProvider backed by llama-cpp-python with native LoRA support.
Unlike OllamaProvider, this loads GGUF models directly and can apply LoRA adapters at load time. Unlike VLLMProvider, no server is needed.
The model is loaded lazily on first generate() call. When an adapter is loaded, the model is reloaded with the LoRA path applied.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| model_path | str \| None | Path to the GGUF model file. | None |
| n_ctx | int | Context window size. | 4096 |
| n_gpu_layers | int | Layers to offload to GPU (-1 = all). | -1 |
Example

>>> provider = LlamaCppProvider(model_path="/models/qwen2.5-coder-1.5b.gguf")
>>> result = await provider.generate("def hello", model="ignored")
Initialize LlamaCppProvider with model configuration.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| model_path | str \| None | Path to the GGUF model file. | None |
| n_ctx | int | Context window size. | 4096 |
| n_gpu_layers | int | Layers to offload to GPU (-1 = all). | -1 |
Source code in libs/inference/src/inference/llamacpp_provider.py
Functions¶
generate
async
¶
generate(
prompt: str,
model: str,
adapter_id: str | None = None,
max_tokens: int = 4096,
system_prompt: str | None = None,
temperature: float | None = None,
top_p: float | None = None,
repetition_penalty: float | None = None,
) -> GenerationResult
Generate text using llama-cpp-python with optional LoRA adapter.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| prompt | str | The user-facing input prompt. | required |
| model | str | Ignored (model is set at construction via model_path). | required |
| adapter_id | str \| None | LoRA adapter ID to apply. Must be loaded via load_adapter() before use. | None |
| max_tokens | int | Maximum tokens to generate. | 4096 |
| system_prompt | str \| None | Optional system-level instruction. Prepended to the prompt for llama.cpp (no native chat template support). | None |
| temperature | float \| None | Sampling temperature override. | None |
| top_p | float \| None | Nucleus sampling threshold override. | None |
| repetition_penalty | float \| None | Repetition penalty override. | None |

Returns:

| Type | Description |
|---|---|
| GenerationResult | GenerationResult with generated text and metadata. |

Raises:

| Type | Description |
|---|---|
| ValueError | If adapter_id is provided but has not been loaded. |
Source code in libs/inference/src/inference/llamacpp_provider.py
load_adapter
async
¶
load_adapter(adapter_id: str, adapter_path: str) -> None
Register a LoRA adapter for use during generation.
The adapter is applied on next generate() call by reloading the model with the LoRA path. llama-cpp-python applies LoRA at model load time.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| adapter_id | str | Unique name for the adapter. | required |
| adapter_path | str | Filesystem path to the LoRA adapter weights (GGUF format). | required |
Source code in libs/inference/src/inference/llamacpp_provider.py
unload_adapter
async
¶
unload_adapter(adapter_id: str) -> None
Remove a registered LoRA adapter.
If the currently active adapter is unloaded, the model will reload without it on the next generate() call.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| adapter_id | str | The adapter name to remove. | required |
Source code in libs/inference/src/inference/llamacpp_provider.py
list_adapters
async
¶
list_adapters() -> list[str]
List all registered LoRA adapter IDs.
Returns:

| Type | Description |
|---|---|
| list[str] | Sorted list of registered adapter IDs. |
Source code in libs/inference/src/inference/llamacpp_provider.py
OllamaProvider ¶
OllamaProvider(base_url: str | None = None)
Bases: InferenceProvider
InferenceProvider backed by an Ollama server.
Uses Ollama's OpenAI-compatible API (/v1/chat/completions) for generation, keeping the HTTP layer symmetrical with VLLMProvider. Adapter operations are not supported — calling them raises UnsupportedOperationError.
Note
Ollama requires a non-empty api_key but ignores its value. The string "ollama" is used by convention.
Attributes:

| Name | Type | Description |
|---|---|---|
| _client | | AsyncOpenAI client pointing at the Ollama server. |
Example

>>> provider = OllamaProvider(base_url="http://localhost:11434/v1")
>>> result = await provider.generate("def hello", model="qwen2.5-coder:7b")
Initialize OllamaProvider with an AsyncOpenAI client.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| base_url | str \| None | Override URL for the Ollama server. Defaults to OLLAMA_BASE_URL env var or http://localhost:11434/v1. | None |
Source code in libs/inference/src/inference/ollama_provider.py
Functions¶
generate
async
¶
generate(
prompt: str,
model: str,
adapter_id: str | None = None,
max_tokens: int = 4096,
system_prompt: str | None = None,
temperature: float | None = None,
top_p: float | None = None,
repetition_penalty: float | None = None,
) -> GenerationResult
Generate text from a prompt using the base Ollama model.
If adapter_id is provided, a warning is logged and it is ignored — Ollama does not support LoRA adapters. The base model is always used.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| prompt | str | The user-facing input prompt. | required |
| model | str | The Ollama model identifier (e.g. "qwen2.5-coder:7b"). | required |
| adapter_id | str \| None | Ignored. If provided, a warning is logged. | None |
| max_tokens | int | Maximum number of tokens to generate. | 4096 |
| system_prompt | str \| None | Optional system-level instruction. | None |
| temperature | float \| None | Sampling temperature override. | None |
| top_p | float \| None | Nucleus sampling threshold override. | None |
| repetition_penalty | float \| None | Repetition penalty override. | None |

Returns:

| Type | Description |
|---|---|
| GenerationResult | GenerationResult with adapter_id=None (Ollama has no adapter concept). |
Example

>>> result = await provider.generate("def fib", model="qwen2.5-coder:7b")
>>> print(result.text)
Source code in libs/inference/src/inference/ollama_provider.py
load_adapter
async
¶
load_adapter(adapter_id: str, adapter_path: str) -> None
Not supported by Ollama. Always raises UnsupportedOperationError.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| adapter_id | str | Unused. | required |
| adapter_path | str | Unused. | required |

Raises:

| Type | Description |
|---|---|
| UnsupportedOperationError | Always raised; Ollama does not support LoRA adapter loading. Use VLLMProvider for adapter operations. |
Example
await provider.load_adapter("adapter-001", "/models/adapter-001")
Source code in libs/inference/src/inference/ollama_provider.py
unload_adapter
async
¶
unload_adapter(adapter_id: str) -> None
Not supported by Ollama. Always raises UnsupportedOperationError.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| adapter_id | str | Unused. | required |

Raises:

| Type | Description |
|---|---|
| UnsupportedOperationError | Always raised; Ollama does not support LoRA adapter unloading. |
Source code in libs/inference/src/inference/ollama_provider.py
list_adapters
async
¶
list_adapters() -> list[str]
Return an empty list — Ollama has no adapter concept.
Returns:

| Type | Description |
|---|---|
| list[str] | Always returns an empty list. |
Example

>>> adapters = await provider.list_adapters()
>>> print(adapters)  # []
Source code in libs/inference/src/inference/ollama_provider.py
TransformersProvider ¶
TransformersProvider(
model_name: str = "",
device: str = "cpu",
torch_dtype: str = "auto",
)
Bases: InferenceProvider
InferenceProvider backed by HuggingFace transformers with PEFT LoRA.
Loads models locally via AutoModelForCausalLM. Adapters are applied via PEFT's PeftModel, which natively reads the safetensors format output by the hypernetwork.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| model_name | str | HuggingFace model ID or local path. | '' |
| device | str | Device to load model onto ('cpu', 'mps', 'cuda'). | 'cpu' |
| torch_dtype | str | Model dtype ('auto', 'float16', 'bfloat16'). | 'auto' |
Example

>>> provider = TransformersProvider(model_name="Qwen/Qwen2.5-Coder-0.5B")
>>> result = await provider.generate("def hello", model="ignored")
Initialize TransformersProvider.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| model_name | str | HuggingFace model ID or local path. | '' |
| device | str | Device to load model onto. | 'cpu' |
| torch_dtype | str | Model dtype string. | 'auto' |
Source code in libs/inference/src/inference/transformers_provider.py
Functions¶
generate
async
¶
generate(
prompt: str,
model: str,
adapter_id: str | None = None,
max_tokens: int = 4096,
system_prompt: str | None = None,
temperature: float | None = None,
top_p: float | None = None,
repetition_penalty: float | None = None,
) -> GenerationResult
Generate text using transformers with optional PEFT adapter.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| prompt | str | The user-facing input prompt. | required |
| model | str | Ignored (model is set at construction). | required |
| adapter_id | str \| None | LoRA adapter ID to activate. Must be loaded via load_adapter() before use. | None |
| max_tokens | int | Maximum tokens to generate. | 4096 |
| system_prompt | str \| None | Optional system-level instruction prepended via the tokenizer's chat template when available. | None |
| temperature | float \| None | Sampling temperature (default from pipeline config). | None |
| top_p | float \| None | Nucleus sampling threshold (default from pipeline config). | None |
| repetition_penalty | float \| None | Repetition penalty (default 1.0 = off). | None |

Returns:

| Type | Description |
|---|---|
| GenerationResult | GenerationResult with generated text and metadata. |

Raises:

| Type | Description |
|---|---|
| ValueError | If adapter_id is provided but has not been loaded. |
Source code in libs/inference/src/inference/transformers_provider.py
load_adapter
async
¶
load_adapter(adapter_id: str, adapter_path: str) -> None
Register a PEFT adapter directory for use during generation.
The adapter directory must contain adapter_model.safetensors and adapter_config.json as output by save_hypernetwork_adapter().
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| adapter_id | str | Unique name for the adapter. | required |
| adapter_path | str | Path to the PEFT adapter directory. | required |
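Since the adapter directory must contain both files before loading, a pre-flight check is easy to sketch. validate_adapter_dir is a hypothetical helper, not a library function; the required file names come from the description above.

```python
import tempfile
from pathlib import Path

# File names required by PEFT, as produced by save_hypernetwork_adapter()
REQUIRED_FILES = ("adapter_model.safetensors", "adapter_config.json")


def validate_adapter_dir(adapter_path: str) -> list[str]:
    """Return the required adapter files missing from the directory."""
    root = Path(adapter_path)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


with tempfile.TemporaryDirectory() as d:
    (Path(d) / "adapter_config.json").write_text("{}")
    print(validate_adapter_dir(d))  # ['adapter_model.safetensors']
```

Running such a check before load_adapter() gives a clearer error than letting PEFT fail on a missing file.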
Source code in libs/inference/src/inference/transformers_provider.py
unload_adapter
async
¶
unload_adapter(adapter_id: str) -> None
Remove a registered adapter, freeing GPU memory.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| adapter_id | str | The adapter name to remove. | required |
Source code in libs/inference/src/inference/transformers_provider.py
list_adapters
async
¶
list_adapters() -> list[str]
List all registered adapter IDs.
Returns:

| Type | Description |
|---|---|
| list[str] | Sorted list of registered adapter IDs. |
Source code in libs/inference/src/inference/transformers_provider.py
VLLMProvider ¶
VLLMProvider(base_url: str | None = None)
Bases: InferenceProvider
InferenceProvider backed by a vLLM server with LoRA hot-loading support.
Communicates with vLLM via two channels
- AsyncOpenAI SDK for generation (OpenAI-compatible endpoint).
- httpx for LoRA adapter management (vLLM proprietary endpoints).
Adapter tracking is maintained in an internal set to work around vLLM bug #11761 (list_lora_adapters unreliable after concurrent loads).
Attributes:

| Name | Type | Description |
|---|---|---|
| _client | | AsyncOpenAI client pointing at the vLLM server. |
| _base_url | | Base URL string for constructing adapter management URLs. |
| _loaded_adapters | set[str] | Set of currently tracked adapter IDs. |
Example

>>> provider = VLLMProvider(base_url="http://localhost:8100/v1")
>>> result = await provider.generate("def hello", model="Qwen2.5-Coder-7B")
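The tracking-set workaround amounts to keeping the adapter registry client-side instead of trusting the server's list endpoint. A minimal sketch (AdapterRegistry is a hypothetical name; the real provider keeps the set directly on the instance):

```python
class AdapterRegistry:
    """Client-side record of loaded adapters, maintained because vLLM's
    list endpoint is unreliable after concurrent loads (vLLM bug #11761)."""

    def __init__(self) -> None:
        self._loaded: set[str] = set()

    def on_load_success(self, adapter_id: str) -> None:
        # Add only after the server has confirmed the load succeeded.
        self._loaded.add(adapter_id)

    def on_unload(self, adapter_id: str) -> None:
        self._loaded.discard(adapter_id)

    def list(self) -> list[str]:
        # Sorted for a deterministic order, matching list_adapters().
        return sorted(self._loaded)


reg = AdapterRegistry()
reg.on_load_success("adapter-002")
reg.on_load_success("adapter-001")
print(reg.list())  # ['adapter-001', 'adapter-002']
```

The trade-off: if the server restarts, this client-side view can drift from reality until adapters are re-loaded.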
Initialize VLLMProvider with an AsyncOpenAI client.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| base_url | str \| None | Override URL for the vLLM server. Defaults to VLLM_BASE_URL env var or http://localhost:8100/v1. | None |
Source code in libs/inference/src/inference/vllm_provider.py
Functions¶
generate
async
¶
generate(
prompt: str,
model: str,
adapter_id: str | None = None,
max_tokens: int = 4096,
system_prompt: str | None = None,
temperature: float | None = None,
top_p: float | None = None,
repetition_penalty: float | None = None,
) -> GenerationResult
Generate text from a prompt, optionally using a loaded LoRA adapter.
When adapter_id is provided, it is passed as the model parameter to the OpenAI API — this is how vLLM identifies and routes to loaded LoRA adapters (the adapter is referenced by its lora_name).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| prompt | str | The user-facing input prompt. | required |
| model | str | Base model identifier. Used as-is when no adapter is given. | required |
| adapter_id | str \| None | Name of a loaded LoRA adapter to apply. When set, this value replaces model in the API call. | None |
| max_tokens | int | Maximum number of tokens to generate. | 4096 |
| system_prompt | str \| None | Optional system-level instruction. | None |
| temperature | float \| None | Sampling temperature override. | None |
| top_p | float \| None | Nucleus sampling threshold override. | None |
| repetition_penalty | float \| None | Repetition penalty override. | None |

Returns:

| Type | Description |
|---|---|
| GenerationResult | GenerationResult with the generated text and metadata. |
Example

>>> result = await provider.generate("def fib", model="Qwen2.5-Coder-7B")
>>> print(result.text)
Source code in libs/inference/src/inference/vllm_provider.py
load_adapter
async
¶
load_adapter(adapter_id: str, adapter_path: str) -> None
Load a LoRA adapter into the vLLM server.
Posts to vLLM's /v1/load_lora_adapter endpoint and adds the adapter to the internal tracking set on success.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| adapter_id | str | Unique name for the adapter (used as lora_name). | required |
| adapter_path | str | Filesystem path to the adapter weights directory. | required |

Raises:

| Type | Description |
|---|---|
| HTTPStatusError | If the vLLM server returns an error response. |
Example
await provider.load_adapter("adapter-001", "/models/adapter-001")
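The request the provider sends can be sketched as a URL plus payload builder. load_adapter_request is a hypothetical helper; the field names follow vLLM's dynamic LoRA API (lora_name / lora_path) as described above.

```python
def load_adapter_request(base_url: str, adapter_id: str,
                         adapter_path: str) -> tuple[str, dict]:
    """Build the URL and JSON body for vLLM's /v1/load_lora_adapter call."""
    url = f"{base_url.rstrip('/')}/load_lora_adapter"
    payload = {"lora_name": adapter_id, "lora_path": adapter_path}
    return url, payload


url, payload = load_adapter_request(
    "http://localhost:8100/v1", "adapter-001", "/models/adapter-001"
)
print(url)  # http://localhost:8100/v1/load_lora_adapter
```

In the real provider this body is POSTed with httpx and the adapter is added to the tracking set only after a success response.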
Source code in libs/inference/src/inference/vllm_provider.py
unload_adapter
async
¶
unload_adapter(adapter_id: str) -> None
Unload a LoRA adapter from the vLLM server.
Posts to vLLM's /v1/unload_lora_adapter endpoint and removes the adapter from the internal tracking set.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| adapter_id | str | Name of the adapter to unload. | required |

Raises:

| Type | Description |
|---|---|
| HTTPStatusError | If the vLLM server returns an error response. |
Example
await provider.unload_adapter("adapter-001")
Source code in libs/inference/src/inference/vllm_provider.py
list_adapters
async
¶
list_adapters() -> list[str]
List all currently loaded LoRA adapters.
Returns the internal tracking set rather than querying vLLM to avoid the unreliable list endpoint (vLLM bug #11761).
Returns:

| Type | Description |
|---|---|
| list[str] | Sorted list of adapter IDs currently tracked as loaded. |
Example

>>> adapters = await provider.list_adapters()
>>> print(adapters)  # ["adapter-001", "adapter-002"]
Source code in libs/inference/src/inference/vllm_provider.py
Functions¶
get_provider ¶
get_provider(
provider_type: str | None = None,
base_url: str | None = None,
) -> InferenceProvider
Return a cached InferenceProvider for the given backend.
Resolves the provider type from the argument or the INFERENCE_PROVIDER env var (default: "vllm"). Resolves the base URL from the argument or the per-backend env var (VLLM_BASE_URL / OLLAMA_BASE_URL). Instances are cached by the (provider_type, base_url) tuple so repeated calls with the same arguments return the identical object.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| provider_type | str \| None | One of "vllm", "ollama", or "llamacpp". If None, falls back to the INFERENCE_PROVIDER environment variable (default "vllm"). | None |
| base_url | str \| None | Override URL for the backend server. If None, the per-backend default env var is used. | None |

Returns:

| Type | Description |
|---|---|
| InferenceProvider | A cached InferenceProvider instance for the requested backend. |

Raises:

| Type | Description |
|---|---|
| ValueError | If provider_type is not "vllm", "ollama", or "llamacpp". |
Example

>>> provider = get_provider("vllm")
>>> isinstance(provider, VLLMProvider)
True
Source code in libs/inference/src/inference/factory.py
get_provider_for_step ¶
get_provider_for_step(
step_config: dict[str, str],
) -> InferenceProvider
Return a cached InferenceProvider configured from a step config dict.
Reads "provider" and optionally "base_url" from the step config and delegates to get_provider(). Designed for use by the agent loop where each pipeline step may specify its own provider and server URL.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| step_config | dict[str, str] | Dict with optional keys: "provider" (provider type, e.g. "vllm" or "ollama") and "base_url" (override URL for the backend server). | required |

Returns:

| Type | Description |
|---|---|
| InferenceProvider | A cached InferenceProvider instance for the step's backend. |

Raises:

| Type | Description |
|---|---|
| ValueError | If the provider type in step_config is not supported. |
Example

>>> provider = get_provider_for_step({"provider": "ollama"})
>>> isinstance(provider, OllamaProvider)
True
Source code in libs/inference/src/inference/factory.py
Modules¶
exceptions ¶
Custom exceptions for the inference library.
Classes¶
UnsupportedOperationError ¶
Bases: Exception
Raised when a provider does not support the requested operation.
Used primarily by OllamaProvider to signal that LoRA adapter operations are not available for Ollama-based inference.
Example
raise UnsupportedOperationError("OllamaProvider does not support adapters.")
factory ¶
Provider factory with instance cache for the inference library.
Selects between VLLMProvider and OllamaProvider based on configuration, caching instances by (provider_type, base_url) to avoid redundant construction.
Functions¶
get_provider ¶
get_provider(
provider_type: str | None = None,
base_url: str | None = None,
) -> InferenceProvider
Return a cached InferenceProvider for the given backend.
Resolves the provider type from the argument or the INFERENCE_PROVIDER env var (default: "vllm"). Resolves the base URL from the argument or the per-backend env var (VLLM_BASE_URL / OLLAMA_BASE_URL). Instances are cached by the (provider_type, base_url) tuple so repeated calls with the same arguments return the identical object.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| provider_type | str \| None | One of "vllm", "ollama", or "llamacpp". If None, falls back to the INFERENCE_PROVIDER environment variable (default "vllm"). | None |
| base_url | str \| None | Override URL for the backend server. If None, the per-backend default env var is used. | None |

Returns:

| Type | Description |
|---|---|
| InferenceProvider | A cached InferenceProvider instance for the requested backend. |

Raises:

| Type | Description |
|---|---|
| ValueError | If provider_type is not "vllm", "ollama", or "llamacpp". |
Example

>>> provider = get_provider("vllm")
>>> isinstance(provider, VLLMProvider)
True
Source code in libs/inference/src/inference/factory.py
get_provider_for_step ¶
get_provider_for_step(
step_config: dict[str, str],
) -> InferenceProvider
Return a cached InferenceProvider configured from a step config dict.
Reads "provider" and optionally "base_url" from the step config and delegates to get_provider(). Designed for use by the agent loop where each pipeline step may specify its own provider and server URL.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| step_config | dict[str, str] | Dict with optional keys: "provider" (provider type, e.g. "vllm" or "ollama") and "base_url" (override URL for the backend server). | required |

Returns:

| Type | Description |
|---|---|
| InferenceProvider | A cached InferenceProvider instance for the step's backend. |

Raises:

| Type | Description |
|---|---|
| ValueError | If the provider type in step_config is not supported. |
Example

>>> provider = get_provider_for_step({"provider": "ollama"})
>>> isinstance(provider, OllamaProvider)
True
Source code in libs/inference/src/inference/factory.py
llamacpp_provider ¶
LlamaCppProvider: InferenceProvider using llama-cpp-python with LoRA support.
Loads GGUF models via llama_cpp.Llama with optional LoRA adapter paths. Designed for Apple Silicon (Metal) local inference where adapter hot-loading is needed — Ollama cannot load LoRA adapters, and vLLM requires a server.
IMPORTANT: llama_cpp is imported inside method bodies per INFRA-05 pattern so that this module is importable in CPU-only CI without llama-cpp-python.
Classes¶
LlamaCppProvider ¶
LlamaCppProvider(
model_path: str | None = None,
n_ctx: int = 4096,
n_gpu_layers: int = -1,
)
Bases: InferenceProvider
InferenceProvider backed by llama-cpp-python with native LoRA support.
Unlike OllamaProvider, this loads GGUF models directly and can apply LoRA adapters at load time. Unlike VLLMProvider, no server is needed.
The model is loaded lazily on first generate() call. When an adapter is loaded, the model is reloaded with the LoRA path applied.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| model_path | str \| None | Path to the GGUF model file. | None |
| n_ctx | int | Context window size. | 4096 |
| n_gpu_layers | int | Layers to offload to GPU (-1 = all). | -1 |
Example

>>> provider = LlamaCppProvider(model_path="/models/qwen2.5-coder-1.5b.gguf")
>>> result = await provider.generate("def hello", model="ignored")
Initialize LlamaCppProvider with model configuration.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| model_path | str \| None | Path to the GGUF model file. | None |
| n_ctx | int | Context window size. | 4096 |
| n_gpu_layers | int | Layers to offload to GPU (-1 = all). | -1 |
Source code in libs/inference/src/inference/llamacpp_provider.py
generate
async
¶
generate(
prompt: str,
model: str,
adapter_id: str | None = None,
max_tokens: int = 4096,
system_prompt: str | None = None,
temperature: float | None = None,
top_p: float | None = None,
repetition_penalty: float | None = None,
) -> GenerationResult
Generate text using llama-cpp-python with optional LoRA adapter.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| prompt | str | The user-facing input prompt. | required |
| model | str | Ignored (model is set at construction via model_path). | required |
| adapter_id | str \| None | LoRA adapter ID to apply. Must be loaded via load_adapter() before use. | None |
| max_tokens | int | Maximum tokens to generate. | 4096 |
| system_prompt | str \| None | Optional system-level instruction. Prepended to the prompt for llama.cpp (no native chat template support). | None |
| temperature | float \| None | Sampling temperature override. | None |
| top_p | float \| None | Nucleus sampling threshold override. | None |
| repetition_penalty | float \| None | Repetition penalty override. | None |

Returns:

| Type | Description |
|---|---|
| GenerationResult | GenerationResult with generated text and metadata. |

Raises:

| Type | Description |
|---|---|
| ValueError | If adapter_id is provided but has not been loaded. |
Source code in libs/inference/src/inference/llamacpp_provider.py
load_adapter async ¶
load_adapter(adapter_id: str, adapter_path: str) -> None
Register a LoRA adapter for use during generation.
The adapter is applied on next generate() call by reloading the model with the LoRA path. llama-cpp-python applies LoRA at model load time.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| adapter_id | str | Unique name for the adapter. | required |
| adapter_path | str | Filesystem path to the LoRA adapter weights (GGUF format). | required |
Source code in libs/inference/src/inference/llamacpp_provider.py
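The register-now, apply-on-next-generate contract described above can be sketched as a small registry. This is a hypothetical helper illustrating the documented behaviour, not the provider's actual internals:

```python
class AdapterRegistry:
    """Tracks registered LoRA adapter paths and whether a model
    reload is needed before the next generate() call."""

    def __init__(self) -> None:
        self._adapters: dict[str, str] = {}   # adapter_id -> GGUF path
        self._active: str | None = None       # adapter baked into the loaded model
        self.needs_reload = False

    def load_adapter(self, adapter_id: str, adapter_path: str) -> None:
        self._adapters[adapter_id] = adapter_path

    def unload_adapter(self, adapter_id: str) -> None:
        self._adapters.pop(adapter_id, None)
        if self._active == adapter_id:
            # Active adapter removed: reload without it next time.
            self._active = None
            self.needs_reload = True

    def activate(self, adapter_id: str) -> str:
        """Return the adapter path, flagging a reload if it changed."""
        if adapter_id not in self._adapters:
            raise ValueError(f"Adapter {adapter_id!r} has not been loaded.")
        if adapter_id != self._active:
            self._active = adapter_id
            self.needs_reload = True  # llama.cpp applies LoRA at load time
        return self._adapters[adapter_id]

    def list_adapters(self) -> list[str]:
        return sorted(self._adapters)
```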
unload_adapter async ¶
unload_adapter(adapter_id: str) -> None
Remove a registered LoRA adapter.
If the currently active adapter is unloaded, the model will reload without it on the next generate() call.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| adapter_id | str | The adapter name to remove. | required |
Source code in libs/inference/src/inference/llamacpp_provider.py
list_adapters async ¶
list_adapters() -> list[str]
List all registered LoRA adapter IDs.
Returns:
| Type | Description |
|---|---|
| list[str] | Sorted list of registered adapter IDs. |
Source code in libs/inference/src/inference/llamacpp_provider.py
ollama_provider ¶
OllamaProvider: InferenceProvider implementation backed by an Ollama server.
Uses Ollama's OpenAI-compatible endpoint for generation. Adapter operations raise UnsupportedOperationError since Ollama has no LoRA adapter concept.
Classes¶
OllamaProvider ¶
OllamaProvider(base_url: str | None = None)
Bases: InferenceProvider
InferenceProvider backed by an Ollama server.
Uses Ollama's OpenAI-compatible API (/v1/chat/completions) for generation, keeping the HTTP layer symmetrical with VLLMProvider. Adapter operations are not supported — calling them raises UnsupportedOperationError.
Note
Ollama requires a non-empty api_key but ignores its value. The string "ollama" is used by convention.
Attributes:
| Name | Type | Description |
|---|---|---|
| _client | | AsyncOpenAI client pointing at the Ollama server. |
Example
provider = OllamaProvider(base_url="http://localhost:11434/v1")
result = await provider.generate("def hello", model="qwen2.5-coder:7b")
Initialize OllamaProvider with an AsyncOpenAI client.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| base_url | str \| None | Override URL for the Ollama server. Defaults to OLLAMA_BASE_URL env var or http://localhost:11434/v1. | None |
Source code in libs/inference/src/inference/ollama_provider.py
generate async ¶
generate(
prompt: str,
model: str,
adapter_id: str | None = None,
max_tokens: int = 4096,
system_prompt: str | None = None,
temperature: float | None = None,
top_p: float | None = None,
repetition_penalty: float | None = None,
) -> GenerationResult
Generate text from a prompt using the base Ollama model.
If adapter_id is provided, a warning is logged and it is ignored — Ollama does not support LoRA adapters. The base model is always used.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| prompt | str | The user-facing input prompt. | required |
| model | str | The Ollama model identifier (e.g. "qwen2.5-coder:7b"). | required |
| adapter_id | str \| None | Ignored. If provided, a warning is logged. | None |
| max_tokens | int | Maximum number of tokens to generate. | 4096 |
| system_prompt | str \| None | Optional system-level instruction. | None |
| temperature | float \| None | Sampling temperature override. | None |
| top_p | float \| None | Nucleus sampling threshold override. | None |
| repetition_penalty | float \| None | Repetition penalty override. | None |
Returns:
| Type | Description |
|---|---|
| GenerationResult | GenerationResult with adapter_id=None (Ollama has no adapter concept). |
Example
result = await provider.generate("def fib", model="qwen2.5-coder:7b")
print(result.text)
Source code in libs/inference/src/inference/ollama_provider.py
load_adapter async ¶
load_adapter(adapter_id: str, adapter_path: str) -> None
Not supported by Ollama. Always raises UnsupportedOperationError.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| adapter_id | str | Unused. | required |
| adapter_path | str | Unused. | required |
Raises:
| Type | Description |
|---|---|
| UnsupportedOperationError | Always; Ollama does not support LoRA adapter loading. Use VLLMProvider for adapter operations. |
Example
await provider.load_adapter("adapter-001", "/models/adapter-001")
# Raises UnsupportedOperationError
Source code in libs/inference/src/inference/ollama_provider.py
unload_adapter async ¶
unload_adapter(adapter_id: str) -> None
Not supported by Ollama. Always raises UnsupportedOperationError.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| adapter_id | str | Unused. | required |
Raises:
| Type | Description |
|---|---|
| UnsupportedOperationError | Always; Ollama does not support LoRA adapter unloading. |
Source code in libs/inference/src/inference/ollama_provider.py
list_adapters async ¶
list_adapters() -> list[str]
Return an empty list — Ollama has no adapter concept.
Returns:
| Type | Description |
|---|---|
| list[str] | Always returns an empty list. |
Example
adapters = await provider.list_adapters()
print(adapters)  # []
Source code in libs/inference/src/inference/ollama_provider.py
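Callers that must work across backends can treat UnsupportedOperationError as a capability probe. A sketch of that pattern, with a stub standing in for OllamaProvider and a locally defined exception mirroring the library's:

```python
import asyncio


class UnsupportedOperationError(Exception):
    """Local stand-in for inference.UnsupportedOperationError."""


class StubOllamaProvider:
    """Stub mimicking OllamaProvider's documented adapter behaviour."""

    async def load_adapter(self, adapter_id: str, adapter_path: str) -> None:
        raise UnsupportedOperationError(
            "OllamaProvider does not support adapters."
        )


async def load_if_supported(provider, adapter_id: str, path: str) -> bool:
    """Return True if the adapter was loaded, False if unsupported."""
    try:
        await provider.load_adapter(adapter_id, path)
        return True
    except UnsupportedOperationError:
        return False


loaded = asyncio.run(
    load_if_supported(StubOllamaProvider(), "adapter-001", "/models/adapter-001")
)
```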
provider ¶
Abstract base class and shared types for inference providers.
Defines the provider-agnostic API that the agent loop consumes. Concrete implementations (VLLMProvider, OllamaProvider) fulfil this interface for their respective backends.
Classes¶
GenerationResult
dataclass
¶
GenerationResult(
text: str,
model: str,
adapter_id: str | None,
token_count: int,
finish_reason: str,
)
Structured result returned by InferenceProvider.generate().
Attributes:
| Name | Type | Description |
|---|---|---|
| text | str | The generated text output from the model. |
| model | str | The model identifier used for generation. |
| adapter_id | str \| None | The LoRA adapter applied during generation, or None if no adapter was used. |
| token_count | int | Total number of tokens consumed (prompt + completion). |
| finish_reason | str | Reason generation stopped (e.g. "stop", "length"). |
Example
result = GenerationResult( ... text="def hello(): pass", ... model="Qwen/Qwen2.5-Coder-7B", ... adapter_id=None, ... token_count=10, ... finish_reason="stop", ... )
InferenceProvider ¶
Bases: ABC
Abstract base class for inference providers.
Defines a provider-agnostic API for text generation and LoRA adapter lifecycle management. All methods are async because every provider communicates over HTTP.
Concrete implementations
- VLLMProvider: Full LoRA support via vLLM's dynamic loading API.
- OllamaProvider: Base-model inference only; adapter ops raise UnsupportedOperationError.
generate abstractmethod async ¶
generate(
prompt: str,
model: str,
adapter_id: str | None = None,
max_tokens: int = 4096,
system_prompt: str | None = None,
temperature: float | None = None,
top_p: float | None = None,
repetition_penalty: float | None = None,
) -> GenerationResult
Generate text from a prompt.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| prompt | str | The user-facing input prompt. | required |
| model | str | The model identifier to use for generation. | required |
| adapter_id | str \| None | Optional LoRA adapter to apply during generation. If None, uses the base model directly. | None |
| max_tokens | int | Maximum number of tokens to generate. | 4096 |
| system_prompt | str \| None | Optional system-level instruction. Providers that support chat templates will format this as a system message. | None |
| temperature | float \| None | Sampling temperature override. | None |
| top_p | float \| None | Nucleus sampling threshold override. | None |
| repetition_penalty | float \| None | Repetition penalty override. | None |
|
Returns:
| Type | Description |
|---|---|
| GenerationResult | A GenerationResult containing the generated text and metadata. |
Example
result = await provider.generate("def hello", model="Qwen2.5-Coder-7B")
print(result.text)
Source code in libs/inference/src/inference/provider.py
load_adapter abstractmethod async ¶
load_adapter(adapter_id: str, adapter_path: str) -> None
Load a LoRA adapter into the inference server.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| adapter_id | str | Unique name for the adapter (used as lora_name in vLLM). | required |
| adapter_path | str | Filesystem path to the adapter weights directory. | required |
Raises:
| Type | Description |
|---|---|
| UnsupportedOperationError | If the provider does not support adapters. |
| HTTPStatusError | If the server returns an error response. |
Example
await provider.load_adapter("adapter-001", "/models/adapter-001")
Source code in libs/inference/src/inference/provider.py
unload_adapter abstractmethod async ¶
unload_adapter(adapter_id: str) -> None
Unload a previously loaded LoRA adapter.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| adapter_id | str | The adapter name to remove from the server. | required |
Raises:
| Type | Description |
|---|---|
| UnsupportedOperationError | If the provider does not support adapters. |
| HTTPStatusError | If the server returns an error response. |
Example
await provider.unload_adapter("adapter-001")
Source code in libs/inference/src/inference/provider.py
list_adapters abstractmethod async ¶
list_adapters() -> list[str]
List all currently loaded LoRA adapters.
Returns:
| Type | Description |
|---|---|
| list[str] | Sorted list of adapter IDs currently available for inference. Returns an empty list if no adapters are loaded or the provider does not support adapters. |
Example
adapters = await provider.list_adapters()
print(adapters)  # ["adapter-001", "adapter-002"]
Source code in libs/inference/src/inference/provider.py
transformers_provider ¶
TransformersProvider: InferenceProvider using HuggingFace transformers + PEFT.
Loads models via AutoModelForCausalLM and applies LoRA adapters via PEFT. This is the only provider that natively supports PEFT-format adapters (safetensors) as output by the hypernetwork.
IMPORTANT: transformers, torch, and peft are imported inside method bodies per INFRA-05 pattern so that this module is importable in CPU-only CI.
Classes¶
TransformersProvider ¶
TransformersProvider(
model_name: str = "",
device: str = "cpu",
torch_dtype: str = "auto",
)
Bases: InferenceProvider
InferenceProvider backed by HuggingFace transformers with PEFT LoRA.
Loads models locally via AutoModelForCausalLM. Adapters are applied via PEFT's PeftModel, which natively reads the safetensors format output by the hypernetwork.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model_name | str | HuggingFace model ID or local path. | '' |
| device | str | Device to load model onto ('cpu', 'mps', 'cuda'). | 'cpu' |
| torch_dtype | str | Model dtype ('auto', 'float16', 'bfloat16'). | 'auto' |
Example
provider = TransformersProvider(model_name="Qwen/Qwen2.5-Coder-0.5B")
result = await provider.generate("def hello", model="ignored")
Initialize TransformersProvider.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model_name | str | HuggingFace model ID or local path. | '' |
| device | str | Device to load model onto. | 'cpu' |
| torch_dtype | str | Model dtype string. | 'auto' |
Source code in libs/inference/src/inference/transformers_provider.py
generate async ¶
generate(
prompt: str,
model: str,
adapter_id: str | None = None,
max_tokens: int = 4096,
system_prompt: str | None = None,
temperature: float | None = None,
top_p: float | None = None,
repetition_penalty: float | None = None,
) -> GenerationResult
Generate text using transformers with optional PEFT adapter.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| prompt | str | The user-facing input prompt. | required |
| model | str | Ignored (model is set at construction). | required |
| adapter_id | str \| None | LoRA adapter ID to activate. Must be loaded via load_adapter() before use. | None |
| max_tokens | int | Maximum tokens to generate. | 4096 |
| system_prompt | str \| None | Optional system-level instruction prepended via the tokenizer's chat template when available. | None |
| temperature | float \| None | Sampling temperature (default from pipeline config). | None |
| top_p | float \| None | Nucleus sampling threshold (default from pipeline config). | None |
| repetition_penalty | float \| None | Repetition penalty (default 1.0 = off). | None |
|
Returns:
| Type | Description |
|---|---|
| GenerationResult | GenerationResult with generated text and metadata. |
Raises:
| Type | Description |
|---|---|
| ValueError | If adapter_id is provided but has not been loaded. |
Source code in libs/inference/src/inference/transformers_provider.py
load_adapter async ¶
load_adapter(adapter_id: str, adapter_path: str) -> None
Register a PEFT adapter directory for use during generation.
The adapter directory must contain adapter_model.safetensors and adapter_config.json as output by save_hypernetwork_adapter().
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| adapter_id | str | Unique name for the adapter. | required |
| adapter_path | str | Path to the PEFT adapter directory. | required |
Source code in libs/inference/src/inference/transformers_provider.py
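The directory contract above can be checked before calling load_adapter(). A hypothetical validation helper; the two required filenames come from the docstring, but this helper itself is not part of the library:

```python
from pathlib import Path

# Files a PEFT adapter directory must contain, per the docstring above.
REQUIRED_FILES = ("adapter_model.safetensors", "adapter_config.json")


def validate_adapter_dir(adapter_path: str) -> None:
    """Raise ValueError unless the directory holds the files
    produced by save_hypernetwork_adapter()."""
    root = Path(adapter_path)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    if missing:
        raise ValueError(f"{adapter_path} is missing: {', '.join(missing)}")
```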
unload_adapter async ¶
unload_adapter(adapter_id: str) -> None
Remove a registered adapter, freeing GPU memory.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| adapter_id | str | The adapter name to remove. | required |
Source code in libs/inference/src/inference/transformers_provider.py
list_adapters async ¶
list_adapters() -> list[str]
List all registered adapter IDs.
Returns:
| Type | Description |
|---|---|
| list[str] | Sorted list of registered adapter IDs. |
Source code in libs/inference/src/inference/transformers_provider.py
vllm_provider ¶
VLLMProvider: InferenceProvider implementation backed by a vLLM server.
Uses the OpenAI-compatible API for generation and vLLM's proprietary LoRA management endpoints for hot-loading adapters at runtime.
Classes¶
VLLMProvider ¶
VLLMProvider(base_url: str | None = None)
Bases: InferenceProvider
InferenceProvider backed by a vLLM server with LoRA hot-loading support.
Communicates with vLLM via two channels:
- AsyncOpenAI SDK for generation (OpenAI-compatible endpoint).
- httpx for LoRA adapter management (vLLM proprietary endpoints).
Adapter tracking is maintained in an internal set to work around vLLM bug #11761 (list_lora_adapters unreliable after concurrent loads).
Attributes:
| Name | Type | Description |
|---|---|---|
| _client | | AsyncOpenAI client pointing at the vLLM server. |
| _base_url | | Base URL string for constructing adapter management URLs. |
| _loaded_adapters | set[str] | Set of currently tracked adapter IDs. |
Example
provider = VLLMProvider(base_url="http://localhost:8100/v1")
result = await provider.generate("def hello", model="Qwen2.5-Coder-7B")
Initialize VLLMProvider with an AsyncOpenAI client.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| base_url | str \| None | Override URL for the vLLM server. Defaults to VLLM_BASE_URL env var or http://localhost:8100/v1. | None |
Source code in libs/inference/src/inference/vllm_provider.py
generate async ¶
generate(
prompt: str,
model: str,
adapter_id: str | None = None,
max_tokens: int = 4096,
system_prompt: str | None = None,
temperature: float | None = None,
top_p: float | None = None,
repetition_penalty: float | None = None,
) -> GenerationResult
Generate text from a prompt, optionally using a loaded LoRA adapter.
When adapter_id is provided, it is passed as the model parameter to the OpenAI API — this is how vLLM identifies and routes to loaded LoRA adapters (the adapter is referenced by its lora_name).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| prompt | str | The user-facing input prompt. | required |
| model | str | Base model identifier. Used as-is when no adapter is given. | required |
| adapter_id | str \| None | Name of a loaded LoRA adapter to apply. When set, this value replaces model in the API call. | None |
| max_tokens | int | Maximum number of tokens to generate. | 4096 |
| system_prompt | str \| None | Optional system-level instruction. | None |
| temperature | float \| None | Sampling temperature override. | None |
| top_p | float \| None | Nucleus sampling threshold override. | None |
| repetition_penalty | float \| None | Repetition penalty override. | None |
|
Returns:
| Type | Description |
|---|---|
| GenerationResult | GenerationResult with the generated text and metadata. |
Example
result = await provider.generate("def fib", model="Qwen2.5-Coder-7B")
print(result.text)
Source code in libs/inference/src/inference/vllm_provider.py
load_adapter async ¶
load_adapter(adapter_id: str, adapter_path: str) -> None
Load a LoRA adapter into the vLLM server.
Posts to vLLM's /v1/load_lora_adapter endpoint and adds the adapter to the internal tracking set on success.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| adapter_id | str | Unique name for the adapter (used as lora_name). | required |
| adapter_path | str | Filesystem path to the adapter weights directory. | required |
Raises:
| Type | Description |
|---|---|
| HTTPStatusError | If the vLLM server returns an error response. |
Example
await provider.load_adapter("adapter-001", "/models/adapter-001")
Source code in libs/inference/src/inference/vllm_provider.py
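A sketch of the load call with httpx. The endpoint path and the lora_name / lora_path fields match vLLM's documented dynamic LoRA loading API; raising on non-2xx status via raise_for_status is an assumption about how the provider surfaces HTTPStatusError:

```python
def build_load_lora_request(base_url: str, adapter_id: str,
                            adapter_path: str) -> tuple[str, dict]:
    """Return the (url, json_body) pair for vLLM's load endpoint."""
    url = f"{base_url.rstrip('/')}/load_lora_adapter"
    return url, {"lora_name": adapter_id, "lora_path": adapter_path}


async def load_adapter(base_url: str, adapter_id: str, adapter_path: str) -> None:
    import httpx  # third-party; deferred so the builder stays import-free

    url, body = build_load_lora_request(base_url, adapter_id, adapter_path)
    async with httpx.AsyncClient() as client:
        resp = await client.post(url, json=body)
        resp.raise_for_status()  # maps server errors to HTTPStatusError
```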
unload_adapter async ¶
unload_adapter(adapter_id: str) -> None
Unload a LoRA adapter from the vLLM server.
Posts to vLLM's /v1/unload_lora_adapter endpoint and removes the adapter from the internal tracking set.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| adapter_id | str | Name of the adapter to unload. | required |
Raises:
| Type | Description |
|---|---|
| HTTPStatusError | If the vLLM server returns an error response. |
Example
await provider.unload_adapter("adapter-001")
Source code in libs/inference/src/inference/vllm_provider.py
list_adapters async ¶
list_adapters() -> list[str]
List all currently loaded LoRA adapters.
Returns the internal tracking set rather than querying vLLM to avoid the unreliable list endpoint (vLLM bug #11761).
Returns:
| Type | Description |
|---|---|
| list[str] | Sorted list of adapter IDs currently tracked as loaded. |
Example
adapters = await provider.list_adapters()
print(adapters)  # ["adapter-001", "adapter-002"]
Source code in libs/inference/src/inference/vllm_provider.py