API Reference for inference

inference

Inference provider library for LLM generation and LoRA adapter management.

Provides a provider-agnostic interface (InferenceProvider) with vLLM, Ollama, llama.cpp, and Transformers backends, a factory for backend selection by configuration, and structured generation results.

Provider classes (OllamaProvider, VLLMProvider) are lazily imported to avoid hard failures when the openai package is not installed (e.g. in CI).
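
A minimal end-to-end sketch, assuming a vLLM server is already running at the default VLLM_BASE_URL and that the openai package is installed; the model name is illustrative:

import asyncio

from inference.factory import get_provider


async def main() -> None:
    # Backend selected by argument (or via the INFERENCE_PROVIDER env var).
    provider = get_provider("vllm")
    result = await provider.generate(
        "def fibonacci(n: int) -> int:",
        model="Qwen/Qwen2.5-Coder-7B",  # illustrative model name
        max_tokens=256,
    )
    print(result.text, result.finish_reason)


asyncio.run(main())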

Classes

UnsupportedOperationError

Bases: Exception

Raised when a provider does not support the requested operation.

Used primarily by OllamaProvider to signal that LoRA adapter operations are not available for Ollama-based inference.

Example

raise UnsupportedOperationError("OllamaProvider does not support adapters.")

GenerationResult dataclass

GenerationResult(
    text: str,
    model: str,
    adapter_id: str | None,
    token_count: int,
    finish_reason: str,
)

Structured result returned by InferenceProvider.generate().

Attributes:

Name Type Description
text str

The generated text output from the model.

model str

The model identifier used for generation.

adapter_id str | None

The LoRA adapter applied during generation, or None if no adapter was used.

token_count int

Total number of tokens consumed (prompt + completion).

finish_reason str

Reason generation stopped (e.g. "stop", "length").

Example

result = GenerationResult(
    text="def hello(): pass",
    model="Qwen/Qwen2.5-Coder-7B",
    adapter_id=None,
    token_count=10,
    finish_reason="stop",
)

InferenceProvider

Bases: ABC

Abstract base class for inference providers.

Defines a provider-agnostic API for text generation and LoRA adapter lifecycle management. All methods are async because every provider communicates over HTTP.

Concrete implementations
  • VLLMProvider: Full LoRA support via vLLM's dynamic loading API.
  • OllamaProvider: Base-model inference only; adapter ops raise UnsupportedOperationError.
  • LlamaCppProvider: Local GGUF inference via llama-cpp-python; LoRA adapters applied at model load time.
  • TransformersProvider: Local HuggingFace inference with PEFT LoRA adapters.
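
Callers that may run against either backend can treat adapter support as optional; a small sketch using UnsupportedOperationError from inference.exceptions:

from inference.exceptions import UnsupportedOperationError
from inference.provider import InferenceProvider


async def try_load_adapter(
    provider: InferenceProvider, adapter_id: str, adapter_path: str
) -> bool:
    """Attempt to load an adapter, falling back to the base model if unsupported."""
    try:
        await provider.load_adapter(adapter_id, adapter_path)
    except UnsupportedOperationError:
        # e.g. OllamaProvider: adapter ops are unavailable; use the base model.
        return False
    return True
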
Functions
generate abstractmethod async
generate(
    prompt: str,
    model: str,
    adapter_id: str | None = None,
    max_tokens: int = 4096,
    system_prompt: str | None = None,
    temperature: float | None = None,
    top_p: float | None = None,
    repetition_penalty: float | None = None,
) -> GenerationResult

Generate text from a prompt.

Parameters:

Name Type Description Default
prompt str

The user-facing input prompt.

required
model str

The model identifier to use for generation.

required
adapter_id str | None

Optional LoRA adapter to apply during generation. If None, uses the base model directly.

None
max_tokens int

Maximum number of tokens to generate.

4096
system_prompt str | None

Optional system-level instruction. Providers that support chat templates will format this as a system message.

None
temperature float | None

Sampling temperature override.

None
top_p float | None

Nucleus sampling threshold override.

None
repetition_penalty float | None

Repetition penalty override.

None

Returns:

Type Description
GenerationResult

A GenerationResult containing the generated text and metadata.

Example

result = await provider.generate("def hello", model="Qwen2.5-Coder-7B") print(result.text)

Source code in libs/inference/src/inference/provider.py
@abstractmethod
async def generate(
    self,
    prompt: str,
    model: str,
    adapter_id: str | None = None,
    max_tokens: int = 4096,
    system_prompt: str | None = None,
    temperature: float | None = None,
    top_p: float | None = None,
    repetition_penalty: float | None = None,
) -> GenerationResult:
    """Generate text from a prompt.

    Args:
        prompt: The user-facing input prompt.
        model: The model identifier to use for generation.
        adapter_id: Optional LoRA adapter to apply during generation.
            If None, uses the base model directly.
        max_tokens: Maximum number of tokens to generate.
        system_prompt: Optional system-level instruction. Providers that
            support chat templates will format this as a system message.

        temperature: Sampling temperature override.
        top_p: Nucleus sampling threshold override.
        repetition_penalty: Repetition penalty override.

    Returns:
        A GenerationResult containing the generated text and metadata.

    Example:
        >>> result = await provider.generate("def hello", model="Qwen2.5-Coder-7B")
        >>> print(result.text)
    """
    ...
load_adapter abstractmethod async
load_adapter(adapter_id: str, adapter_path: str) -> None

Load a LoRA adapter into the inference server.

Parameters:

Name Type Description Default
adapter_id str

Unique name for the adapter (used as lora_name in vLLM).

required
adapter_path str

Filesystem path to the adapter weights directory.

required

Raises:

Type Description
UnsupportedOperationError

If the provider does not support adapters.

HTTPStatusError

If the server returns an error response.

Example

await provider.load_adapter("adapter-001", "/models/adapter-001")

Source code in libs/inference/src/inference/provider.py
@abstractmethod
async def load_adapter(self, adapter_id: str, adapter_path: str) -> None:
    """Load a LoRA adapter into the inference server.

    Args:
        adapter_id: Unique name for the adapter (used as lora_name in vLLM).
        adapter_path: Filesystem path to the adapter weights directory.

    Raises:
        UnsupportedOperationError: If the provider does not support adapters.
        httpx.HTTPStatusError: If the server returns an error response.

    Example:
        >>> await provider.load_adapter("adapter-001", "/models/adapter-001")
    """
    ...
unload_adapter abstractmethod async
unload_adapter(adapter_id: str) -> None

Unload a previously loaded LoRA adapter.

Parameters:

Name Type Description Default
adapter_id str

The adapter name to remove from the server.

required

Raises:

Type Description
UnsupportedOperationError

If the provider does not support adapters.

HTTPStatusError

If the server returns an error response.

Example

await provider.unload_adapter("adapter-001")

Source code in libs/inference/src/inference/provider.py
@abstractmethod
async def unload_adapter(self, adapter_id: str) -> None:
    """Unload a previously loaded LoRA adapter.

    Args:
        adapter_id: The adapter name to remove from the server.

    Raises:
        UnsupportedOperationError: If the provider does not support adapters.
        httpx.HTTPStatusError: If the server returns an error response.

    Example:
        >>> await provider.unload_adapter("adapter-001")
    """
    ...
list_adapters abstractmethod async
list_adapters() -> list[str]

List all currently loaded LoRA adapters.

Returns:

Type Description
list[str]

Sorted list of adapter IDs currently available for inference. Returns an empty list if no adapters are loaded or the provider does not support adapters.

Example

adapters = await provider.list_adapters()
print(adapters)  # ["adapter-001", "adapter-002"]

Source code in libs/inference/src/inference/provider.py
@abstractmethod
async def list_adapters(self) -> list[str]:
    """List all currently loaded LoRA adapters.

    Returns:
        Sorted list of adapter IDs currently available for inference.
        Returns an empty list if no adapters are loaded or the provider
        does not support adapters.

    Example:
        >>> adapters = await provider.list_adapters()
        >>> print(adapters)  # ["adapter-001", "adapter-002"]
    """
    ...

LlamaCppProvider

LlamaCppProvider(
    model_path: str | None = None,
    n_ctx: int = 4096,
    n_gpu_layers: int = -1,
)

Bases: InferenceProvider

InferenceProvider backed by llama-cpp-python with native LoRA support.

Unlike OllamaProvider, this loads GGUF models directly and can apply LoRA adapters at load time. Unlike VLLMProvider, no server is needed.

The model is loaded lazily on first generate() call. When an adapter is loaded, the model is reloaded with the LoRA path applied.
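
A sketch of the adapter round trip inside an async context; the GGUF paths are illustrative:

provider = LlamaCppProvider(model_path="/models/qwen2.5-coder-1.5b.gguf")

# Register the adapter; the model is (re)loaded with it on the next generate().
await provider.load_adapter("adapter-001", "/models/adapter-001.gguf")

result = await provider.generate(
    "def hello",
    model="ignored",            # the model comes from model_path
    adapter_id="adapter-001",
)
print(result.adapter_id)        # "adapter-001"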

Parameters:

Name Type Description Default
model_path str | None

Path to the GGUF model file.

None
n_ctx int

Context window size. Default: 4096.

4096
n_gpu_layers int

Layers to offload to GPU (-1 = all). Default: -1.

-1
Example

provider = LlamaCppProvider(model_path="/models/qwen2.5-coder-1.5b.gguf")
result = await provider.generate("def hello", model="ignored")

Initialize LlamaCppProvider with model configuration.

Parameters:

Name Type Description Default
model_path str | None

Path to the GGUF model file.

None
n_ctx int

Context window size. Default: 4096.

4096
n_gpu_layers int

Layers to offload to GPU (-1 = all). Default: -1.

-1
Source code in libs/inference/src/inference/llamacpp_provider.py
def __init__(
    self,
    model_path: str | None = None,
    n_ctx: int = 4096,
    n_gpu_layers: int = -1,
) -> None:
    """Initialize LlamaCppProvider with model configuration.

    Args:
        model_path: Path to the GGUF model file.
        n_ctx: Context window size. Default: 4096.
        n_gpu_layers: Layers to offload to GPU (-1 = all). Default: -1.
    """
    self._model_path = model_path or ""
    self._n_ctx = n_ctx
    self._n_gpu_layers = n_gpu_layers
    self._llm: Any = None
    self._current_lora: str | None = None
    self._loaded_adapters: dict[str, str] = {}  # id -> path
Functions
generate async
generate(
    prompt: str,
    model: str,
    adapter_id: str | None = None,
    max_tokens: int = 4096,
    system_prompt: str | None = None,
    temperature: float | None = None,
    top_p: float | None = None,
    repetition_penalty: float | None = None,
) -> GenerationResult

Generate text using llama-cpp-python with optional LoRA adapter.

Parameters:

Name Type Description Default
prompt str

The user-facing input prompt.

required
model str

Ignored (model is set at construction via model_path).

required
adapter_id str | None

LoRA adapter ID to apply. Must be loaded via load_adapter() before use.

None
max_tokens int

Maximum tokens to generate.

4096
system_prompt str | None

Optional system-level instruction. Prepended to the prompt for llama.cpp (no native chat template support).

None
temperature float | None

Sampling temperature override.

None
top_p float | None

Nucleus sampling threshold override.

None
repetition_penalty float | None

Repetition penalty override.

None

Returns:

Type Description
GenerationResult

GenerationResult with generated text and metadata.

Raises:

Type Description
ValueError

If adapter_id is provided but has not been loaded.

Source code in libs/inference/src/inference/llamacpp_provider.py
async def generate(
    self,
    prompt: str,
    model: str,
    adapter_id: str | None = None,
    max_tokens: int = 4096,
    system_prompt: str | None = None,
    temperature: float | None = None,
    top_p: float | None = None,
    repetition_penalty: float | None = None,
) -> GenerationResult:
    """Generate text using llama-cpp-python with optional LoRA adapter.

    Args:
        prompt: The user-facing input prompt.
        model: Ignored (model is set at construction via model_path).
        adapter_id: LoRA adapter ID to apply. Must be loaded via
            load_adapter() before use.
        max_tokens: Maximum tokens to generate.
        system_prompt: Optional system-level instruction. Prepended to
            the prompt for llama.cpp (no native chat template support).
        temperature: Sampling temperature override.
        top_p: Nucleus sampling threshold override.
        repetition_penalty: Repetition penalty override.

    Returns:
        GenerationResult with generated text and metadata.

    Raises:
        ValueError: If adapter_id is provided but has not been loaded.
    """
    lora_path: str | None = None
    if adapter_id:
        if adapter_id not in self._loaded_adapters:
            raise ValueError(
                f"Adapter '{adapter_id}' has not been loaded. "
                "Call load_adapter() first."
            )
        lora_path = self._loaded_adapters[adapter_id]

    self._load_model_if_needed(lora_path=lora_path)

    full_prompt = f"{system_prompt}\n\n{prompt}" if system_prompt else prompt
    response = self._llm(  # type: ignore[union-attr]
        full_prompt,
        max_tokens=max_tokens,
        stop=_STOP_SEQUENCES,
    )

    text = response["choices"][0]["text"]
    token_count = response["usage"]["total_tokens"]
    finish_reason = response["choices"][0].get("finish_reason", "stop")

    return GenerationResult(
        text=text,
        model=Path(self._model_path).stem,
        adapter_id=adapter_id,
        token_count=token_count,
        finish_reason=finish_reason,
    )
load_adapter async
load_adapter(adapter_id: str, adapter_path: str) -> None

Register a LoRA adapter for use during generation.

The adapter is applied on next generate() call by reloading the model with the LoRA path. llama-cpp-python applies LoRA at model load time.

Parameters:

Name Type Description Default
adapter_id str

Unique name for the adapter.

required
adapter_path str

Filesystem path to the LoRA adapter weights (GGUF format).

required
Source code in libs/inference/src/inference/llamacpp_provider.py
async def load_adapter(self, adapter_id: str, adapter_path: str) -> None:
    """Register a LoRA adapter for use during generation.

    The adapter is applied on next generate() call by reloading the model
    with the LoRA path. llama-cpp-python applies LoRA at model load time.

    Args:
        adapter_id: Unique name for the adapter.
        adapter_path: Filesystem path to the LoRA adapter weights (GGUF format).
    """
    self._loaded_adapters[adapter_id] = adapter_path
    logger.info("Registered adapter %s -> %s", adapter_id, adapter_path)
unload_adapter async
unload_adapter(adapter_id: str) -> None

Remove a registered LoRA adapter.

If the currently active adapter is unloaded, the model will reload without it on the next generate() call.

Parameters:

Name Type Description Default
adapter_id str

The adapter name to remove.

required
Source code in libs/inference/src/inference/llamacpp_provider.py
async def unload_adapter(self, adapter_id: str) -> None:
    """Remove a registered LoRA adapter.

    If the currently active adapter is unloaded, the model will reload
    without it on the next generate() call.

    Args:
        adapter_id: The adapter name to remove.
    """
    if adapter_id in self._loaded_adapters:
        was_active = self._current_lora == self._loaded_adapters[adapter_id]
        del self._loaded_adapters[adapter_id]
        if was_active:
            self._current_lora = None  # Force reload without adapter
        logger.info("Unloaded adapter %s", adapter_id)
list_adapters async
list_adapters() -> list[str]

List all registered LoRA adapter IDs.

Returns:

Type Description
list[str]

Sorted list of registered adapter IDs.

Source code in libs/inference/src/inference/llamacpp_provider.py
async def list_adapters(self) -> list[str]:
    """List all registered LoRA adapter IDs.

    Returns:
        Sorted list of registered adapter IDs.
    """
    return sorted(self._loaded_adapters.keys())

OllamaProvider

OllamaProvider(base_url: str | None = None)

Bases: InferenceProvider

InferenceProvider backed by an Ollama server.

Uses Ollama's OpenAI-compatible API (/v1/chat/completions) for generation, keeping the HTTP layer symmetrical with VLLMProvider. Adapter operations are not supported — calling them raises UnsupportedOperationError.

Note

Ollama requires a non-empty api_key but ignores its value. The string "ollama" is used by convention.

Attributes:

Name Type Description
_client

AsyncOpenAI client pointing at the Ollama server.

Example

provider = OllamaProvider(base_url="http://localhost:11434/v1")
result = await provider.generate("def hello", model="qwen2.5-coder:7b")

Initialize OllamaProvider with an AsyncOpenAI client.

Parameters:

Name Type Description Default
base_url str | None

Override URL for the Ollama server. Defaults to OLLAMA_BASE_URL env var or http://localhost:11434/v1.

None
Source code in libs/inference/src/inference/ollama_provider.py
def __init__(self, base_url: str | None = None) -> None:
    """Initialize OllamaProvider with an AsyncOpenAI client.

    Args:
        base_url: Override URL for the Ollama server. Defaults to
            OLLAMA_BASE_URL env var or http://localhost:11434/v1.
    """
    self._base_url = base_url or OLLAMA_BASE_URL
    # Ollama requires a non-empty api_key but ignores its value.
    self._client = AsyncOpenAI(
        base_url=self._base_url,
        api_key="ollama",
    )
Functions
generate async
generate(
    prompt: str,
    model: str,
    adapter_id: str | None = None,
    max_tokens: int = 4096,
    system_prompt: str | None = None,
    temperature: float | None = None,
    top_p: float | None = None,
    repetition_penalty: float | None = None,
) -> GenerationResult

Generate text from a prompt using the base Ollama model.

If adapter_id is provided, a warning is logged and it is ignored — Ollama does not support LoRA adapters. The base model is always used.

Parameters:

Name Type Description Default
prompt str

The user-facing input prompt.

required
model str

The Ollama model identifier (e.g. "qwen2.5-coder:7b").

required
adapter_id str | None

Ignored. If provided, a warning is logged.

None
max_tokens int

Maximum number of tokens to generate.

4096
system_prompt str | None

Optional system-level instruction.

None
temperature float | None

Sampling temperature override.

None
top_p float | None

Nucleus sampling threshold override.

None
repetition_penalty float | None

Repetition penalty override.

None

Returns:

Type Description
GenerationResult

GenerationResult with adapter_id=None (Ollama has no adapter concept).

Example

result = await provider.generate("def fib", model="qwen2.5-coder:7b") print(result.text)

Source code in libs/inference/src/inference/ollama_provider.py
async def generate(
    self,
    prompt: str,
    model: str,
    adapter_id: str | None = None,
    max_tokens: int = 4096,
    system_prompt: str | None = None,
    temperature: float | None = None,
    top_p: float | None = None,
    repetition_penalty: float | None = None,
) -> GenerationResult:
    """Generate text from a prompt using the base Ollama model.

    If adapter_id is provided, a warning is logged and it is ignored —
    Ollama does not support LoRA adapters. The base model is always used.

    Args:
        prompt: The user-facing input prompt.
        model: The Ollama model identifier (e.g. "qwen2.5-coder:7b").
        adapter_id: Ignored. If provided, a warning is logged.
        max_tokens: Maximum number of tokens to generate.
        system_prompt: Optional system-level instruction.
        temperature: Sampling temperature override.
        top_p: Nucleus sampling threshold override.
        repetition_penalty: Repetition penalty override.

    Returns:
        GenerationResult with adapter_id=None (Ollama has no adapter concept).

    Example:
        >>> result = await provider.generate("def fib", model="qwen2.5-coder:7b")
        >>> print(result.text)
    """
    if adapter_id is not None:
        logger.warning(
            "OllamaProvider ignoring adapter_id=%s; "
            "Ollama does not support LoRA adapters.",
            adapter_id,
        )

    logger.debug("generate: model=%s max_tokens=%d", model, max_tokens)

    messages: list[dict[str, str]] = [{"role": "user", "content": prompt}]
    if system_prompt:
        messages.insert(0, {"role": "system", "content": system_prompt})

    response = await self._client.chat.completions.create(
        model=model,
        messages=messages,  # type: ignore[arg-type]
        max_tokens=max_tokens,
    )

    choice = response.choices[0]
    return GenerationResult(
        text=choice.message.content or "",
        model=response.model,
        adapter_id=None,
        token_count=response.usage.total_tokens if response.usage else 0,
        finish_reason=choice.finish_reason or "stop",
    )
load_adapter async
load_adapter(adapter_id: str, adapter_path: str) -> None

Not supported by Ollama. Always raises UnsupportedOperationError.

Parameters:

Name Type Description Default
adapter_id str

Unused.

required
adapter_path str

Unused.

required

Raises:

Type Description
UnsupportedOperationError

Always — Ollama does not support LoRA adapter loading. Use VLLMProvider for adapter operations.

Example

await provider.load_adapter("adapter-001", "/models/adapter-001")
# Raises UnsupportedOperationError

Source code in libs/inference/src/inference/ollama_provider.py
async def load_adapter(self, adapter_id: str, adapter_path: str) -> None:
    """Not supported by Ollama. Always raises UnsupportedOperationError.

    Args:
        adapter_id: Unused.
        adapter_path: Unused.

    Raises:
        UnsupportedOperationError: Always — Ollama does not support
            LoRA adapter loading. Use VLLMProvider for adapter operations.

    Example:
        >>> await provider.load_adapter("adapter-001", "/models/adapter-001")
        # Raises UnsupportedOperationError
    """
    raise UnsupportedOperationError(
        "OllamaProvider does not support LoRA adapter loading. "
        "Use VLLMProvider for adapter operations."
    )
unload_adapter async
unload_adapter(adapter_id: str) -> None

Not supported by Ollama. Always raises UnsupportedOperationError.

Parameters:

Name Type Description Default
adapter_id str

Unused.

required

Raises:

Type Description
UnsupportedOperationError

Always — Ollama does not support LoRA adapter unloading.

Example

await provider.unload_adapter("adapter-001")
# Raises UnsupportedOperationError

Source code in libs/inference/src/inference/ollama_provider.py
async def unload_adapter(self, adapter_id: str) -> None:
    """Not supported by Ollama. Always raises UnsupportedOperationError.

    Args:
        adapter_id: Unused.

    Raises:
        UnsupportedOperationError: Always — Ollama does not support
            LoRA adapter unloading.

    Example:
        >>> await provider.unload_adapter("adapter-001")
        # Raises UnsupportedOperationError
    """
    raise UnsupportedOperationError(
        "OllamaProvider does not support LoRA adapter unloading."
    )
list_adapters async
list_adapters() -> list[str]

Return an empty list — Ollama has no adapter concept.

Returns:

Type Description
list[str]

Always returns an empty list.

Example

adapters = await provider.list_adapters()
print(adapters)  # []

Source code in libs/inference/src/inference/ollama_provider.py
async def list_adapters(self) -> list[str]:
    """Return an empty list — Ollama has no adapter concept.

    Returns:
        Always returns an empty list.

    Example:
        >>> adapters = await provider.list_adapters()
        >>> print(adapters)  # []
    """
    return []

TransformersProvider

TransformersProvider(
    model_name: str = "",
    device: str = "cpu",
    torch_dtype: str = "auto",
)

Bases: InferenceProvider

InferenceProvider backed by HuggingFace transformers with PEFT LoRA.

Loads models locally via AutoModelForCausalLM. Adapters are applied via PEFT's PeftModel, which natively reads the safetensors format output by the hypernetwork.
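
A sketch of serving a hypernetwork-produced PEFT adapter inside an async context; the model name and adapter path are illustrative:

provider = TransformersProvider(model_name="Qwen/Qwen2.5-Coder-0.5B", device="cpu")

# The directory must contain adapter_model.safetensors and adapter_config.json.
await provider.load_adapter("adapter-001", "/adapters/adapter-001")

result = await provider.generate(
    "def hello",
    model="ignored",            # the model is fixed at construction
    adapter_id="adapter-001",
    max_tokens=128,
)
print(result.adapter_id, result.finish_reason)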

Parameters:

Name Type Description Default
model_name str

HuggingFace model ID or local path.

''
device str

Device to load model onto ('cpu', 'mps', 'cuda').

'cpu'
torch_dtype str

Model dtype ('auto', 'float16', 'bfloat16').

'auto'
Example

provider = TransformersProvider(model_name="Qwen/Qwen2.5-Coder-0.5B")
result = await provider.generate("def hello", model="ignored")

Initialize TransformersProvider.

Parameters:

Name Type Description Default
model_name str

HuggingFace model ID or local path.

''
device str

Device to load model onto.

'cpu'
torch_dtype str

Model dtype string.

'auto'
Source code in libs/inference/src/inference/transformers_provider.py
def __init__(
    self,
    model_name: str = "",
    device: str = "cpu",
    torch_dtype: str = "auto",
) -> None:
    """Initialize TransformersProvider.

    Args:
        model_name: HuggingFace model ID or local path.
        device: Device to load model onto.
        torch_dtype: Model dtype string.
    """
    self._model_name = model_name
    self._device = device
    self._torch_dtype = torch_dtype
    self._model: Any = None
    self._tokenizer: Any = None
    self._base_model: Any = None
    self._loaded_adapters: dict[str, str] = {}  # id -> path
    self._active_adapter: str | None = None
    self._is_peft_wrapped: bool = False
Functions
generate async
generate(
    prompt: str,
    model: str,
    adapter_id: str | None = None,
    max_tokens: int = 4096,
    system_prompt: str | None = None,
    temperature: float | None = None,
    top_p: float | None = None,
    repetition_penalty: float | None = None,
) -> GenerationResult

Generate text using transformers with optional PEFT adapter.

Parameters:

Name Type Description Default
prompt str

The user-facing input prompt.

required
model str

Ignored (model is set at construction).

required
adapter_id str | None

LoRA adapter ID to activate. Must be loaded via load_adapter() before use.

None
max_tokens int

Maximum tokens to generate.

4096
system_prompt str | None

Optional system-level instruction prepended via the tokenizer's chat template when available.

None
temperature float | None

Sampling temperature (default from pipeline config).

None
top_p float | None

Nucleus sampling threshold (default from pipeline config).

None
repetition_penalty float | None

Repetition penalty (default 1.0 = off).

None

Returns:

Type Description
GenerationResult

GenerationResult with generated text and metadata.

Raises:

Type Description
ValueError

If adapter_id is provided but has not been loaded.

Source code in libs/inference/src/inference/transformers_provider.py
async def generate(
    self,
    prompt: str,
    model: str,
    adapter_id: str | None = None,
    max_tokens: int = 4096,
    system_prompt: str | None = None,
    temperature: float | None = None,
    top_p: float | None = None,
    repetition_penalty: float | None = None,
) -> GenerationResult:
    """Generate text using transformers with optional PEFT adapter.

    Args:
        prompt: The user-facing input prompt.
        model: Ignored (model is set at construction).
        adapter_id: LoRA adapter ID to activate. Must be loaded via
            load_adapter() before use.
        max_tokens: Maximum tokens to generate.
        system_prompt: Optional system-level instruction prepended via
            the tokenizer's chat template when available.
        temperature: Sampling temperature (default from pipeline config).
        top_p: Nucleus sampling threshold (default from pipeline config).
        repetition_penalty: Repetition penalty (default 1.0 = off).

    Returns:
        GenerationResult with generated text and metadata.

    Raises:
        ValueError: If adapter_id is provided but has not been loaded.
    """
    import torch  # noqa: PLC0415

    self._load_model_if_needed()

    # Apply defaults from pipeline config
    if temperature is None:
        temperature = float(os.environ.get("RUNE_TEMPERATURE", "0.25"))
    if top_p is None:
        top_p = float(os.environ.get("RUNE_TOP_P", "0.9"))
    if repetition_penalty is None:
        repetition_penalty = float(
            os.environ.get("RUNE_REPETITION_PENALTY", "1.04")
        )

    # Validate adapter before switching
    if adapter_id and adapter_id not in self._loaded_adapters:
        raise ValueError(
            f"Adapter '{adapter_id}' has not been loaded. "
            "Call load_adapter() first."
        )

    # Switch adapter if needed
    if adapter_id and adapter_id != self._active_adapter:
        self._activate_adapter(adapter_id)
    elif not adapter_id and self._active_adapter:
        self._deactivate_adapter()

    # Build chat-formatted prompt via tokenizer's chat template
    formatted = self._format_prompt(prompt, system_prompt)
    inputs = self._tokenizer(
        formatted, return_tensors="pt", truncation=True, max_length=8192
    )
    inputs = {k: v.to(self._device) for k, v in inputs.items()}
    input_len = inputs["input_ids"].shape[1]

    gen_kwargs: dict[str, object] = {
        "max_new_tokens": max_tokens,
        "do_sample": temperature > 0,
        "temperature": max(temperature, 0.01),
        "top_p": top_p,
        "pad_token_id": self._tokenizer.pad_token_id,
    }
    if repetition_penalty > 1.0:
        gen_kwargs["repetition_penalty"] = repetition_penalty

    with torch.no_grad():
        outputs = self._model.generate(**inputs, **gen_kwargs)

    new_tokens = outputs[0][input_len:]
    text = self._tokenizer.decode(new_tokens, skip_special_tokens=True)
    total_tokens = outputs.shape[1]
    new_token_count = len(new_tokens)

    # Detect truncation: generated exactly max_tokens means cut off
    finish_reason = "length" if new_token_count >= max_tokens else "stop"

    return GenerationResult(
        text=text,
        model=self._model_name,
        adapter_id=self._active_adapter,
        token_count=total_tokens,
        finish_reason=finish_reason,
    )
load_adapter async
load_adapter(adapter_id: str, adapter_path: str) -> None

Register a PEFT adapter directory for use during generation.

The adapter directory must contain adapter_model.safetensors and adapter_config.json as output by save_hypernetwork_adapter().
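
The expected on-disk layout of such a directory (the directory name is illustrative):

/adapters/adapter-001/
├── adapter_config.json
└── adapter_model.safetensors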

Parameters:

Name Type Description Default
adapter_id str

Unique name for the adapter.

required
adapter_path str

Path to the PEFT adapter directory.

required
Source code in libs/inference/src/inference/transformers_provider.py
async def load_adapter(self, adapter_id: str, adapter_path: str) -> None:
    """Register a PEFT adapter directory for use during generation.

    The adapter directory must contain adapter_model.safetensors and
    adapter_config.json as output by save_hypernetwork_adapter().

    Args:
        adapter_id: Unique name for the adapter.
        adapter_path: Path to the PEFT adapter directory.
    """
    self._loaded_adapters[adapter_id] = adapter_path
    logger.info("Registered adapter %s -> %s", adapter_id, adapter_path)
unload_adapter async
unload_adapter(adapter_id: str) -> None

Remove a registered adapter, freeing GPU memory.

Parameters:

Name Type Description Default
adapter_id str

The adapter name to remove.

required
Source code in libs/inference/src/inference/transformers_provider.py
async def unload_adapter(self, adapter_id: str) -> None:
    """Remove a registered adapter, freeing GPU memory.

    Args:
        adapter_id: The adapter name to remove.
    """
    if adapter_id in self._loaded_adapters:
        if self._active_adapter == adapter_id:
            self._deactivate_adapter()
        # Delete from PeftModel to free GPU memory
        if self._is_peft_wrapped and adapter_id in self._model.peft_config:
            self._model.delete_adapter(adapter_id)
        del self._loaded_adapters[adapter_id]
        # If no adapters remain, revert to base model
        if not self._loaded_adapters and self._is_peft_wrapped:
            self._model = self._base_model
            self._is_peft_wrapped = False
            logger.info("All adapters removed, reverted to base model")
        else:
            logger.info("Unloaded adapter %s", adapter_id)
list_adapters async
list_adapters() -> list[str]

List all registered adapter IDs.

Returns:

Type Description
list[str]

Sorted list of registered adapter IDs.

Source code in libs/inference/src/inference/transformers_provider.py
async def list_adapters(self) -> list[str]:
    """List all registered adapter IDs.

    Returns:
        Sorted list of registered adapter IDs.
    """
    return sorted(self._loaded_adapters.keys())

VLLMProvider

VLLMProvider(base_url: str | None = None)

Bases: InferenceProvider

InferenceProvider backed by a vLLM server with LoRA hot-loading support.

Communicates with vLLM via two channels
  • AsyncOpenAI SDK for generation (OpenAI-compatible endpoint).
  • httpx for LoRA adapter management (vLLM proprietary endpoints).

Adapter tracking is maintained in an internal set to work around vLLM bug #11761 (list_lora_adapters unreliable after concurrent loads).
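
A sketch of the full adapter lifecycle against a running vLLM server; the adapter name and path are illustrative:

provider = VLLMProvider(base_url="http://localhost:8100/v1")

# POST /v1/load_lora_adapter, then track the name locally.
await provider.load_adapter("adapter-001", "/models/adapter-001")

# The adapter name is passed as the model so vLLM routes to the LoRA.
result = await provider.generate(
    "def fib",
    model="Qwen2.5-Coder-7B",
    adapter_id="adapter-001",
)

print(await provider.list_adapters())   # ["adapter-001"]

await provider.unload_adapter("adapter-001")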

Attributes:

Name Type Description
_client

AsyncOpenAI client pointing at the vLLM server.

_base_url

Base URL string for constructing adapter management URLs.

_loaded_adapters set[str]

Set of currently tracked adapter IDs.

Example

provider = VLLMProvider(base_url="http://localhost:8100/v1")
result = await provider.generate("def hello", model="Qwen2.5-Coder-7B")

Initialize VLLMProvider with an AsyncOpenAI client.

Parameters:

Name Type Description Default
base_url str | None

Override URL for the vLLM server. Defaults to VLLM_BASE_URL env var or http://localhost:8100/v1.

None
Source code in libs/inference/src/inference/vllm_provider.py
def __init__(self, base_url: str | None = None) -> None:
    """Initialize VLLMProvider with an AsyncOpenAI client.

    Args:
        base_url: Override URL for the vLLM server. Defaults to
            VLLM_BASE_URL env var or http://localhost:8100/v1.
    """
    self._base_url = base_url or VLLM_BASE_URL
    self._client = AsyncOpenAI(
        base_url=self._base_url,
        api_key="not-needed-for-local-vllm",
    )
    self._loaded_adapters: set[str] = set()
Functions
generate async
generate(
    prompt: str,
    model: str,
    adapter_id: str | None = None,
    max_tokens: int = 4096,
    system_prompt: str | None = None,
    temperature: float | None = None,
    top_p: float | None = None,
    repetition_penalty: float | None = None,
) -> GenerationResult

Generate text from a prompt, optionally using a loaded LoRA adapter.

When adapter_id is provided, it is passed as the model parameter to the OpenAI API — this is how vLLM identifies and routes to loaded LoRA adapters (the adapter is referenced by its lora_name).

Parameters:

Name Type Description Default
prompt str

The user-facing input prompt.

required
model str

Base model identifier. Used as-is when no adapter is given.

required
adapter_id str | None

Name of a loaded LoRA adapter to apply. When set, this value replaces model in the API call.

None
max_tokens int

Maximum number of tokens to generate.

4096
system_prompt str | None

Optional system-level instruction.

None
temperature float | None

Sampling temperature override.

None
top_p float | None

Nucleus sampling threshold override.

None
repetition_penalty float | None

Repetition penalty override.

None

Returns:

Type Description
GenerationResult

GenerationResult with the generated text and metadata.

Example

result = await provider.generate("def fib", model="Qwen2.5-Coder-7B") print(result.text)

Source code in libs/inference/src/inference/vllm_provider.py
async def generate(
    self,
    prompt: str,
    model: str,
    adapter_id: str | None = None,
    max_tokens: int = 4096,
    system_prompt: str | None = None,
    temperature: float | None = None,
    top_p: float | None = None,
    repetition_penalty: float | None = None,
) -> GenerationResult:
    """Generate text from a prompt, optionally using a loaded LoRA adapter.

    When adapter_id is provided, it is passed as the model parameter to
    the OpenAI API — this is how vLLM identifies and routes to loaded
    LoRA adapters (the adapter is referenced by its lora_name).

    Args:
        prompt: The user-facing input prompt.
        model: Base model identifier. Used as-is when no adapter is given.
        adapter_id: Name of a loaded LoRA adapter to apply. When set,
            this value replaces model in the API call.
        max_tokens: Maximum number of tokens to generate.
        system_prompt: Optional system-level instruction.
        temperature: Sampling temperature override.
        top_p: Nucleus sampling threshold override.
        repetition_penalty: Repetition penalty override.

    Returns:
        GenerationResult with the generated text and metadata.

    Example:
        >>> result = await provider.generate("def fib", model="Qwen2.5-Coder-7B")
        >>> print(result.text)
    """
    effective_model = adapter_id if adapter_id is not None else model
    logger.debug(
        "generate: model=%s adapter_id=%s max_tokens=%d",
        effective_model,
        adapter_id,
        max_tokens,
    )

    messages: list[dict[str, str]] = [{"role": "user", "content": prompt}]
    if system_prompt:
        messages.insert(0, {"role": "system", "content": system_prompt})

    response = await self._client.chat.completions.create(
        model=effective_model,
        messages=messages,  # type: ignore[arg-type]
        max_tokens=max_tokens,
    )

    choice = response.choices[0]
    return GenerationResult(
        text=choice.message.content or "",
        model=response.model,
        adapter_id=adapter_id,
        token_count=response.usage.total_tokens if response.usage else 0,
        finish_reason=choice.finish_reason or "stop",
    )
load_adapter async
load_adapter(adapter_id: str, adapter_path: str) -> None

Load a LoRA adapter into the vLLM server.

Posts to vLLM's /v1/load_lora_adapter endpoint and adds the adapter to the internal tracking set on success.

Parameters:

Name Type Description Default
adapter_id str

Unique name for the adapter (used as lora_name).

required
adapter_path str

Filesystem path to the adapter weights directory.

required

Raises:

Type Description
HTTPStatusError

If the vLLM server returns an error response.

Example

await provider.load_adapter("adapter-001", "/models/adapter-001")

Source code in libs/inference/src/inference/vllm_provider.py
async def load_adapter(self, adapter_id: str, adapter_path: str) -> None:
    """Load a LoRA adapter into the vLLM server.

    Posts to vLLM's /v1/load_lora_adapter endpoint and adds the adapter
    to the internal tracking set on success.

    Args:
        adapter_id: Unique name for the adapter (used as lora_name).
        adapter_path: Filesystem path to the adapter weights directory.

    Raises:
        httpx.HTTPStatusError: If the vLLM server returns an error response.

    Example:
        >>> await provider.load_adapter("adapter-001", "/models/adapter-001")
    """
    url = f"{self._base_url.rstrip('/')}/load_lora_adapter"
    logger.debug("load_adapter: POST %s lora_name=%s", url, adapter_id)

    async with httpx.AsyncClient() as client:
        response = await client.post(
            url,
            json={"lora_name": adapter_id, "lora_path": adapter_path},
        )
        response.raise_for_status()

    self._loaded_adapters.add(adapter_id)
    logger.info("Adapter loaded: %s", adapter_id)
unload_adapter async
unload_adapter(adapter_id: str) -> None

Unload a LoRA adapter from the vLLM server.

Posts to vLLM's /v1/unload_lora_adapter endpoint and removes the adapter from the internal tracking set.

Parameters:

Name Type Description Default
adapter_id str

Name of the adapter to unload.

required

Raises:

Type Description
HTTPStatusError

If the vLLM server returns an error response.

Example

await provider.unload_adapter("adapter-001")

Source code in libs/inference/src/inference/vllm_provider.py
async def unload_adapter(self, adapter_id: str) -> None:
    """Unload a LoRA adapter from the vLLM server.

    Posts to vLLM's /v1/unload_lora_adapter endpoint and removes the
    adapter from the internal tracking set.

    Args:
        adapter_id: Name of the adapter to unload.

    Raises:
        httpx.HTTPStatusError: If the vLLM server returns an error response.

    Example:
        >>> await provider.unload_adapter("adapter-001")
    """
    url = f"{self._base_url.rstrip('/')}/unload_lora_adapter"
    logger.debug("unload_adapter: POST %s lora_name=%s", url, adapter_id)

    async with httpx.AsyncClient() as client:
        response = await client.post(
            url,
            json={"lora_name": adapter_id},
        )
        response.raise_for_status()

    self._loaded_adapters.discard(adapter_id)
    logger.info("Adapter unloaded: %s", adapter_id)
list_adapters async
list_adapters() -> list[str]

List all currently loaded LoRA adapters.

Returns the internal tracking set rather than querying vLLM to avoid the unreliable list endpoint (vLLM bug #11761).

Returns:

Type Description
list[str]

Sorted list of adapter IDs currently tracked as loaded.

Example

adapters = await provider.list_adapters()
print(adapters)  # ["adapter-001", "adapter-002"]

Source code in libs/inference/src/inference/vllm_provider.py
async def list_adapters(self) -> list[str]:
    """List all currently loaded LoRA adapters.

    Returns the internal tracking set rather than querying vLLM to avoid
    the unreliable list endpoint (vLLM bug #11761).

    Returns:
        Sorted list of adapter IDs currently tracked as loaded.

    Example:
        >>> adapters = await provider.list_adapters()
        >>> print(adapters)  # ["adapter-001", "adapter-002"]
    """
    return sorted(self._loaded_adapters)

Functions

get_provider

get_provider(
    provider_type: str | None = None,
    base_url: str | None = None,
) -> InferenceProvider

Return a cached InferenceProvider for the given backend.

Resolves the provider type from the argument or the INFERENCE_PROVIDER env var (default: "vllm"). Resolves the base URL (or model path) from the argument or the per-backend env var (VLLM_BASE_URL, OLLAMA_BASE_URL, LLAMACPP_MODEL_PATH, or TRANSFORMERS_MODEL_NAME). Instances are cached by the (provider_type, base_url) tuple so repeated calls with the same arguments return the identical object.
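
A sketch of the caching behaviour, assuming the openai package is installed; the URL is illustrative:

import os

from inference.factory import get_provider

os.environ["INFERENCE_PROVIDER"] = "ollama"
os.environ["OLLAMA_BASE_URL"] = "http://localhost:11434/v1"

a = get_provider()           # type and URL resolved from the environment
b = get_provider("ollama")   # same (provider_type, base_url) cache key
assert a is b                # repeated calls return the identical instance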

Parameters:

Name Type Description Default
provider_type str | None

One of "vllm" or "ollama". If None, falls back to the INFERENCE_PROVIDER environment variable (default "vllm").

None
base_url str | None

Override URL for the backend server. If None, the per-backend default env var is used.

None

Returns:

Type Description
InferenceProvider

A cached InferenceProvider instance for the requested backend.

Raises:

Type Description
ValueError

If provider_type is not "vllm", "ollama", or "llamacpp".

Example

provider = get_provider("vllm") isinstance(provider, VLLMProvider) True

Source code in libs/inference/src/inference/factory.py
def get_provider(
    provider_type: str | None = None,
    base_url: str | None = None,
) -> InferenceProvider:
    """Return a cached InferenceProvider for the given backend.

    Resolves the provider type from the argument or the INFERENCE_PROVIDER
    env var (default: "vllm"). Resolves the base URL from the argument or
    the per-backend env var (VLLM_BASE_URL / OLLAMA_BASE_URL). Instances
    are cached by the (provider_type, base_url) tuple so repeated calls
    with the same arguments return the identical object.

    Args:
        provider_type: One of "vllm" or "ollama". If None, falls back to
            the INFERENCE_PROVIDER environment variable (default "vllm").
        base_url: Override URL for the backend server. If None, the
            per-backend default env var is used.

    Returns:
        A cached InferenceProvider instance for the requested backend.

    Raises:
        ValueError: If provider_type is not "vllm", "ollama", or "llamacpp".

    Example:
        >>> provider = get_provider("vllm")
        >>> isinstance(provider, VLLMProvider)
        True
    """
    ptype = (
        provider_type
        or os.environ.get("INFERENCE_PROVIDER", _DEFAULT_INFERENCE_PROVIDER)
    ).lower()

    resolved_url: str
    if ptype == "vllm":
        resolved_url = base_url or os.environ.get(
            "VLLM_BASE_URL", _DEFAULT_VLLM_BASE_URL
        )
    elif ptype == "ollama":
        resolved_url = base_url or os.environ.get(
            "OLLAMA_BASE_URL", _DEFAULT_OLLAMA_BASE_URL
        )
    elif ptype == "llamacpp":
        resolved_url = base_url or os.environ.get(
            "LLAMACPP_MODEL_PATH", _DEFAULT_LLAMACPP_MODEL_PATH
        )
    elif ptype == "transformers":
        resolved_url = base_url or os.environ.get("TRANSFORMERS_MODEL_NAME", "")
    else:
        raise ValueError(
            f"Unknown provider type: '{ptype}'. "
            "Supported values: 'vllm', 'ollama', 'llamacpp', 'transformers'."
        )

    cache_key = (ptype, resolved_url)
    if cache_key not in _provider_cache:
        if ptype == "vllm":
            from inference.vllm_provider import VLLMProvider

            _provider_cache[cache_key] = VLLMProvider(base_url=resolved_url)
        elif ptype == "llamacpp":
            from inference.llamacpp_provider import LlamaCppProvider

            _provider_cache[cache_key] = LlamaCppProvider(model_path=resolved_url)
        elif ptype == "transformers":
            from shared.hardware import get_best_device

            from inference.transformers_provider import TransformersProvider

            device = os.environ.get("TRANSFORMERS_DEVICE", get_best_device())
            _provider_cache[cache_key] = TransformersProvider(
                model_name=resolved_url, device=device
            )
        else:
            from inference.ollama_provider import OllamaProvider

            _provider_cache[cache_key] = OllamaProvider(base_url=resolved_url)

    return _provider_cache[cache_key]

get_provider_for_step

get_provider_for_step(
    step_config: dict[str, str],
) -> InferenceProvider

Return a cached InferenceProvider configured from a step config dict.

Reads "provider" and optionally "base_url" from the step config and delegates to get_provider(). Designed for use by the agent loop where each pipeline step may specify its own provider and server URL.

Parameters:

Name Type Description Default
step_config dict[str, str]

Dict with optional keys: - "provider": Provider type ("vllm" or "ollama"). - "base_url": Override URL for the backend server.

required

Returns:

Type Description
InferenceProvider

A cached InferenceProvider instance for the step's backend.

Raises:

Type Description
ValueError

If the provider type in step_config is not supported.

Example

provider = get_provider_for_step({"provider": "ollama"})
isinstance(provider, OllamaProvider)  # True

Source code in libs/inference/src/inference/factory.py
def get_provider_for_step(step_config: dict[str, str]) -> InferenceProvider:
    """Return a cached InferenceProvider configured from a step config dict.

    Reads "provider" and optionally "base_url" from the step config and
    delegates to get_provider(). Designed for use by the agent loop where
    each pipeline step may specify its own provider and server URL.

    Args:
        step_config: Dict with optional keys:
            - "provider": Provider type ("vllm" or "ollama").
            - "base_url": Override URL for the backend server.

    Returns:
        A cached InferenceProvider instance for the step's backend.

    Raises:
        ValueError: If the provider type in step_config is not supported.

    Example:
        >>> provider = get_provider_for_step({"provider": "ollama"})
        >>> isinstance(provider, OllamaProvider)
        True
    """
    return get_provider(
        provider_type=step_config.get("provider"),
        base_url=step_config.get("base_url"),
    )

Modules

exceptions

Custom exceptions for the inference library.

factory

Provider factory with instance cache for the inference library.

Selects between VLLMProvider, OllamaProvider, LlamaCppProvider, and TransformersProvider based on configuration, caching instances by (provider_type, base_url) to avoid redundant construction.

Classes
Functions
get_provider
get_provider(
    provider_type: str | None = None,
    base_url: str | None = None,
) -> InferenceProvider

Return a cached InferenceProvider for the given backend.

Resolves the provider type from the argument or the INFERENCE_PROVIDER env var (default: "vllm"). Resolves the base URL from the argument or the per-backend env var (VLLM_BASE_URL / OLLAMA_BASE_URL). Instances are cached by the (provider_type, base_url) tuple so repeated calls with the same arguments return the identical object.

Parameters:

Name Type Description Default
provider_type str | None

One of "vllm" or "ollama". If None, falls back to the INFERENCE_PROVIDER environment variable (default "vllm").

None
base_url str | None

Override URL for the backend server. If None, the per-backend default env var is used.

None

Returns:

Type Description
InferenceProvider

A cached InferenceProvider instance for the requested backend.

Raises:

Type Description
ValueError

If provider_type is not "vllm", "ollama", or "llamacpp".

Example

provider = get_provider("vllm") isinstance(provider, VLLMProvider) True

Source code in libs/inference/src/inference/factory.py
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
def get_provider(
    provider_type: str | None = None,
    base_url: str | None = None,
) -> InferenceProvider:
    """Return a cached InferenceProvider for the given backend.

    Resolves the provider type from the argument or the INFERENCE_PROVIDER
    env var (default: "vllm"). Resolves the base URL from the argument or
    the per-backend env var (VLLM_BASE_URL / OLLAMA_BASE_URL). Instances
    are cached by the (provider_type, base_url) tuple so repeated calls
    with the same arguments return the identical object.

    Args:
        provider_type: One of "vllm" or "ollama". If None, falls back to
            the INFERENCE_PROVIDER environment variable (default "vllm").
        base_url: Override URL for the backend server. If None, the
            per-backend default env var is used.

    Returns:
        A cached InferenceProvider instance for the requested backend.

    Raises:
        ValueError: If provider_type is not "vllm", "ollama", or "llamacpp".

    Example:
        >>> provider = get_provider("vllm")
        >>> isinstance(provider, VLLMProvider)
        True
    """
    ptype = (
        provider_type
        or os.environ.get("INFERENCE_PROVIDER", _DEFAULT_INFERENCE_PROVIDER)
    ).lower()

    resolved_url: str
    if ptype == "vllm":
        resolved_url = base_url or os.environ.get(
            "VLLM_BASE_URL", _DEFAULT_VLLM_BASE_URL
        )
    elif ptype == "ollama":
        resolved_url = base_url or os.environ.get(
            "OLLAMA_BASE_URL", _DEFAULT_OLLAMA_BASE_URL
        )
    elif ptype == "llamacpp":
        resolved_url = base_url or os.environ.get(
            "LLAMACPP_MODEL_PATH", _DEFAULT_LLAMACPP_MODEL_PATH
        )
    elif ptype == "transformers":
        resolved_url = base_url or os.environ.get("TRANSFORMERS_MODEL_NAME", "")
    else:
        raise ValueError(
            f"Unknown provider type: '{ptype}'. "
            "Supported values: 'vllm', 'ollama', 'llamacpp', 'transformers'."
        )

    cache_key = (ptype, resolved_url)
    if cache_key not in _provider_cache:
        if ptype == "vllm":
            from inference.vllm_provider import VLLMProvider

            _provider_cache[cache_key] = VLLMProvider(base_url=resolved_url)
        elif ptype == "llamacpp":
            from inference.llamacpp_provider import LlamaCppProvider

            _provider_cache[cache_key] = LlamaCppProvider(model_path=resolved_url)
        elif ptype == "transformers":
            from shared.hardware import get_best_device

            from inference.transformers_provider import TransformersProvider

            device = os.environ.get("TRANSFORMERS_DEVICE", get_best_device())
            _provider_cache[cache_key] = TransformersProvider(
                model_name=resolved_url, device=device
            )
        else:
            from inference.ollama_provider import OllamaProvider

            _provider_cache[cache_key] = OllamaProvider(base_url=resolved_url)

    return _provider_cache[cache_key]
get_provider_for_step
get_provider_for_step(
    step_config: dict[str, str],
) -> InferenceProvider

Return a cached InferenceProvider configured from a step config dict.

Reads "provider" and optionally "base_url" from the step config and delegates to get_provider(). Designed for use by the agent loop where each pipeline step may specify its own provider and server URL.

Parameters:

Name Type Description Default
step_config dict[str, str]

Dict with optional keys: - "provider": Provider type ("vllm" or "ollama"). - "base_url": Override URL for the backend server.

required

Returns:

Type Description
InferenceProvider

A cached InferenceProvider instance for the step's backend.

Raises:

Type Description
ValueError

If the provider type in step_config is not supported.

Example

provider = get_provider_for_step({"provider": "ollama"}) isinstance(provider, OllamaProvider) True

Source code in libs/inference/src/inference/factory.py
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
def get_provider_for_step(step_config: dict[str, str]) -> InferenceProvider:
    """Return a cached InferenceProvider configured from a step config dict.

    Reads "provider" and optionally "base_url" from the step config and
    delegates to get_provider(). Designed for use by the agent loop where
    each pipeline step may specify its own provider and server URL.

    Args:
        step_config: Dict with optional keys:
            - "provider": Provider type ("vllm" or "ollama").
            - "base_url": Override URL for the backend server.

    Returns:
        A cached InferenceProvider instance for the step's backend.

    Raises:
        ValueError: If the provider type in step_config is not supported.

    Example:
        >>> provider = get_provider_for_step({"provider": "ollama"})
        >>> isinstance(provider, OllamaProvider)
        True
    """
    return get_provider(
        provider_type=step_config.get("provider"),
        base_url=step_config.get("base_url"),
    )
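
A sketch of how a pipeline loop might use this helper; the step configs below are hypothetical:

from inference.factory import get_provider_for_step

# Hypothetical per-step configuration; each step can target its own backend.
step_configs = [
    {"provider": "ollama"},
    {"provider": "vllm", "base_url": "http://localhost:8100/v1"},
]

for config in step_configs:
    provider = get_provider_for_step(config)
    print(type(provider).__name__)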

llamacpp_provider

LlamaCppProvider: InferenceProvider using llama-cpp-python with LoRA support.

Loads GGUF models via llama_cpp.Llama with optional LoRA adapter paths. Designed for Apple Silicon (Metal) local inference where adapter hot-loading is needed — Ollama cannot load LoRA adapters, and vLLM requires a server.

IMPORTANT: llama_cpp is imported inside method bodies per the INFRA-05 pattern so that this module remains importable in CPU-only CI without llama-cpp-python.
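
The lazy-import pattern referenced above looks roughly like the following simplified sketch; the helper name _completion is illustrative, not the module's exact code:

def _completion(self, prompt: str) -> str:
    # llama_cpp is imported inside the method, not at module level, so
    # importing inference.llamacpp_provider never requires llama-cpp-python.
    from llama_cpp import Llama  # noqa: PLC0415

    if self._llm is None:
        self._llm = Llama(model_path=self._model_path, n_ctx=self._n_ctx)
    return self._llm(prompt, max_tokens=64)["choices"][0]["text"]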

Classes
LlamaCppProvider
LlamaCppProvider(
    model_path: str | None = None,
    n_ctx: int = 4096,
    n_gpu_layers: int = -1,
)

Bases: InferenceProvider

InferenceProvider backed by llama-cpp-python with native LoRA support.

Unlike OllamaProvider, this loads GGUF models directly and can apply LoRA adapters at load time. Unlike VLLMProvider, no server is needed.

The model is loaded lazily on first generate() call. When an adapter is loaded, the model is reloaded with the LoRA path applied.

Parameters:

Name Type Description Default
model_path str | None

Path to the GGUF model file.

None
n_ctx int

Context window size. Default: 4096.

4096
n_gpu_layers int

Layers to offload to GPU (-1 = all). Default: -1.

-1
Example

provider = LlamaCppProvider(model_path="/models/qwen2.5-coder-1.5b.gguf")
result = await provider.generate("def hello", model="ignored")

Initialize LlamaCppProvider with model configuration.

Parameters:

Name Type Description Default
model_path str | None

Path to the GGUF model file.

None
n_ctx int

Context window size. Default: 4096.

4096
n_gpu_layers int

Layers to offload to GPU (-1 = all). Default: -1.

-1
Source code in libs/inference/src/inference/llamacpp_provider.py, lines 44-62
def __init__(
    self,
    model_path: str | None = None,
    n_ctx: int = 4096,
    n_gpu_layers: int = -1,
) -> None:
    """Initialize LlamaCppProvider with model configuration.

    Args:
        model_path: Path to the GGUF model file.
        n_ctx: Context window size. Default: 4096.
        n_gpu_layers: Layers to offload to GPU (-1 = all). Default: -1.
    """
    self._model_path = model_path or ""
    self._n_ctx = n_ctx
    self._n_gpu_layers = n_gpu_layers
    self._llm: Any = None
    self._current_lora: str | None = None
    self._loaded_adapters: dict[str, str] = {}  # id -> path
Functions
generate async
generate(
    prompt: str,
    model: str,
    adapter_id: str | None = None,
    max_tokens: int = 4096,
    system_prompt: str | None = None,
    temperature: float | None = None,
    top_p: float | None = None,
    repetition_penalty: float | None = None,
) -> GenerationResult

Generate text using llama-cpp-python with optional LoRA adapter.

Parameters:

Name Type Description Default
prompt str

The user-facing input prompt.

required
model str

Ignored (model is set at construction via model_path).

required
adapter_id str | None

LoRA adapter ID to apply. Must be loaded via load_adapter() before use.

None
max_tokens int

Maximum tokens to generate.

4096
system_prompt str | None

Optional system-level instruction. Prepended to the prompt for llama.cpp (no native chat template support).

None
temperature float | None

Sampling temperature override.

None
top_p float | None

Nucleus sampling threshold override.

None
repetition_penalty float | None

Repetition penalty override.

None

Returns:

Type Description
GenerationResult

GenerationResult with generated text and metadata.

Raises:

Type Description
ValueError

If adapter_id is provided but has not been loaded.

Source code in libs/inference/src/inference/llamacpp_provider.py, lines 100-159
async def generate(
    self,
    prompt: str,
    model: str,
    adapter_id: str | None = None,
    max_tokens: int = 4096,
    system_prompt: str | None = None,
    temperature: float | None = None,
    top_p: float | None = None,
    repetition_penalty: float | None = None,
) -> GenerationResult:
    """Generate text using llama-cpp-python with optional LoRA adapter.

    Args:
        prompt: The user-facing input prompt.
        model: Ignored (model is set at construction via model_path).
        adapter_id: LoRA adapter ID to apply. Must be loaded via
            load_adapter() before use.
        max_tokens: Maximum tokens to generate.
        system_prompt: Optional system-level instruction. Prepended to
            the prompt for llama.cpp (no native chat template support).
        temperature: Sampling temperature override.
        top_p: Nucleus sampling threshold override.
        repetition_penalty: Repetition penalty override.

    Returns:
        GenerationResult with generated text and metadata.

    Raises:
        ValueError: If adapter_id is provided but has not been loaded.
    """
    lora_path: str | None = None
    if adapter_id:
        if adapter_id not in self._loaded_adapters:
            raise ValueError(
                f"Adapter '{adapter_id}' has not been loaded. "
                "Call load_adapter() first."
            )
        lora_path = self._loaded_adapters[adapter_id]

    self._load_model_if_needed(lora_path=lora_path)

    full_prompt = f"{system_prompt}\n\n{prompt}" if system_prompt else prompt
    response = self._llm(  # type: ignore[union-attr]
        full_prompt,
        max_tokens=max_tokens,
        stop=_STOP_SEQUENCES,
    )

    text = response["choices"][0]["text"]
    token_count = response["usage"]["total_tokens"]
    finish_reason = response["choices"][0].get("finish_reason", "stop")

    return GenerationResult(
        text=text,
        model=Path(self._model_path).stem,
        adapter_id=adapter_id,
        token_count=token_count,
        finish_reason=finish_reason,
    )
load_adapter async
load_adapter(adapter_id: str, adapter_path: str) -> None

Register a LoRA adapter for use during generation.

The adapter is applied on the next generate() call by reloading the model with the LoRA path, since llama-cpp-python applies LoRA at model load time.

Parameters:

Name Type Description Default
adapter_id str

Unique name for the adapter.

required
adapter_path str

Filesystem path to the LoRA adapter weights (GGUF format).

required
Source code in libs/inference/src/inference/llamacpp_provider.py, lines 161-172
async def load_adapter(self, adapter_id: str, adapter_path: str) -> None:
    """Register a LoRA adapter for use during generation.

    The adapter is applied on next generate() call by reloading the model
    with the LoRA path. llama-cpp-python applies LoRA at model load time.

    Args:
        adapter_id: Unique name for the adapter.
        adapter_path: Filesystem path to the LoRA adapter weights (GGUF format).
    """
    self._loaded_adapters[adapter_id] = adapter_path
    logger.info("Registered adapter %s -> %s", adapter_id, adapter_path)
unload_adapter async
unload_adapter(adapter_id: str) -> None

Remove a registered LoRA adapter.

If the currently active adapter is unloaded, the model will reload without it on the next generate() call.

Parameters:

Name Type Description Default
adapter_id str

The adapter name to remove.

required
Source code in libs/inference/src/inference/llamacpp_provider.py, lines 174-188
async def unload_adapter(self, adapter_id: str) -> None:
    """Remove a registered LoRA adapter.

    If the currently active adapter is unloaded, the model will reload
    without it on the next generate() call.

    Args:
        adapter_id: The adapter name to remove.
    """
    if adapter_id in self._loaded_adapters:
        was_active = self._current_lora == self._loaded_adapters[adapter_id]
        del self._loaded_adapters[adapter_id]
        if was_active:
            self._current_lora = None  # Force reload without adapter
        logger.info("Unloaded adapter %s", adapter_id)
list_adapters async
list_adapters() -> list[str]

List all registered LoRA adapter IDs.

Returns:

Type Description
list[str]

Sorted list of registered adapter IDs.

Source code in libs/inference/src/inference/llamacpp_provider.py, lines 190-196
async def list_adapters(self) -> list[str]:
    """List all registered LoRA adapter IDs.

    Returns:
        Sorted list of registered adapter IDs.
    """
    return sorted(self._loaded_adapters.keys())
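
Putting the adapter lifecycle together, a typical flow might look like this; the model and adapter paths are illustrative:

import asyncio

from inference.llamacpp_provider import LlamaCppProvider

async def main() -> None:
    provider = LlamaCppProvider(model_path="/models/qwen2.5-coder-1.5b.gguf")

    # Register a GGUF LoRA adapter; the model reloads with it on the next generate().
    await provider.load_adapter("bugfix-lora", "/models/bugfix-lora.gguf")
    print(await provider.list_adapters())  # ["bugfix-lora"]

    result = await provider.generate(
        "def hello", model="ignored", adapter_id="bugfix-lora", max_tokens=128
    )
    print(result.finish_reason, result.token_count)

    await provider.unload_adapter("bugfix-lora")

asyncio.run(main())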

ollama_provider

OllamaProvider: InferenceProvider implementation backed by an Ollama server.

Uses Ollama's OpenAI-compatible endpoint for generation. Adapter operations raise UnsupportedOperationError since Ollama has no LoRA adapter concept.

Classes
OllamaProvider
OllamaProvider(base_url: str | None = None)

Bases: InferenceProvider

InferenceProvider backed by an Ollama server.

Uses Ollama's OpenAI-compatible API (/v1/chat/completions) for generation, keeping the HTTP layer symmetrical with VLLMProvider. Adapter operations are not supported — calling them raises UnsupportedOperationError.

Note

Ollama requires a non-empty api_key but ignores its value. The string "ollama" is used by convention.

Attributes:

Name Type Description
_client

AsyncOpenAI client pointing at the Ollama server.

Example

provider = OllamaProvider(base_url="http://localhost:11434/v1")
result = await provider.generate("def hello", model="qwen2.5-coder:7b")

Initialize OllamaProvider with an AsyncOpenAI client.

Parameters:

Name Type Description Default
base_url str | None

Override URL for the Ollama server. Defaults to OLLAMA_BASE_URL env var or http://localhost:11434/v1.

None
Source code in libs/inference/src/inference/ollama_provider.py, lines 39-51
def __init__(self, base_url: str | None = None) -> None:
    """Initialize OllamaProvider with an AsyncOpenAI client.

    Args:
        base_url: Override URL for the Ollama server. Defaults to
            OLLAMA_BASE_URL env var or http://localhost:11434/v1.
    """
    self._base_url = base_url or OLLAMA_BASE_URL
    # Ollama requires a non-empty api_key but ignores its value.
    self._client = AsyncOpenAI(
        base_url=self._base_url,
        api_key="ollama",
    )
Functions
generate async
generate(
    prompt: str,
    model: str,
    adapter_id: str | None = None,
    max_tokens: int = 4096,
    system_prompt: str | None = None,
    temperature: float | None = None,
    top_p: float | None = None,
    repetition_penalty: float | None = None,
) -> GenerationResult

Generate text from a prompt using the base Ollama model.

If adapter_id is provided, a warning is logged and it is ignored — Ollama does not support LoRA adapters. The base model is always used.

Parameters:

Name Type Description Default
prompt str

The user-facing input prompt.

required
model str

The Ollama model identifier (e.g. "qwen2.5-coder:7b").

required
adapter_id str | None

Ignored. If provided, a warning is logged.

None
max_tokens int

Maximum number of tokens to generate.

4096
system_prompt str | None

Optional system-level instruction.

None
temperature float | None

Sampling temperature override.

None
top_p float | None

Nucleus sampling threshold override.

None
repetition_penalty float | None

Repetition penalty override.

None

Returns:

Type Description
GenerationResult

GenerationResult with adapter_id=None (Ollama has no adapter concept).

Example

result = await provider.generate("def fib", model="qwen2.5-coder:7b") print(result.text)

Source code in libs/inference/src/inference/ollama_provider.py, lines 53-112
async def generate(
    self,
    prompt: str,
    model: str,
    adapter_id: str | None = None,
    max_tokens: int = 4096,
    system_prompt: str | None = None,
    temperature: float | None = None,
    top_p: float | None = None,
    repetition_penalty: float | None = None,
) -> GenerationResult:
    """Generate text from a prompt using the base Ollama model.

    If adapter_id is provided, a warning is logged and it is ignored —
    Ollama does not support LoRA adapters. The base model is always used.

    Args:
        prompt: The user-facing input prompt.
        model: The Ollama model identifier (e.g. "qwen2.5-coder:7b").
        adapter_id: Ignored. If provided, a warning is logged.
        max_tokens: Maximum number of tokens to generate.
        system_prompt: Optional system-level instruction.
        temperature: Sampling temperature override.
        top_p: Nucleus sampling threshold override.
        repetition_penalty: Repetition penalty override.

    Returns:
        GenerationResult with adapter_id=None (Ollama has no adapter concept).

    Example:
        >>> result = await provider.generate("def fib", model="qwen2.5-coder:7b")
        >>> print(result.text)
    """
    if adapter_id is not None:
        logger.warning(
            "OllamaProvider ignoring adapter_id=%s; "
            "Ollama does not support LoRA adapters.",
            adapter_id,
        )

    logger.debug("generate: model=%s max_tokens=%d", model, max_tokens)

    messages: list[dict[str, str]] = [{"role": "user", "content": prompt}]
    if system_prompt:
        messages.insert(0, {"role": "system", "content": system_prompt})

    response = await self._client.chat.completions.create(
        model=model,
        messages=messages,  # type: ignore[arg-type]
        max_tokens=max_tokens,
    )

    choice = response.choices[0]
    return GenerationResult(
        text=choice.message.content or "",
        model=response.model,
        adapter_id=None,
        token_count=response.usage.total_tokens if response.usage else 0,
        finish_reason=choice.finish_reason or "stop",
    )
load_adapter async
load_adapter(adapter_id: str, adapter_path: str) -> None

Not supported by Ollama. Always raises UnsupportedOperationError.

Parameters:

Name Type Description Default
adapter_id str

Unused.

required
adapter_path str

Unused.

required

Raises:

Type Description
UnsupportedOperationError

Always — Ollama does not support LoRA adapter loading. Use VLLMProvider for adapter operations.

Example

await provider.load_adapter("adapter-001", "/models/adapter-001")
# Raises UnsupportedOperationError
Source code in libs/inference/src/inference/ollama_provider.py, lines 114-132
async def load_adapter(self, adapter_id: str, adapter_path: str) -> None:
    """Not supported by Ollama. Always raises UnsupportedOperationError.

    Args:
        adapter_id: Unused.
        adapter_path: Unused.

    Raises:
        UnsupportedOperationError: Always — Ollama does not support
            LoRA adapter loading. Use VLLMProvider for adapter operations.

    Example:
        >>> await provider.load_adapter("adapter-001", "/models/adapter-001")
        # Raises UnsupportedOperationError
    """
    raise UnsupportedOperationError(
        "OllamaProvider does not support LoRA adapter loading. "
        "Use VLLMProvider for adapter operations."
    )
unload_adapter async
unload_adapter(adapter_id: str) -> None

Not supported by Ollama. Always raises UnsupportedOperationError.

Parameters:

Name Type Description Default
adapter_id str

Unused.

required

Raises:

Type Description
UnsupportedOperationError

Always — Ollama does not support LoRA adapter unloading.

Example

await provider.unload_adapter("adapter-001")
# Raises UnsupportedOperationError
Source code in libs/inference/src/inference/ollama_provider.py, lines 134-150
async def unload_adapter(self, adapter_id: str) -> None:
    """Not supported by Ollama. Always raises UnsupportedOperationError.

    Args:
        adapter_id: Unused.

    Raises:
        UnsupportedOperationError: Always — Ollama does not support
            LoRA adapter unloading.

    Example:
        >>> await provider.unload_adapter("adapter-001")
        # Raises UnsupportedOperationError
    """
    raise UnsupportedOperationError(
        "OllamaProvider does not support LoRA adapter unloading."
    )
list_adapters async
list_adapters() -> list[str]

Return an empty list — Ollama has no adapter concept.

Returns:

Type Description
list[str]

Always returns an empty list.

Example

adapters = await provider.list_adapters()
print(adapters)  # []

Source code in libs/inference/src/inference/ollama_provider.py, lines 152-162
async def list_adapters(self) -> list[str]:
    """Return an empty list — Ollama has no adapter concept.

    Returns:
        Always returns an empty list.

    Example:
        >>> adapters = await provider.list_adapters()
        >>> print(adapters)  # []
    """
    return []
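
Callers that may receive either backend can guard adapter operations instead of branching on the provider class; a minimal sketch, assuming both names are re-exported by the inference package:

from inference import InferenceProvider, UnsupportedOperationError

async def try_load_adapter(
    provider: InferenceProvider, adapter_id: str, adapter_path: str
) -> bool:
    """Return True if the adapter was loaded, False if the backend has no LoRA support."""
    try:
        await provider.load_adapter(adapter_id, adapter_path)
    except UnsupportedOperationError:
        return False
    return True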

provider

Abstract base class and shared types for inference providers.

Defines the provider-agnostic API that the agent loop consumes. Concrete implementations (VLLMProvider, OllamaProvider) fulfil this interface for their respective backends.

Classes
GenerationResult dataclass
GenerationResult(
    text: str,
    model: str,
    adapter_id: str | None,
    token_count: int,
    finish_reason: str,
)

Structured result returned by InferenceProvider.generate().

Attributes:

Name Type Description
text str

The generated text output from the model.

model str

The model identifier used for generation.

adapter_id str | None

The LoRA adapter applied during generation, or None if no adapter was used.

token_count int

Total number of tokens consumed (prompt + completion).

finish_reason str

Reason generation stopped (e.g. "stop", "length").

Example

result = GenerationResult(
    text="def hello(): pass",
    model="Qwen/Qwen2.5-Coder-7B",
    adapter_id=None,
    token_count=10,
    finish_reason="stop",
)

InferenceProvider

Bases: ABC

Abstract base class for inference providers.

Defines a provider-agnostic API for text generation and LoRA adapter lifecycle management. All methods are async because every provider communicates over HTTP.

Concrete implementations
  • VLLMProvider: Full LoRA support via vLLM's dynamic loading API.
  • OllamaProvider: Base-model inference only; adapter ops raise UnsupportedOperationError.
Functions
generate abstractmethod async
generate(
    prompt: str,
    model: str,
    adapter_id: str | None = None,
    max_tokens: int = 4096,
    system_prompt: str | None = None,
    temperature: float | None = None,
    top_p: float | None = None,
    repetition_penalty: float | None = None,
) -> GenerationResult

Generate text from a prompt.

Parameters:

Name Type Description Default
prompt str

The user-facing input prompt.

required
model str

The model identifier to use for generation.

required
adapter_id str | None

Optional LoRA adapter to apply during generation. If None, uses the base model directly.

None
max_tokens int

Maximum number of tokens to generate.

4096
system_prompt str | None

Optional system-level instruction. Providers that support chat templates will format this as a system message.

None
temperature float | None

Sampling temperature override.

None
top_p float | None

Nucleus sampling threshold override.

None
repetition_penalty float | None

Repetition penalty override.

None

Returns:

Type Description
GenerationResult

A GenerationResult containing the generated text and metadata.

Example

result = await provider.generate("def hello", model="Qwen2.5-Coder-7B") print(result.text)

Source code in libs/inference/src/inference/provider.py, lines 54-88
@abstractmethod
async def generate(
    self,
    prompt: str,
    model: str,
    adapter_id: str | None = None,
    max_tokens: int = 4096,
    system_prompt: str | None = None,
    temperature: float | None = None,
    top_p: float | None = None,
    repetition_penalty: float | None = None,
) -> GenerationResult:
    """Generate text from a prompt.

    Args:
        prompt: The user-facing input prompt.
        model: The model identifier to use for generation.
        adapter_id: Optional LoRA adapter to apply during generation.
            If None, uses the base model directly.
        max_tokens: Maximum number of tokens to generate.
        system_prompt: Optional system-level instruction. Providers that
            support chat templates will format this as a system message.

        temperature: Sampling temperature override.
        top_p: Nucleus sampling threshold override.
        repetition_penalty: Repetition penalty override.

    Returns:
        A GenerationResult containing the generated text and metadata.

    Example:
        >>> result = await provider.generate("def hello", model="Qwen2.5-Coder-7B")
        >>> print(result.text)
    """
    ...
load_adapter abstractmethod async
load_adapter(adapter_id: str, adapter_path: str) -> None

Load a LoRA adapter into the inference server.

Parameters:

Name Type Description Default
adapter_id str

Unique name for the adapter (used as lora_name in vLLM).

required
adapter_path str

Filesystem path to the adapter weights directory.

required

Raises:

Type Description
UnsupportedOperationError

If the provider does not support adapters.

HTTPStatusError

If the server returns an error response.

Example

await provider.load_adapter("adapter-001", "/models/adapter-001")

Source code in libs/inference/src/inference/provider.py, lines 90-105
@abstractmethod
async def load_adapter(self, adapter_id: str, adapter_path: str) -> None:
    """Load a LoRA adapter into the inference server.

    Args:
        adapter_id: Unique name for the adapter (used as lora_name in vLLM).
        adapter_path: Filesystem path to the adapter weights directory.

    Raises:
        UnsupportedOperationError: If the provider does not support adapters.
        httpx.HTTPStatusError: If the server returns an error response.

    Example:
        >>> await provider.load_adapter("adapter-001", "/models/adapter-001")
    """
    ...
unload_adapter abstractmethod async
unload_adapter(adapter_id: str) -> None

Unload a previously loaded LoRA adapter.

Parameters:

Name Type Description Default
adapter_id str

The adapter name to remove from the server.

required

Raises:

Type Description
UnsupportedOperationError

If the provider does not support adapters.

HTTPStatusError

If the server returns an error response.

Example

await provider.unload_adapter("adapter-001")

Source code in libs/inference/src/inference/provider.py, lines 107-121
@abstractmethod
async def unload_adapter(self, adapter_id: str) -> None:
    """Unload a previously loaded LoRA adapter.

    Args:
        adapter_id: The adapter name to remove from the server.

    Raises:
        UnsupportedOperationError: If the provider does not support adapters.
        httpx.HTTPStatusError: If the server returns an error response.

    Example:
        >>> await provider.unload_adapter("adapter-001")
    """
    ...
list_adapters abstractmethod async
list_adapters() -> list[str]

List all currently loaded LoRA adapters.

Returns:

Type Description
list[str]

Sorted list of adapter IDs currently available for inference. Returns an empty list if no adapters are loaded or the provider does not support adapters.

Example

adapters = await provider.list_adapters()
print(adapters)  # ["adapter-001", "adapter-002"]

Source code in libs/inference/src/inference/provider.py, lines 123-136
@abstractmethod
async def list_adapters(self) -> list[str]:
    """List all currently loaded LoRA adapters.

    Returns:
        Sorted list of adapter IDs currently available for inference.
        Returns an empty list if no adapters are loaded or the provider
        does not support adapters.

    Example:
        >>> adapters = await provider.list_adapters()
        >>> print(adapters)  # ["adapter-001", "adapter-002"]
    """
    ...
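
A new backend only needs to implement these four coroutines. The EchoProvider below is a purely illustrative skeleton (useful as a test double), not part of the library:

from inference.provider import GenerationResult, InferenceProvider

class EchoProvider(InferenceProvider):
    """Toy provider that echoes the prompt back."""

    async def generate(
        self,
        prompt: str,
        model: str,
        adapter_id: str | None = None,
        max_tokens: int = 4096,
        system_prompt: str | None = None,
        temperature: float | None = None,
        top_p: float | None = None,
        repetition_penalty: float | None = None,
    ) -> GenerationResult:
        # Return the prompt unchanged, with a rough whitespace token count.
        return GenerationResult(
            text=prompt,
            model=model,
            adapter_id=adapter_id,
            token_count=len(prompt.split()),
            finish_reason="stop",
        )

    async def load_adapter(self, adapter_id: str, adapter_path: str) -> None:
        pass  # No adapter state to manage.

    async def unload_adapter(self, adapter_id: str) -> None:
        pass

    async def list_adapters(self) -> list[str]:
        return []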

transformers_provider

TransformersProvider: InferenceProvider using HuggingFace transformers + PEFT.

Loads models via AutoModelForCausalLM and applies LoRA adapters via PEFT. This is the only provider that natively supports PEFT-format adapters (safetensors) as output by the hypernetwork.

IMPORTANT: transformers, torch, and peft are imported inside method bodies per the INFRA-05 pattern so that this module remains importable in CPU-only CI.

Classes
TransformersProvider
TransformersProvider(
    model_name: str = "",
    device: str = "cpu",
    torch_dtype: str = "auto",
)

Bases: InferenceProvider

InferenceProvider backed by HuggingFace transformers with PEFT LoRA.

Loads models locally via AutoModelForCausalLM. Adapters are applied via PEFT's PeftModel, which natively reads the safetensors format output by the hypernetwork.

Parameters:

Name Type Description Default
model_name str

HuggingFace model ID or local path.

''
device str

Device to load model onto ('cpu', 'mps', 'cuda').

'cpu'
torch_dtype str

Model dtype ('auto', 'float16', 'bfloat16').

'auto'
Example

provider = TransformersProvider(model_name="Qwen/Qwen2.5-Coder-0.5B")
result = await provider.generate("def hello", model="ignored")

Initialize TransformersProvider.

Parameters:

Name Type Description Default
model_name str

HuggingFace model ID or local path.

''
device str

Device to load model onto.

'cpu'
torch_dtype str

Model dtype string.

'auto'
Source code in libs/inference/src/inference/transformers_provider.py, lines 39-60
def __init__(
    self,
    model_name: str = "",
    device: str = "cpu",
    torch_dtype: str = "auto",
) -> None:
    """Initialize TransformersProvider.

    Args:
        model_name: HuggingFace model ID or local path.
        device: Device to load model onto.
        torch_dtype: Model dtype string.
    """
    self._model_name = model_name
    self._device = device
    self._torch_dtype = torch_dtype
    self._model: Any = None
    self._tokenizer: Any = None
    self._base_model: Any = None
    self._loaded_adapters: dict[str, str] = {}  # id -> path
    self._active_adapter: str | None = None
    self._is_peft_wrapped: bool = False
Functions
generate async
generate(
    prompt: str,
    model: str,
    adapter_id: str | None = None,
    max_tokens: int = 4096,
    system_prompt: str | None = None,
    temperature: float | None = None,
    top_p: float | None = None,
    repetition_penalty: float | None = None,
) -> GenerationResult

Generate text using transformers with optional PEFT adapter.

Parameters:

Name Type Description Default
prompt str

The user-facing input prompt.

required
model str

Ignored (model is set at construction).

required
adapter_id str | None

LoRA adapter ID to activate. Must be loaded via load_adapter() before use.

None
max_tokens int

Maximum tokens to generate.

4096
system_prompt str | None

Optional system-level instruction prepended via the tokenizer's chat template when available.

None
temperature float | None

Sampling temperature (default from pipeline config).

None
top_p float | None

Nucleus sampling threshold (default from pipeline config).

None
repetition_penalty float | None

Repetition penalty (default 1.0 = off).

None

Returns:

Type Description
GenerationResult

GenerationResult with generated text and metadata.

Raises:

Type Description
ValueError

If adapter_id is provided but has not been loaded.

Source code in libs/inference/src/inference/transformers_provider.py, lines 104-197
async def generate(
    self,
    prompt: str,
    model: str,
    adapter_id: str | None = None,
    max_tokens: int = 4096,
    system_prompt: str | None = None,
    temperature: float | None = None,
    top_p: float | None = None,
    repetition_penalty: float | None = None,
) -> GenerationResult:
    """Generate text using transformers with optional PEFT adapter.

    Args:
        prompt: The user-facing input prompt.
        model: Ignored (model is set at construction).
        adapter_id: LoRA adapter ID to activate. Must be loaded via
            load_adapter() before use.
        max_tokens: Maximum tokens to generate.
        system_prompt: Optional system-level instruction prepended via
            the tokenizer's chat template when available.
        temperature: Sampling temperature (default from pipeline config).
        top_p: Nucleus sampling threshold (default from pipeline config).
        repetition_penalty: Repetition penalty (default 1.0 = off).

    Returns:
        GenerationResult with generated text and metadata.

    Raises:
        ValueError: If adapter_id is provided but has not been loaded.
    """
    import torch  # noqa: PLC0415

    self._load_model_if_needed()

    # Apply defaults from pipeline config
    if temperature is None:
        temperature = float(os.environ.get("RUNE_TEMPERATURE", "0.25"))
    if top_p is None:
        top_p = float(os.environ.get("RUNE_TOP_P", "0.9"))
    if repetition_penalty is None:
        repetition_penalty = float(
            os.environ.get("RUNE_REPETITION_PENALTY", "1.04")
        )

    # Validate adapter before switching
    if adapter_id and adapter_id not in self._loaded_adapters:
        raise ValueError(
            f"Adapter '{adapter_id}' has not been loaded. "
            "Call load_adapter() first."
        )

    # Switch adapter if needed
    if adapter_id and adapter_id != self._active_adapter:
        self._activate_adapter(adapter_id)
    elif not adapter_id and self._active_adapter:
        self._deactivate_adapter()

    # Build chat-formatted prompt via tokenizer's chat template
    formatted = self._format_prompt(prompt, system_prompt)
    inputs = self._tokenizer(
        formatted, return_tensors="pt", truncation=True, max_length=8192
    )
    inputs = {k: v.to(self._device) for k, v in inputs.items()}
    input_len = inputs["input_ids"].shape[1]

    gen_kwargs: dict[str, object] = {
        "max_new_tokens": max_tokens,
        "do_sample": temperature > 0,
        "temperature": max(temperature, 0.01),
        "top_p": top_p,
        "pad_token_id": self._tokenizer.pad_token_id,
    }
    if repetition_penalty > 1.0:
        gen_kwargs["repetition_penalty"] = repetition_penalty

    with torch.no_grad():
        outputs = self._model.generate(**inputs, **gen_kwargs)

    new_tokens = outputs[0][input_len:]
    text = self._tokenizer.decode(new_tokens, skip_special_tokens=True)
    total_tokens = outputs.shape[1]
    new_token_count = len(new_tokens)

    # Detect truncation: generated exactly max_tokens means cut off
    finish_reason = "length" if new_token_count >= max_tokens else "stop"

    return GenerationResult(
        text=text,
        model=self._model_name,
        adapter_id=self._active_adapter,
        token_count=total_tokens,
        finish_reason=finish_reason,
    )
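
When temperature, top_p, or repetition_penalty are not passed, the values fall back to the RUNE_* environment variables shown in the code above; a small sketch of relying on those defaults (the model name is illustrative):

import asyncio
import os

from inference.transformers_provider import TransformersProvider

async def main() -> None:
    # Mirror the fallbacks read inside generate(); set these before calling it.
    os.environ["RUNE_TEMPERATURE"] = "0.0"   # greedy decoding (do_sample=False)
    os.environ["RUNE_TOP_P"] = "0.9"

    provider = TransformersProvider(model_name="Qwen/Qwen2.5-Coder-0.5B", device="cpu")
    result = await provider.generate("def fib(n):", model="ignored", max_tokens=64)
    # finish_reason is "length" when the max_tokens budget was exhausted.
    print(result.finish_reason, result.token_count)

asyncio.run(main())
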
load_adapter async
load_adapter(adapter_id: str, adapter_path: str) -> None

Register a PEFT adapter directory for use during generation.

The adapter directory must contain adapter_model.safetensors and adapter_config.json as output by save_hypernetwork_adapter().

Parameters:

Name Type Description Default
adapter_id str

Unique name for the adapter.

required
adapter_path str

Path to the PEFT adapter directory.

required
Source code in libs/inference/src/inference/transformers_provider.py, lines 267-278
async def load_adapter(self, adapter_id: str, adapter_path: str) -> None:
    """Register a PEFT adapter directory for use during generation.

    The adapter directory must contain adapter_model.safetensors and
    adapter_config.json as output by save_hypernetwork_adapter().

    Args:
        adapter_id: Unique name for the adapter.
        adapter_path: Path to the PEFT adapter directory.
    """
    self._loaded_adapters[adapter_id] = adapter_path
    logger.info("Registered adapter %s -> %s", adapter_id, adapter_path)
unload_adapter async
unload_adapter(adapter_id: str) -> None

Remove a registered adapter, freeing GPU memory.

Parameters:

Name Type Description Default
adapter_id str

The adapter name to remove.

required
Source code in libs/inference/src/inference/transformers_provider.py, lines 280-299
async def unload_adapter(self, adapter_id: str) -> None:
    """Remove a registered adapter, freeing GPU memory.

    Args:
        adapter_id: The adapter name to remove.
    """
    if adapter_id in self._loaded_adapters:
        if self._active_adapter == adapter_id:
            self._deactivate_adapter()
        # Delete from PeftModel to free GPU memory
        if self._is_peft_wrapped and adapter_id in self._model.peft_config:
            self._model.delete_adapter(adapter_id)
        del self._loaded_adapters[adapter_id]
        # If no adapters remain, revert to base model
        if not self._loaded_adapters and self._is_peft_wrapped:
            self._model = self._base_model
            self._is_peft_wrapped = False
            logger.info("All adapters removed, reverted to base model")
        else:
            logger.info("Unloaded adapter %s", adapter_id)
list_adapters async
list_adapters() -> list[str]

List all registered adapter IDs.

Returns:

Type Description
list[str]

Sorted list of registered adapter IDs.

Source code in libs/inference/src/inference/transformers_provider.py, lines 301-307
async def list_adapters(self) -> list[str]:
    """List all registered adapter IDs.

    Returns:
        Sorted list of registered adapter IDs.
    """
    return sorted(self._loaded_adapters.keys())
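
A PEFT adapter round trip with this provider might look like the following; the adapter directory is hypothetical but, as noted above, must contain adapter_model.safetensors and adapter_config.json:

import asyncio

from inference.transformers_provider import TransformersProvider

async def main() -> None:
    provider = TransformersProvider(model_name="Qwen/Qwen2.5-Coder-0.5B", device="cpu")

    # Register, use, and remove a PEFT adapter directory.
    await provider.load_adapter("task-001", "/adapters/task-001")
    result = await provider.generate(
        "def add(a, b):", model="ignored", adapter_id="task-001"
    )
    print(result.adapter_id)  # "task-001"

    await provider.unload_adapter("task-001")
    print(await provider.list_adapters())  # []

asyncio.run(main())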

vllm_provider

VLLMProvider: InferenceProvider implementation backed by a vLLM server.

Uses the OpenAI-compatible API for generation and vLLM's proprietary LoRA management endpoints for hot-loading adapters at runtime.

Classes
VLLMProvider
VLLMProvider(base_url: str | None = None)

Bases: InferenceProvider

InferenceProvider backed by a vLLM server with LoRA hot-loading support.

Communicates with vLLM via two channels
  • AsyncOpenAI SDK for generation (OpenAI-compatible endpoint).
  • httpx for LoRA adapter management (vLLM proprietary endpoints).

Adapter tracking is maintained in an internal set to work around vLLM bug #11761 (list_lora_adapters unreliable after concurrent loads).

Attributes:

Name Type Description
_client

AsyncOpenAI client pointing at the vLLM server.

_base_url

Base URL string for constructing adapter management URLs.

_loaded_adapters set[str]

Set of currently tracked adapter IDs.

Example

provider = VLLMProvider(base_url="http://localhost:8100/v1")
result = await provider.generate("def hello", model="Qwen2.5-Coder-7B")

Initialize VLLMProvider with an AsyncOpenAI client.

Parameters:

Name Type Description Default
base_url str | None

Override URL for the vLLM server. Defaults to VLLM_BASE_URL env var or http://localhost:8100/v1.

None
Source code in libs/inference/src/inference/vllm_provider.py, lines 40-52
def __init__(self, base_url: str | None = None) -> None:
    """Initialize VLLMProvider with an AsyncOpenAI client.

    Args:
        base_url: Override URL for the vLLM server. Defaults to
            VLLM_BASE_URL env var or http://localhost:8100/v1.
    """
    self._base_url = base_url or VLLM_BASE_URL
    self._client = AsyncOpenAI(
        base_url=self._base_url,
        api_key="not-needed-for-local-vllm",
    )
    self._loaded_adapters: set[str] = set()
Functions
generate async
generate(
    prompt: str,
    model: str,
    adapter_id: str | None = None,
    max_tokens: int = 4096,
    system_prompt: str | None = None,
    temperature: float | None = None,
    top_p: float | None = None,
    repetition_penalty: float | None = None,
) -> GenerationResult

Generate text from a prompt, optionally using a loaded LoRA adapter.

When adapter_id is provided, it is passed as the model parameter to the OpenAI API — this is how vLLM identifies and routes to loaded LoRA adapters (the adapter is referenced by its lora_name).

Parameters:

Name Type Description Default
prompt str

The user-facing input prompt.

required
model str

Base model identifier. Used as-is when no adapter is given.

required
adapter_id str | None

Name of a loaded LoRA adapter to apply. When set, this value replaces model in the API call.

None
max_tokens int

Maximum number of tokens to generate.

4096
system_prompt str | None

Optional system-level instruction.

None
temperature float | None

Sampling temperature override.

None
top_p float | None

Nucleus sampling threshold override.

None
repetition_penalty float | None

Repetition penalty override.

None

Returns:

Type Description
GenerationResult

GenerationResult with the generated text and metadata.

Example

result = await provider.generate("def fib", model="Qwen2.5-Coder-7B") print(result.text)

Source code in libs/inference/src/inference/vllm_provider.py, lines 54-114
async def generate(
    self,
    prompt: str,
    model: str,
    adapter_id: str | None = None,
    max_tokens: int = 4096,
    system_prompt: str | None = None,
    temperature: float | None = None,
    top_p: float | None = None,
    repetition_penalty: float | None = None,
) -> GenerationResult:
    """Generate text from a prompt, optionally using a loaded LoRA adapter.

    When adapter_id is provided, it is passed as the model parameter to
    the OpenAI API — this is how vLLM identifies and routes to loaded
    LoRA adapters (the adapter is referenced by its lora_name).

    Args:
        prompt: The user-facing input prompt.
        model: Base model identifier. Used as-is when no adapter is given.
        adapter_id: Name of a loaded LoRA adapter to apply. When set,
            this value replaces model in the API call.
        max_tokens: Maximum number of tokens to generate.
        system_prompt: Optional system-level instruction.
        temperature: Sampling temperature override.
        top_p: Nucleus sampling threshold override.
        repetition_penalty: Repetition penalty override.

    Returns:
        GenerationResult with the generated text and metadata.

    Example:
        >>> result = await provider.generate("def fib", model="Qwen2.5-Coder-7B")
        >>> print(result.text)
    """
    effective_model = adapter_id if adapter_id is not None else model
    logger.debug(
        "generate: model=%s adapter_id=%s max_tokens=%d",
        effective_model,
        adapter_id,
        max_tokens,
    )

    messages: list[dict[str, str]] = [{"role": "user", "content": prompt}]
    if system_prompt:
        messages.insert(0, {"role": "system", "content": system_prompt})

    response = await self._client.chat.completions.create(
        model=effective_model,
        messages=messages,  # type: ignore[arg-type]
        max_tokens=max_tokens,
    )

    choice = response.choices[0]
    return GenerationResult(
        text=choice.message.content or "",
        model=response.model,
        adapter_id=adapter_id,
        token_count=response.usage.total_tokens if response.usage else 0,
        finish_reason=choice.finish_reason or "stop",
    )
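
Because a loaded adapter is addressed by sending its name in place of the model, switching between the base model and an adapter is just a matter of the adapter_id argument; a brief sketch (the adapter must already be loaded, and the names are illustrative):

from inference.vllm_provider import VLLMProvider

async def compare(provider: VLLMProvider) -> None:
    # Routed to the base model.
    base = await provider.generate("def fib", model="Qwen2.5-Coder-7B")
    # Routed to a loaded LoRA adapter; "adapter-001" replaces the model field
    # in the underlying chat completion request.
    tuned = await provider.generate(
        "def fib", model="Qwen2.5-Coder-7B", adapter_id="adapter-001"
    )
    print(len(base.text), len(tuned.text))
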
load_adapter async
load_adapter(adapter_id: str, adapter_path: str) -> None

Load a LoRA adapter into the vLLM server.

Posts to vLLM's /v1/load_lora_adapter endpoint and adds the adapter to the internal tracking set on success.

Parameters:

Name Type Description Default
adapter_id str

Unique name for the adapter (used as lora_name).

required
adapter_path str

Filesystem path to the adapter weights directory.

required

Raises:

Type Description
HTTPStatusError

If the vLLM server returns an error response.

Example

await provider.load_adapter("adapter-001", "/models/adapter-001")

Source code in libs/inference/src/inference/vllm_provider.py, lines 116-143
async def load_adapter(self, adapter_id: str, adapter_path: str) -> None:
    """Load a LoRA adapter into the vLLM server.

    Posts to vLLM's /v1/load_lora_adapter endpoint and adds the adapter
    to the internal tracking set on success.

    Args:
        adapter_id: Unique name for the adapter (used as lora_name).
        adapter_path: Filesystem path to the adapter weights directory.

    Raises:
        httpx.HTTPStatusError: If the vLLM server returns an error response.

    Example:
        >>> await provider.load_adapter("adapter-001", "/models/adapter-001")
    """
    url = f"{self._base_url.rstrip('/')}/load_lora_adapter"
    logger.debug("load_adapter: POST %s lora_name=%s", url, adapter_id)

    async with httpx.AsyncClient() as client:
        response = await client.post(
            url,
            json={"lora_name": adapter_id, "lora_path": adapter_path},
        )
        response.raise_for_status()

    self._loaded_adapters.add(adapter_id)
    logger.info("Adapter loaded: %s", adapter_id)
unload_adapter async
unload_adapter(adapter_id: str) -> None

Unload a LoRA adapter from the vLLM server.

Posts to vLLM's /v1/unload_lora_adapter endpoint and removes the adapter from the internal tracking set.

Parameters:

Name Type Description Default
adapter_id str

Name of the adapter to unload.

required

Raises:

Type Description
HTTPStatusError

If the vLLM server returns an error response.

Example

await provider.unload_adapter("adapter-001")

Source code in libs/inference/src/inference/vllm_provider.py, lines 145-171
async def unload_adapter(self, adapter_id: str) -> None:
    """Unload a LoRA adapter from the vLLM server.

    Posts to vLLM's /v1/unload_lora_adapter endpoint and removes the
    adapter from the internal tracking set.

    Args:
        adapter_id: Name of the adapter to unload.

    Raises:
        httpx.HTTPStatusError: If the vLLM server returns an error response.

    Example:
        >>> await provider.unload_adapter("adapter-001")
    """
    url = f"{self._base_url.rstrip('/')}/unload_lora_adapter"
    logger.debug("unload_adapter: POST %s lora_name=%s", url, adapter_id)

    async with httpx.AsyncClient() as client:
        response = await client.post(
            url,
            json={"lora_name": adapter_id},
        )
        response.raise_for_status()

    self._loaded_adapters.discard(adapter_id)
    logger.info("Adapter unloaded: %s", adapter_id)
list_adapters async
list_adapters() -> list[str]

List all currently loaded LoRA adapters.

Returns the internal tracking set rather than querying vLLM to avoid the unreliable list endpoint (vLLM bug #11761).

Returns:

Type Description
list[str]

Sorted list of adapter IDs currently tracked as loaded.

Example

adapters = await provider.list_adapters()
print(adapters)  # ["adapter-001", "adapter-002"]

Source code in libs/inference/src/inference/vllm_provider.py, lines 173-186
async def list_adapters(self) -> list[str]:
    """List all currently loaded LoRA adapters.

    Returns the internal tracking set rather than querying vLLM to avoid
    the unreliable list endpoint (vLLM bug #11761).

    Returns:
        Sorted list of adapter IDs currently tracked as loaded.

    Example:
        >>> adapters = await provider.list_adapters()
        >>> print(adapters)  # ["adapter-001", "adapter-002"]
    """
    return sorted(self._loaded_adapters)
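
End to end, hot-loading an adapter into a running vLLM server and generating through it might look like this; the server URL and paths are illustrative:

import asyncio

from inference.vllm_provider import VLLMProvider

async def main() -> None:
    provider = VLLMProvider(base_url="http://localhost:8100/v1")

    # Hot-load the adapter, generate through it, then unload it.
    await provider.load_adapter("adapter-001", "/models/adapter-001")
    print(await provider.list_adapters())  # ["adapter-001"]

    result = await provider.generate(
        "def fib", model="Qwen2.5-Coder-7B", adapter_id="adapter-001"
    )
    print(result.text)

    await provider.unload_adapter("adapter-001")

asyncio.run(main())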