Hosting an open source LLM: vLLM, Ollama, and sovereignty

June 11, 2026 · 5 min read · by Scroll

When sovereignty or scale demands it, self-hosting an open source model is the way to go. Ollama for getting started, vLLM for production.

Using an LLM via an API (OpenAI, Mistral) is straightforward. But as soon as the conversation turns to strict sovereignty or very high volume, another path emerges: self-hosting an open source model. Two tools stand out, Ollama and vLLM, each for a distinct use case.

Why self-host an LLM

Three reasons, rarely just one:

Sovereignty: your data never leaves your infrastructure. Critical in healthcare, finance, defense, and the public sector.
Cost: at very high volume, cumulative per-call API fees exceed the fixed cost of a well-utilized dedicated GPU.
Control: fixed model version, no dependency on a provider that might change prices or models.
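
The cost argument comes down to a break-even volume. Here is a minimal sketch of that arithmetic, assuming a flat monthly GPU price and a flat per-request API price. Both function names are hypothetical, any figures you plug in are your own, and the model ignores ops time and assumes the GPU can actually absorb the volume.

```python
def breakeven_requests_per_month(gpu_monthly_cost: float,
                                 api_cost_per_request: float) -> float:
    """Monthly request volume above which a dedicated GPU beats the API.

    Simplifying assumptions: flat prices, no ops cost, and a GPU able to
    serve the whole volume on its own.
    """
    if api_cost_per_request <= 0:
        raise ValueError("api_cost_per_request must be positive")
    return gpu_monthly_cost / api_cost_per_request


def cheaper_option(requests_per_month: float,
                   gpu_monthly_cost: float,
                   api_cost_per_request: float) -> str:
    """Return "self-hosted" or "api", whichever costs less at this volume."""
    api_total = requests_per_month * api_cost_per_request
    return "self-hosted" if gpu_monthly_cost < api_total else "api"
```

With an illustrative GPU at 1,000 per month and an API at 0.002 per request, the break-even sits around 500,000 requests per month. Below that the API is cheaper, above it the GPU wins, provided it stays well utilized.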

Ollama: simple, for getting started and local use

Ollama makes it nearly effortless to run an open source model (Llama, Mistral, etc.) on a single machine. Ideal for prototyping, local use, or moderate volume. Its limitation: it’s not designed to handle thousands of concurrent requests in production.
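
To give an idea of what "local use" looks like, here is a short sketch that calls Ollama's local HTTP API from Python using only the standard library. It assumes Ollama is running on its default port (11434) and that the model has already been pulled. The helper names are ours, not part of Ollama.

```python
import json
import urllib.request

# Default address of a locally running Ollama server (assumption: unchanged config).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generation request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text (requires a running server)."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

After `ollama pull mistral`, a call such as `generate("mistral", "Summarize data sovereignty in one sentence.")` should return the model's answer as a string. Nothing leaves the machine, which is the point of the sovereignty argument.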

vLLM: high-throughput production

vLLM is an inference engine optimized for throughput. On GPU instances (for example at Scaleway or OVH), it serves many requests in parallel with controlled latency, thanks to techniques like continuous batching. It’s the tool to reach for when self-hosting needs to handle real load.
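
To see why continuous batching helps throughput, here is a toy simulation. It is not vLLM's scheduler: it assumes all requests are waiting at time zero, that each needs a known number of decode steps (at least 1), and it ignores prefill and GPU memory. Function names are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, capacity):
    """Steps needed when each batch runs until its longest sequence finishes."""
    total = 0
    for i in range(0, len(lengths), capacity):
        # The whole batch is held until its slowest member is done.
        total += max(lengths[i:i + capacity])
    return total


def continuous_batching_steps(lengths, capacity):
    """Steps needed when a finished sequence's slot is refilled right away."""
    queue = deque(lengths)
    active = []  # remaining tokens for each running sequence
    steps = 0
    while queue or active:
        # Admit waiting requests into any free slots before the next step.
        while queue and len(active) < capacity:
            active.append(queue.popleft())
        # One decode step: every active sequence emits one token;
        # sequences that just emitted their last token leave the batch.
        active = [remaining - 1 for remaining in active if remaining > 1]
        steps += 1
    return steps
```

With lengths `[10, 2, 2, 2]` and two slots, the static version takes 12 steps while the continuous one takes 10: the short requests slip into the slot freed next to the long one instead of waiting for it to finish. When all requests have the same length, the two approaches tie.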

When to self-host, when to use an API

API (Mistral in the EU) for most projects: quick start, latest models, no GPU ops. See our Agence Mistral.
Self-hosted (Ollama/vLLM) when sovereignty is strict, volume is very high, or both.

The choice depends on the confidentiality level you need and on the actual cost at your volume. It’s one of the trade-offs in our AI assistants connected to your data.

Have sovereignty constraints on your AI data? We’ll size the infrastructure with you.
