Hosting an open source LLM: vLLM, Ollama, and sovereignty

When sovereignty or scale demands it, self-hosting an open source model is the way to go. Ollama for getting started, vLLM for production.
Using an LLM via an API (OpenAI, Mistral) is straightforward. But as soon as the conversation turns to strict sovereignty or very high volume, another path opens up: self-hosting an open source model. Two tools stand out, Ollama and vLLM, each suited to a different use case.
Why self-host an LLM
There are three reasons, and projects rarely have just one:
- Sovereignty: your data never leaves your infrastructure. Critical in healthcare, finance, defense, and the public sector.
- Cost: at very high volume, paying per call through an API ends up costing more than running a well-utilized dedicated GPU.
- Control: you pin the model version and avoid depending on a provider that might change its prices or retire models.
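To make the cost argument concrete, here is a minimal break-even sketch. Every figure is an assumption: the API price per million tokens, the monthly rental of a dedicated GPU, and the number of tokens one GPU can serve per month are hypothetical placeholders. The model also ignores engineering and operations time, which real self-hosting does cost.

```python
import math

# All figures used below are hypothetical placeholders, not real prices.


def api_cost(tokens_per_month: float, usd_per_million_tokens: float) -> float:
    """Pay-per-use: the monthly API bill grows linearly with volume."""
    return tokens_per_month / 1_000_000 * usd_per_million_tokens


def self_hosted_cost(tokens_per_month: float, gpu_usd_per_month: float,
                     gpu_tokens_per_month: float) -> float:
    """Fixed cost per GPU; add a GPU each time one can no longer absorb the volume.

    Ignores engineering and operations time.
    """
    gpus = max(1, math.ceil(tokens_per_month / gpu_tokens_per_month))
    return gpus * gpu_usd_per_month


def break_even_tokens(gpu_usd_per_month: float, usd_per_million_tokens: float) -> float:
    """Monthly volume at which one GPU's fixed cost equals the API bill."""
    return gpu_usd_per_month / usd_per_million_tokens * 1_000_000


if __name__ == "__main__":
    PRICE = 2.0      # hypothetical API price, USD per million tokens
    GPU = 1500.0     # hypothetical dedicated GPU rental, USD per month
    CAPACITY = 2e9   # hypothetical tokens one GPU can serve per month
    print(f"break-even: {break_even_tokens(GPU, PRICE):,.0f} tokens/month")
    for volume in (1e8, 1.5e9):
        print(f"{volume:,.0f} tokens/month: API ${api_cost(volume, PRICE):,.0f}, "
              f"self-hosted ${self_hosted_cost(volume, GPU, CAPACITY):,.0f}")
```

With these made-up numbers, 100 million tokens a month favors the API ($200 versus $1,500), while 1.5 billion favors the GPU ($3,000 versus $1,500). The "well-utilized" condition shows up as the capacity parameter: an idle GPU still costs its full monthly price. Replace the placeholders with your provider's actual price and a throughput you have measured.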
Ollama: simple, for getting started and local use
Ollama makes running an open source model (Llama, Mistral, etc.) on a machine effortless. Ideal for prototyping, local use, or moderate volume. Its limitation: it’s not designed to handle thousands of concurrent requests in production.
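Once a model has been pulled (for example with `ollama pull mistral`), Ollama serves it through a local HTTP API, by default on port 11434. The sketch below calls its `/api/generate` endpoint using only the Python standard library. The model name and prompt are placeholders, and error handling is deliberately left out.

```python
import json
import urllib.request

# Ollama's default local endpoint; adjust if the server runs elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str,
                           url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build a POST request for Ollama's generate endpoint."""
    # stream=False asks for a single JSON object instead of a stream of chunks.
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt to the local Ollama server and return the generated text."""
    req = build_generate_request(model, prompt)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        # The non-streaming response carries the generated text in "response".
        return json.loads(resp.read().decode("utf-8"))["response"]
```

With the server running, `generate("mistral", "Summarize data sovereignty in one sentence.")` returns the model's answer. For quick interactive tests, `ollama run mistral` in a terminal does the same without any code.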
vLLM: high-throughput production
vLLM is an inference engine optimized for throughput. On GPUs (for example at Scaleway or OVH), it serves many requests in parallel with controlled latency, thanks to techniques such as continuous batching and PagedAttention. It's the right tool when a self-hosted model has to handle real production load.
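Continuous batching is easiest to see in a toy model. The sketch below counts decode steps (one token per active sequence per step) for the same requests under two policies. With static batching, a batch runs until its longest sequence finishes. With continuous batching, finished sequences leave after every step and waiting requests take their slots. This is a simplified illustration and not vLLM's actual scheduler: it ignores prompt prefill, GPU memory for the KV cache, and request arrival times.

```python
from collections import deque


def static_batching_steps(lengths: list[int], max_batch: int) -> int:
    """Each batch is fixed once started: every slot waits for the longest sequence."""
    steps = 0
    for i in range(0, len(lengths), max_batch):
        steps += max(lengths[i:i + max_batch])
    return steps


def continuous_batching_steps(lengths: list[int], max_batch: int) -> int:
    """Iteration-level scheduling: free slots are refilled after every decode step."""
    queue = deque(lengths)   # tokens still to generate for each waiting request
    active: list[int] = []   # tokens still to generate for each in-flight request
    steps = 0
    while queue or active:
        # Admit waiting requests into any free slot before the next step.
        while queue and len(active) < max_batch:
            active.append(queue.popleft())
        # One decode step yields one token for every active sequence;
        # sequences that reach zero leave the batch immediately.
        active = [remaining - 1 for remaining in active if remaining > 1]
        steps += 1
    return steps
```

With a mix of long and short requests, such as `[100, 5, 5, 5, 100, 5, 5, 5]` and a batch of 4, static batching needs 200 steps and continuous batching needs 105: slots freed by short requests are refilled at once instead of idling until the longest request in the batch finishes. When all requests have the same length, the two policies take the same number of steps.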
When to self-host, when to use an API
- API (Mistral, hosted in the EU) for most projects: quick start, access to the latest models, no GPUs to operate. See our Agence Mistral offering.
- Self-hosted (Ollama/vLLM) when sovereignty is strict, volume is very high, or both.
The choice depends on the required confidentiality level and the actual cost at your volume; it's one of the trade-offs we weigh when building our AI assistants connected to your data.
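The trade-off above can be written as a small rule of thumb. This is a hypothetical helper, not a sizing method. The `Workload` fields, the break-even volume (which you would compute from your own prices) and the concurrency level below which Ollama is considered enough (4 here) are illustrative assumptions to adapt to your context.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    strict_sovereignty: bool       # data must never leave your infrastructure
    tokens_per_month: float        # expected monthly volume
    peak_concurrent_requests: int  # simultaneous requests at peak


def choose_deployment(w: Workload, break_even_tokens: float,
                      ollama_max_concurrency: int = 4) -> str:
    """Hypothetical rule of thumb: 'api', 'ollama' or 'vllm'.

    ollama_max_concurrency is an illustrative threshold, not a measured limit.
    """
    self_host = w.strict_sovereignty or w.tokens_per_month >= break_even_tokens
    if not self_host:
        return "api"  # e.g. an EU-hosted API such as Mistral's
    light_load = (w.peak_concurrent_requests <= ollama_max_concurrency
                  and w.tokens_per_month < break_even_tokens)
    # Sovereignty alone with light load: Ollama; real production load: vLLM.
    return "ollama" if light_load else "vllm"
```

For example, a sovereign internal tool with two concurrent users lands on Ollama, while the same constraint with fifty concurrent users, or a non-sovereign workload past break-even, lands on vLLM.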
Do you have sovereignty constraints on your AI data? We'll size the infrastructure with you.


