vLLM vs Ollama vs llama.cpp: Picking a Local Inference Engine in 2026

Q: Which quantization should I download?

For most local use, Q4KM is the default sweet spot, roughly 4.5 bits per weight, about 92-95% of full-precision quality, and it fits an 8B model in ~4GB. Step up to Q5KM or Q80 if you have spare VRAM and want a bit more fidelity; drop to Q40 or below only when memory is genuinely tight.

Three engines, three jobs. A practical 2026 guide to choosing between Ollama, vLLM, and llama.cpp based on concurrency, hardware, and scale.

Sam CarterJun 25, 2026 11 min read

Cover image for vLLM vs Ollama vs llama.cpp: Picking a Local Inference Engine in 2026 — Photo: Si-MOCs / flickr (BY-NC-SA 2.0)

Running open models yourself in 2026 means choosing an inference engine, and the three serious options, Ollama, vLLM, and llama.cpp, are not really competitors. They are tools built for different jobs. Pick wrong and you either fight needless complexity on a laptop or smack into a throughput wall in production. The good news: once you frame the choice around concurrency and hardware rather than raw single-stream speed, the answer is usually obvious. Here is how to decide.

Quick answer

Choose by concurrency and hardware, not single-stream speed. Use Ollama when one developer wants a model running in five minutes. Use vLLM for production GPU serving, where PagedAttention and continuous batching deliver many times Ollama's throughput once ten or more users hit it at once. Use llama.cpp when there is no discrete GPU or you need the lowest-level control, it is the CPU/edge specialist and the upstream project Ollama is built on. All three speak an OpenAI-compatible API, so moving a prototype to production is mostly a base-URL change.

Key takeaways

Ollama wins for a single developer who wants a model running in under five minutes, but it has no continuous batching, so it falls apart under concurrent load.
vLLM is the production answer for GPU serving: PagedAttention plus continuous batching keep the GPU at 85-92% utilization and can deliver 10x or more throughput over Ollama once ten-plus users hit it at once.
llama.cpp is the CPU-and-edge specialist and the lowest-level option, the most knobs, the most portability, the foundation both Ollama and many others are built on.
Single-user speed differences are modest (roughly 62-71 tok/s for Llama 3.1 8B across all three on an RTX 4090); most of that gap is quantization format, not the engine.
All three speak an OpenAI-compatible API, so prototyping on one and shipping on another is mostly a base-URL change.

The three engines at a glance

Before the detail, here is the whole decision in one table:

Engine	Best for	Hardware	Concurrency	Setup
Ollama	Single developer, prototypes	Any (GPU helps)	Poor (no continuous batching)	One command
vLLM	Production serving	NVIDIA/AMD GPU, Linux	Excellent (PagedAttention)	More config
llama.cpp	CPU and edge, full control	CPU or GPU, very portable	Manual	Build it yourself

The numbers back this up: single-stream speed is close across all three (~62-71 tok/s for Llama 3.1 8B on an RTX 4090), but under concurrent load vLLM pulls far ahead. The rest of this guide explains why.

Ollama: simplicity wins for single users

Ollama is a developer-friendly wrapper. Its Go-based server sits on top of an inference backend derived from llama.cpp, and the whole thing installs with one command. A built-in model registry handles downloads, and its OpenAI-compatible API drops straight into any LLM client or agent framework. Pulling and running a model is genuinely a two-line affair:

ollama pull llama3.1:8b
ollama run llama3.1:8b "Summarize this changelog in three bullets."

Behind the scenes ollama pull fetches a GGUF file, and the default library models ship pre-quantized at Q4_K_M, a 4-bit format that fits an 8B model in roughly 4GB while keeping around 92-95% of full-precision quality. On machines with more VRAM, Ollama may select a higher-precision variant like Q5_K_M automatically.

For a single user, that is the right trade: good single-stream performance, low setup friction, sensible resource use, and zero config. If you are building local agents or tooling, Ollama pairs nicely with structured generation, see our guide on Ollama structured outputs for forcing valid JSON out of these models.

Where Ollama struggles

Concurrency is the limit. Ollama processes requests largely one at a time with no continuous batching, so the moment you put a fleet of agents or a shared API in front of it, latency balloons. In 2026 benchmarks at 50 concurrent users, Ollama's p99 latency climbed to around 24.7 seconds while vLLM stayed under 3 seconds. In a simultaneous-fire stress test, one benchmark recorded vLLM at roughly 793 aggregate tok/s against Ollama's ~41 tok/s. Ollama is built for one person at a keyboard, not a load-bearing endpoint.

Rows of GPU servers in a data center, illustrating production-scale model serving — Photo: Takuya Oikawa / flickr (BY-SA 2.0)

vLLM: throughput for production

vLLM is a Python-based, GPU-first serving engine designed for high-throughput production workloads. Its defining feature remains PagedAttention, which manages the KV cache through a paging scheme borrowed from operating-system virtual memory. Instead of reserving one large contiguous block per request, it allocates fixed-size pages on demand. Traditional serving can waste up to ~87% of KV-cache memory to fragmentation and over-reservation; PagedAttention cuts that waste to near zero, which lets vLLM pack far more concurrent sequences onto the same card.

The second half of the story is continuous batching. Static batching waits for every request in a batch to finish before starting the next; continuous batching slots a new sequence into the batch the instant an old one completes. Together with chunked prefill, these techniques keep an H100 busy enough to serve 3-5x the traffic of a naive PyTorch loop, and 14-24x the throughput of raw HuggingFace Transformers on identical hardware.

Starting an OpenAI-compatible server is a single command:

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --dtype auto \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90

The numbers under load are stark. At 50 concurrent users vLLM delivered roughly 6x Ollama's total throughput; against llama.cpp at peak load it pushed more than 35x the requests per second and more than 44x the output tokens per second.

Note

vLLM also speaks the OpenAI-compatible API, so migrating from an Ollama prototype to a vLLM production endpoint is mostly a base-URL change, not a rewrite. Prototype on Ollama, ship on vLLM.

The cost of vLLM is operational. It expects an NVIDIA (or AMD ROCm) GPU, a Linux environment, and more configuration than Ollama. That is a fair price once real concurrent traffic is on the line, and overkill before then. If your strategy is fewer, smaller models doing more work, this efficiency mindset connects directly to the tokenmaxxing shift.

llama.cpp: the CPU and edge specialist

Neither Ollama nor vLLM is a practical choice when there is no discrete GPU. llama.cpp is the only one of the three that treats CPU inference as a first-class workload, using AVX2 and AVX-512 on x86 or NEON on ARM to squeeze usable speed from a CPU, and it remains the upstream project that Ollama and countless other tools depend on.

It is also the lowest-level option, which means the most knobs and the most manual work. You manage quantization, threading, and context settings yourself. Building and running looks like this:

# Build with CMake
cmake -B build
cmake --build build --config Release -j

# Run an interactive session against a GGUF model
./build/bin/llama-cli \
  -m models/llama-3.1-8b-instruct-Q4_K_M.gguf \
  -p "Explain PagedAttention in one paragraph." \
  -t 8 -c 8192

That Q4_K_M suffix is doing real work. The "K" means K-quants (mixed precision), and "M" means medium: most weights sit at 4-bit, but the most sensitive layers are kept at 6-bit, landing around 4.5 effective bits per weight. It is the format behind well over 70% of local model downloads, and you produce it yourself with llama-quantize, converting an FP16 or BF16 GGUF into Q4_K_M, Q5_K_M, or Q8_0 depending on your size-versus-quality budget.

The reward is portability, llama.cpp runs almost anywhere, including edge hardware, Raspberry Pis, and laptops with no GPU at all. That makes it the natural backbone for small language models running on-device. Ironically, since Ollama is built on llama.cpp, reaching for llama.cpp directly is what you do once you have outgrown Ollama's defaults but cannot or will not move to a GPU server.

A decision checklist

Ask these questions in order:

Do you have a discrete GPU? No → llama.cpp.
Will more than a few users hit it at once? Yes → vLLM.
Is this a prototype or a single developer's tool? Yes → Ollama.
Have you outgrown Ollama's defaults but stuck on CPU? → llama.cpp direct.

Tip

Resist running production traffic through Ollama just because the prototype already uses it. The concurrency gap is too large to ignore once you have real users. The good news is the OpenAI-compatible API means the switch is cheap.

A note on benchmarks: single-stream numbers flatter Ollama and llama.cpp. On an RTX 4090 with Llama 3.1 8B, all three land within a few tokens per second of each other (~62 for Ollama, ~65 for llama.cpp, ~71 for vLLM at FP16), and most of that spread is the quantization format, not the engine. The gulf only opens at concurrency, which is exactly where toy benchmarks rarely look. If you are sizing a serving tier behind a retrieval or memory layer, the same concurrency math applies to your vector store; compare options in pgvector vs Qdrant and think about how state is managed in agent memory. For the quantization formats these engines load, our guide to LLM quantization (GGUF, AWQ, FP8) goes deeper.

What to do right now

Pick an engine without overthinking it:

No discrete GPU? Use llama.cpp (or Ollama for convenience on top of it).
One developer or a prototype? Install Ollama and ollama pull a Q4_K_M model.
More than a few simultaneous users? Stand up vLLM on a GPU and serve with PagedAttention plus continuous batching.
Prototyped on Ollama, now scaling? Keep the OpenAI-compatible client and just switch the base URL to vLLM.
Downloading a model? Start with Q4_K_M; step up to Q5_K_M or Q8_0 only if you have spare VRAM.

Frequently asked questions

Is vLLM always faster than Ollama?

No. For a single request, the two are within a few tokens per second of each other, and the difference is mostly the quantization format. vLLM's advantage is throughput under concurrency, once you have many simultaneous requests, continuous batching and PagedAttention let it serve 6x to 10x more total tokens per second on the same GPU.

Can I run vLLM on a CPU or a Mac?

vLLM is GPU-first and expects NVIDIA or AMD hardware on Linux for production use. For CPU-only or Apple Silicon, llama.cpp (and its Metal backend) or Ollama are the practical choices; there are MLX-based forks aimed at Macs, but they are a separate track from mainline vLLM.

What is the difference between Ollama and llama.cpp if Ollama is built on it?

Ollama wraps a llama.cpp-derived backend with a model registry, automatic VRAM management, an OpenAI-compatible server, and one-command installation. llama.cpp gives you those internals raw: you manage quantization, threads, and context yourself. Use Ollama for convenience; drop to llama.cpp when you need control Ollama abstracts away.

Which quantization should I download?

For most local use, Q4_K_M is the default sweet spot, roughly 4.5 bits per weight, about 92-95% of full-precision quality, and it fits an 8B model in ~4GB. Step up to Q5_K_M or Q8_0 if you have spare VRAM and want a bit more fidelity; drop to Q4_0 or below only when memory is genuinely tight.

The takeaway

There is no single best engine in 2026, there are three good answers to three different questions. Ollama for one developer who wants zero friction, vLLM for production concurrency on a GPU, and llama.cpp when the CPU is all you have. Match the engine to the workload and you sidestep both wasted complexity and avoidable latency.

#ai#inference#vllm#llama.cpp