Mixture of Experts (MoE) LLMs Explained for 2026

Why MoE dominates 2026 LLMs: how active vs total parameters, routing, and top-k experts deliver big-model quality at small-model speed.

Sam Carter | Jun 14, 2026 | 8 min read

Cover image for Mixture of Experts (MoE) LLMs Explained for 2026 — Photo: Breno Peck / flickr (BY-NC-SA 2.0)

If you have looked at the spec sheet for a 2026 frontier open-weight model and seen something like "671 billion parameters, 37 billion active," you have met a Mixture of Experts.

Quick answer

A Mixture of Experts (MoE) model replaces the feed-forward layer in each transformer block with many specialized "expert" networks plus a router that activates only a few per token. That splits total parameters (the model's full capacity, all loaded in memory) from active parameters (the slice that fires per token), so you get big-model quality at small-model compute cost. DeepSeek-V3 holds 671B total but activates only about 37B per token. The catch: you still need memory for every expert, so MoE saves compute, not VRAM.

MoE is the architecture behind most of the strongest models shipping today: DeepSeek, Qwen, and several Mistral variants all use it. The idea is simple to state and powerful in practice: hold an enormous amount of knowledge in the model, but only switch on the small slice of it that each token actually needs.

Key takeaways

A Mixture of Experts replaces each transformer feed-forward layer with many specialized "expert" networks plus a router that picks a few per token.
The defining trick is the split between total parameters (the model's full capacity) and active parameters (what fires per token).
A gating network uses top-k routing to send each token to its k best experts; top-2 is common, while DeepSeek-V3 uses top-8 of 256.
The payoff is big-model quality at small-model inference cost; the price is harder deployment and large memory footprints.
MoE is now mainstream, not exotic: most leading 2026 open-weight models rely on it.

The dense-model problem

In a standard dense transformer, every token passes through every parameter. A 70-billion-parameter model uses all 70 billion for the word "the" and for a subtle legal clause alike. That is wasteful and it scales badly: to make the model smarter you add parameters, and every one of them now costs compute on every token. The bill grows faster than the benefit.

MoE breaks the link between how much a model knows and how much it computes per token. You can keep adding capacity without making each token more expensive to process.

Total vs active parameters

This distinction is the whole point, so it is worth being precise. The total parameter count is the model's full size, everything stored on disk and loaded into memory. The active count is how many parameters actually participate in processing a single token.

The classic illustration is Mixtral 8x7B: it holds roughly 47 billion parameters total but, because only two of its eight experts fire per token, it runs inference at about the speed of a 13-billion-parameter model. DeepSeek-R1 takes this further: 671 billion parameters total, but only around 37 billion active per token. You pay storage and memory for the full size, but you pay compute only for the active slice.

Comparing a few well-known models makes the gap concrete:

Model	Total params	Active per token	Routing	Practical home
Mixtral 8x7B	~47B	~13B	Top-2 of 8	High-end workstation
DeepSeek-V3 / R1	671B	~37B	Top-8 of 256	Multi-GPU server
Qwen3 MoE variants	30B-235B	3B-22B	Top-k routing	Server to workstation
Dense 70B (for contrast)	70B	70B (all)	None	Single big GPU (with quantization)
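The gap in the table can be turned into rough numbers. The sketch below is a back-of-the-envelope estimate, not a benchmark: the parameter counts are the approximate figures from the table, while the 2 bytes per parameter (16-bit weights), the "about 2 FLOPs per active parameter per token" rule of thumb, and the helper name `summarize` are assumptions made for illustration.

```python
# Rough sizing sketch. Parameter counts are the approximate table figures;
# BYTES_PER_PARAM and FLOPS_PER_PARAM are illustrative assumptions.
BYTES_PER_PARAM = 2   # assumes 16-bit weights
FLOPS_PER_PARAM = 2   # rule of thumb: one multiply and one add per active weight

MODELS = {
    "Mixtral 8x7B": (47e9, 13e9),        # (total params, active params)
    "DeepSeek-V3 / R1": (671e9, 37e9),
    "Dense 70B": (70e9, 70e9),
}


def summarize(total_params, active_params):
    """Return the active fraction, weight memory, and per-token compute."""
    return {
        "active_fraction": active_params / total_params,
        "weights_gb": total_params * BYTES_PER_PARAM / 1e9,
        "gflops_per_token": active_params * FLOPS_PER_PARAM / 1e9,
    }


for name, (total, active) in MODELS.items():
    s = summarize(total, active)
    print(f"{name}: {s['active_fraction']:.1%} active, "
          f"~{s['weights_gb']:.0f} GB of weights, "
          f"~{s['gflops_per_token']:.0f} GFLOPs per token")
```

Under these assumptions, the 671B model does roughly half the per-token compute of the dense 70B model while needing nearly ten times the weight memory, which is the trade the rest of this article is about.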

Glowing branching pathways splitting toward different nodes representing expert routing — Photo: patriziasoliani / flickr (BY-NC 2.0)

How routing works

Each MoE layer contains many expert feed-forward networks and a small gating network, or router. For every token, the router scores the experts and selects the top k with the highest scores. Only those experts run; the rest sit idle for that token.

Top-2 routing (used in Mixtral) is the common balance: two experts per token give diversity without much overhead.
DeepSeek-V3 uses top-8 of 256 experts, a much finer-grained routing that lets experts specialize narrowly.
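The routing step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular model's implementation: the router logits are passed in directly (in a real model they come from a learned projection of the token), the experts are toy functions, and renormalizing the gate weights over the selected k experts is one common choice among several. The names `route_top_k` and `moe_layer` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts; return [(expert_index, weight), ...].

    Weights are renormalized so the selected experts' weights sum to 1.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_layer(token, router_logits, experts, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    out = [0.0] * len(token)
    for idx, weight in route_top_k(router_logits, k):
        expert_out = experts[idx](token)  # unselected experts never execute
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out


# Toy setup: four "experts" that just scale their input by 1, 2, 3, or 4.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 1.0]           # hypothetical router scores for one token
chosen = route_top_k(logits, k=2)        # experts 1 and 3 have the highest scores
output = moe_layer([1.0, 1.0], logits, experts, k=2)
```

Only two of the four toy experts run for this token; the other two contribute no compute, which is where the active-parameter saving comes from.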

Over training, experts drift toward specialties (some lean toward code, some toward particular languages, some toward math), though the specialization is emergent and rarely as clean as the label suggests.

It is tempting to picture the experts as neat departments (a "Python expert," a "French expert," a "calculus expert"), but the reality is messier and more interesting. The router learns whatever partitioning of the data happens to minimize loss, and that partitioning often cuts across human categories. One expert might specialize in a particular grammatical construction that shows up in several languages; another might handle numeric tokens regardless of the surrounding topic. Researchers who probe trained MoE models find some interpretable specialization, but also a great deal that resists tidy explanation. The practical upshot is that you should not over-read the word "expert." It is a routing destination chosen by a learned gating function, not a hand-designed module with a job title. What matters for performance is not what each expert "means" but that the router spreads load well and consistently sends each token somewhere useful.

Warning

Routing has to stay balanced. If the gating network keeps favoring a handful of experts, the rest go undertrained and capacity is wasted. MoE training commonly adds an auxiliary load-balancing loss to spread tokens across experts. Get this wrong and the model effectively shrinks to its few popular experts.
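To make the balancing idea concrete, here is a sketch of one widely used formulation, in the style introduced with Switch Transformer: the number of experts times the sum, over experts, of the fraction of tokens sent to each expert multiplied by the mean router probability for that expert. It is an assumption that any given model uses this exact form; the scaling coefficient that normally multiplies the loss is omitted, routing is simplified to top-1, and the function name is hypothetical.

```python
def load_balance_loss(router_probs, assignments, num_experts):
    """Auxiliary balancing loss sketch: N * sum_i(f_i * P_i).

    router_probs: one list of per-expert probabilities for each token
    assignments:  the expert index each token was sent to (top-1 for simplicity)
    f_i: fraction of tokens dispatched to expert i
    P_i: mean router probability assigned to expert i
    """
    num_tokens = len(assignments)
    f = [0.0] * num_experts
    for expert in assignments:
        f[expert] += 1.0 / num_tokens
    p = [sum(tok[i] for tok in router_probs) / num_tokens
         for i in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))


# Balanced routing: four tokens spread evenly over four experts.
balanced = load_balance_loss(
    router_probs=[[0.25, 0.25, 0.25, 0.25]] * 4,
    assignments=[0, 1, 2, 3],
    num_experts=4,
)

# Collapsed routing: every token goes to expert 0 with high confidence.
collapsed = load_balance_loss(
    router_probs=[[0.97, 0.01, 0.01, 0.01]] * 4,
    assignments=[0, 0, 0, 0],
    num_experts=4,
)
```

The balanced case gives the minimum value of 1.0, while the collapsed case scores far higher, so adding this term to the training loss pushes the router away from favoring a few experts.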

The deployment catch

MoE is not a free lunch. The trade-off lands on memory and infrastructure:

You still load every expert. Active parameters are low, but all 671 billion of DeepSeek-R1's parameters must sit in memory in case the router calls them. MoE saves compute, not VRAM.
Routing adds complexity. Distributing experts across GPUs, keeping the router balanced, and handling uneven token loads make serving harder than a dense model.
Batching is trickier. Different tokens in a batch want different experts, which complicates the neat parallelism dense models enjoy.
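The batching problem is easy to see in a toy example. The sketch below regroups a batch of tokens by the experts they were routed to, which is roughly what a serving stack has to do before each expert can process its share; the routing decisions are made up for illustration and `group_by_expert` is a hypothetical helper.

```python
from collections import defaultdict


def group_by_expert(token_routes):
    """Map each expert index to the positions of the tokens routed to it.

    token_routes: for each token in the batch, the list of experts it selected.
    """
    groups = defaultdict(list)
    for position, experts in enumerate(token_routes):
        for expert in experts:
            groups[expert].append(position)
    return dict(groups)


# Hypothetical top-2 routing decisions for a batch of four tokens.
routes = [[0, 2], [0, 1], [0, 3], [2, 0]]
groups = group_by_expert(routes)
loads = {expert: len(tokens) for expert, tokens in groups.items()}
```

Here expert 0 receives all four tokens while experts 1 and 3 receive one each. When experts live on different GPUs, that kind of skew leaves some devices waiting on others, which is the uneven-load issue described above.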

This is why MoE shines in server deployments with ample memory and struggles on memory-constrained edge devices, where the dense small models discussed in small language models on-device remain the better fit. If you are choosing among open-weight options, the best open-weight LLMs flags which are MoE and what hardware they demand.

Why it won 2026

The math is compelling. MoE lets labs scale total capacity, and therefore capability, while holding per-token inference cost roughly flat. For anyone serving models at scale, that decoupling is the difference between an affordable product and an unaffordable one. Combined with the inference tricks in LLM quantization, MoE is how 2026 models got both bigger and cheaper to run at the same time.

What to check before you pick an MoE model

If you are actually choosing a model to deploy, read the spec sheet with these questions in mind:

Look at active parameters for your compute budget and total parameters for your memory budget; they are different constraints.
Confirm your GPUs (or pooled VRAM) can hold every expert, since all of them load even if only a few fire per token.
Prefer finer-grained routing (top-8 of many) when you want specialization, but expect more serving complexity.
For edge or laptop use, skip large MoE and pick a dense small model instead; the memory math rarely works.
Pair MoE with quantization to fit a bigger model into the memory you have, but verify quality holds at your chosen precision.
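The memory questions in this checklist reduce to simple arithmetic. The sketch below estimates weight memory at different precisions and checks it against a pool of VRAM. It counts weights only and ignores quantization metadata. The 80% headroom (leaving the rest for KV cache and activations), the 8 x 80 GB pool, and the function names are assumptions for illustration, not recommendations.

```python
def weights_gb(total_params, bits_per_param):
    """Approximate weight memory in GB (weights only, no quantization overhead)."""
    return total_params * bits_per_param / 8 / 1e9


def fits(total_params, bits_per_param, vram_gb, headroom=0.8):
    """True if the weights fit in the given share of pooled VRAM.

    headroom is an assumed fraction of VRAM available for weights; the
    remainder is left for KV cache, activations, and runtime overhead.
    """
    return weights_gb(total_params, bits_per_param) <= vram_gb * headroom


TOTAL_PARAMS = 671e9        # every expert loads, so size on total, not active
POOLED_VRAM_GB = 8 * 80     # hypothetical server: eight 80 GB GPUs

for bits in (16, 8, 4):
    size = weights_gb(TOTAL_PARAMS, bits)
    verdict = "fits" if fits(TOTAL_PARAMS, bits, POOLED_VRAM_GB) else "does not fit"
    print(f"{bits}-bit: ~{size:.0f} GB of weights, {verdict} in {POOLED_VRAM_GB} GB")
```

Under these assumptions only the 4-bit variant fits, which is why quantization and MoE are so often paired. The check says nothing about output quality at that precision, so that still has to be verified separately.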

Frequently asked questions

Does MoE make a model faster or smaller?

Faster, not smaller. MoE reduces the compute per token by activating only a few experts, so inference is quicker. But every expert still has to be loaded into memory, so the memory footprint matches the full total-parameter size.

What is the difference between total and active parameters?

Total parameters are the model's entire size, all loaded into memory. Active parameters are the subset that actually process a given token. A model can have 671 billion total but only 37 billion active, giving it large-model knowledge at small-model compute cost.

What is top-k routing?

It is how MoE picks experts. For each token, a gating network scores all experts and selects the k highest-scoring ones to run. Top-2 is a popular balance; some models route to eight or more of hundreds of experts for finer specialization.

Can I run a MoE model on a consumer GPU?

Usually not the large ones. Because all experts must be in memory, a 671-billion-parameter MoE needs server-class memory even though only a fraction is active per token. Quantization helps, but big MoE models are generally server, not laptop, workloads.
