DeepSeek · Released December 26, 2024

DeepSeek-V3

DeepSeek-V3 is a 671-billion-parameter Mixture-of-Experts language model that activates only 37 billion parameters per token. It became notable not just for its benchmark performance, competitive with the strongest closed-weight models on most evaluations, but also for being trained for roughly $5.6M of GPU time (about 2.79 million H800 GPU-hours, per the technical report), an order of magnitude less than comparable Western models.
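
To make the total-versus-active distinction concrete, here is a toy top-k MoE forward pass. It is a minimal sketch, not DeepSeek's actual routing code, and the sizes below are invented (V3 itself routes each token to 8 of 256 experts plus one shared expert):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions; illustrative only, nothing like V3's real sizes.
d_model, d_ff = 64, 256
n_experts, top_k = 16, 2

# One small ReLU FFN per expert.
experts = [
    (rng.standard_normal((d_model, d_ff)) * 0.02,
     rng.standard_normal((d_ff, d_model)) * 0.02)
    for _ in range(n_experts)
]
router = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_forward(x):
    """Run one token through only its top-k experts."""
    scores = x @ router                    # affinity of this token for each expert
    top = np.argsort(scores)[-top_k:]      # indices of the k highest-scoring experts
    gates = np.exp(scores[top])
    gates /= gates.sum()                   # normalize gate weights over the chosen k
    out = np.zeros_like(x)
    for g, i in zip(gates, top):
        w_in, w_out = experts[i]
        out += g * (np.maximum(x @ w_in, 0.0) @ w_out)
    return out

y = moe_forward(rng.standard_normal(d_model))
# Only top_k / n_experts of the expert weights were touched for this token,
# which is how 671B total parameters can mean roughly 37B active per token.
```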

Architecture

The model uses Multi-head Latent Attention (MLA), which compresses keys and values into a small latent vector for efficient KV caching; an auxiliary-loss-free load-balancing scheme that steers expert routing with a per-expert bias term rather than a balancing loss; and Multi-Token Prediction during training. None of these is exotic on its own; the combination, executed carefully, is what produced the cost story.
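
A minimal sketch of that bias mechanism, assuming toy sizes and a fixed update step gamma: the bias shifts the scores used to pick the top-k experts, but the gate weights still come from the unbiased scores, and after each batch the bias is nudged down for overloaded experts and up for underloaded ones.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, top_k, gamma = 8, 2, 0.001   # toy sizes; gamma is the bias update speed

bias = np.zeros(n_experts)  # balancing bias; used for selection only, never for gates

def route(batch_scores):
    """Pick top-k experts per token with biased scores, then nudge the bias.

    batch_scores: (tokens, n_experts) raw affinities from the router.
    """
    global bias
    biased = batch_scores + bias                      # bias steers *which* experts win...
    chosen = np.argsort(biased, axis=1)[:, -top_k:]   # top-k expert indices per token
    # ...but gate weights come from the unbiased scores, as in the report.
    g = np.take_along_axis(batch_scores, chosen, axis=1)
    gates = np.exp(g) / np.exp(g).sum(axis=1, keepdims=True)

    # After the batch: push overloaded experts down, pull underloaded ones up.
    load = np.bincount(chosen.ravel(), minlength=n_experts)
    target = chosen.size / n_experts                  # perfectly even load
    bias -= gamma * np.sign(load - target)
    return chosen, gates

chosen, gates = route(rng.standard_normal((32, n_experts)))
print(np.bincount(chosen.ravel(), minlength=n_experts))  # per-expert token counts
```

Because balance is maintained by this bias rather than an auxiliary loss, the training objective is never distorted to keep experts evenly used.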

What it's good at

Code generation and math are particularly strong, surpassing earlier open-weight models by a wide margin. English and Chinese instruction-following are both solid. Multi-turn dialogue holds up well through long contexts.

Running it locally

Full FP8 weights are roughly 670 GB, so a single 8× H100 80 GB node (640 GB total) cannot hold them; in practice that means 8× H200 141 GB, 8× MI300X 192 GB, or a multi-node H100 setup. For most enthusiasts, the practical path is Unsloth's dynamic-quant GGUFs (down to ~140 GB at 1.58-bit) on a multi-GPU rig, or simply using one of the hosted inference providers.
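
As a starting point, the quant shards can be fetched with huggingface_hub's snapshot_download. The repo id and file pattern below are assumptions, so check Unsloth's Hugging Face page for the exact names:

```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/DeepSeek-V3-GGUF",   # assumed repo name; verify before use
    allow_patterns=["*UD-IQ1_S*"],        # assumed pattern for the 1.58-bit dynamic quant
    local_dir="DeepSeek-V3-GGUF",
)
# The downloaded .gguf shards can then be served with llama.cpp (or a binding
# such as llama-cpp-python), offloading as many layers to GPU as VRAM allows.
```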

License

Released under the MIT license — among the most permissive options in the catalog. Commercial use, redistribution, and fine-tuning are all unrestricted.