Frequently asked questions

Which model should I start with?

For a single 24 GB consumer GPU, start with Qwen2.5 14B-Instruct at Q4 quantization — it's the size-vs-capability sweet spot for that hardware. For 48 GB, jump to Gemma 2 27B or Yi-1.5 34B. For a workstation with 80 GB+, Llama 3.3 70B at Q4 is the most popular default.

What's the difference between FP16, GGUF, AWQ, and EXL2?

FP16 (or BF16) is the original published precision: largest file size, highest quality. GGUF is the format used by llama.cpp, with optional integer quantization (Q4_K_M, Q5_K_M, Q8_0, etc.) for smaller files at some quality cost. AWQ and GPTQ are GPU-only quantization formats that use calibration data to preserve quality better at low bit-depths. EXL2 is ExLlamaV2's variable-bit-rate format, optimized for fast GPU inference.

For most users: run GGUF Q4_K_M on llama.cpp or Ollama. It strikes the right balance between size and quality, and it has the broadest tooling support.
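
To make that concrete, here is a minimal sketch using the llama-cpp-python bindings; the GGUF path, context size, and prompt are placeholders for whatever model you actually downloaded.

```python
# Minimal sketch: load a local Q4_K_M GGUF with llama-cpp-python and run one
# chat completion. The file path and settings below are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="models/your-model-q4_k_m.gguf",  # any local GGUF file
    n_gpu_layers=-1,   # offload all layers to the GPU if they fit
    n_ctx=8192,        # context window to allocate KV cache for
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize GGUF in one sentence."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```

Ollama wraps the same llama.cpp runtime behind a simpler pull-and-run workflow, so either route ends up executing the same GGUF file.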

Can I use these models commercially?

Mostly yes — but check each license. Apache 2.0 and MIT are fully permissive. Llama, Gemma, Qwen, and GLM licenses are commercial-friendly with specific clauses you should read. Command R+ is non-commercial only. See our license summary for a side-by-side.

How much VRAM do I need?

Rough rule of thumb at Q4_K_M GGUF: ~0.6 GB per billion parameters, plus 1–4 GB for KV cache depending on context length. A 14B model needs ~10 GB at Q4 for short context, ~14 GB if you want the full context window comfortably. Always leave 2 GB headroom for system use.
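
As a worked version of that arithmetic, here is a tiny helper; the KV-cache and headroom figures are the rough assumptions from the rule of thumb above, not measured values.

```python
# Back-of-the-envelope VRAM estimate for a Q4_K_M GGUF, using the rule of
# thumb above: ~0.6 GB per billion parameters, plus KV cache and headroom.
def estimate_vram_gb(params_billions: float,
                     kv_cache_gb: float = 2.0,   # rough assumption; grows with context
                     headroom_gb: float = 2.0) -> float:
    weights_gb = 0.6 * params_billions
    return weights_gb + kv_cache_gb + headroom_gb

# 14B at Q4_K_M: ~8.4 GB weights + ~2 GB KV cache + 2 GB headroom
print(f"{estimate_vram_gb(14):.1f} GB")   # ~12.4 GB
```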

Why is my model slower than the benchmarks suggest?

Common causes: (1) you're CPU-bottlenecked because layers are spilling to system RAM, (2) your batch size is 1 when the benchmark assumed a higher one, (3) you're running an older driver or inference framework, (4) the published number was measured with speculative decoding enabled. Check VRAM utilization first.
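
One quick way to do that check from Python, assuming an NVIDIA GPU and the nvidia-ml-py package (imported as pynvml):

```python
# Report VRAM usage on the first GPU. If usage is pinned near the card's limit,
# part of the model likely did not fit and is running from system RAM; if usage
# is unexpectedly low, GPU offload is probably not configured at all.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)   # first GPU
info = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"VRAM used: {info.used / 2**30:.1f} / {info.total / 2**30:.1f} GiB")
pynvml.nvmlShutdown()
```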

How do I actually verify a model download?

Every Hugging Face model repository publishes SHA-256 hashes for each file. The huggingface-cli tool verifies them automatically. If you're scripting a deployment, pin to a specific revision (a commit hash, not main) so you can detect upstream changes.
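
A pinned download with the huggingface_hub Python library looks roughly like this; the repository name and commit hash are illustrative placeholders, not real values.

```python
# Sketch: download a snapshot pinned to a specific commit rather than "main",
# so any upstream change to the repository fails loudly instead of silently.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="some-org/some-model-GGUF",                    # placeholder repo
    revision="0123456789abcdef0123456789abcdef01234567",   # placeholder commit hash
)
print(local_dir)
```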

Are these benchmarks trustworthy?

Treat published benchmarks as a ceiling, not a floor: vendors don't publish numbers where they look bad. For your specific use case, run the model on 20–50 realistic examples. That's almost always more informative than any leaderboard.
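
A minimal harness for that kind of spot check might look like the sketch below; the examples.jsonl file and the generate() stub are assumptions you would replace with your own data and inference call.

```python
# Tiny task-specific eval: run your own examples through your model and count
# crude keyword matches. Replace generate() with a call into your local model
# (llama-cpp-python, an Ollama endpoint, etc.) and examples.jsonl with real data.
import json

def generate(prompt: str) -> str:
    return ""   # placeholder: wire this to your local model

with open("examples.jsonl", encoding="utf-8") as f:
    examples = [json.loads(line) for line in f]

hits = 0
for ex in examples:
    answer = generate(ex["prompt"])
    ok = ex["expected"].lower() in answer.lower()   # crude containment check
    hits += ok
    print(f"{'PASS' if ok else 'FAIL'}: {ex['prompt'][:60]}")

print(f"{hits}/{len(examples)} passed")
```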

Why don't you list GPT-4, Claude, or Gemini?

This catalog is open-weight only — models you can download and run yourself. Closed commercial models are well-documented on their vendors' own sites.

How fresh is this list?

Each entry shows a release date in its detail page. We review the catalog every six months and retire entries that have been superseded by strictly better alternatives. The list is intentionally curated and small.