The Local LLM Slowness Trap: Why 8GB VRAM isn’t Enough for Claude Code
There is a common misconception in the local LLM space: “If it runs in the terminal, it works for development.”
I recently put this to the test on my RTX 3060 (8GB). I successfully installed Gemma 4:E4B via Ollama. In a standalone console chat, it was responsive and usable. But the moment I tried to bridge it into Claude Code for actual AI-assisted development, the experience fell apart.
It didn’t crash; it just became painfully, impossibly slow.
The Reality Check Table
Before you spend hours on setup, here is how the performance actually scales on mid-range hardware:
| Setup | Hardware | Experience |
|---|---|---|
| Gemma 4 (Console Chat) | RTX 3060 8GB | Passable. Snappy responses for short, stateless queries. |
| Gemma 4 + Claude Code | RTX 3060 8GB | Unusable. Heavy lag (about 0.2 tok/s) once the model and its context no longer fit in VRAM and spill into system RAM. |
| Qwen 2.5 3B (Agentic) | RTX 3060 8GB | Sweet Spot. Fast enough for real-time multi-file refactoring. |
The “Terminal vs. Agent” Performance Gap
Why does a model that feels “fine” in a chat console suddenly lag when you’re coding?
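1. Context Inflation
In a standard Ollama chat, you send short strings of text. In a coding-agent scenario (like Claude Code), the agent often injects hundreds of lines of code into every prompt to provide “context,” on top of its own instructions and tool definitions. The model has to process all of that before it produces a single token of output.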
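To get a feel for the size difference, here is a minimal sketch that compares a short chat question with a prompt carrying a few hundred lines of code. It assumes the common rule of thumb of roughly 4 characters per token, which is only an approximation (real tokenizers vary by model), and the 300-line "file" is made up for illustration.

```python
# Rough prompt-size comparison: a chat question vs. an agent prompt carrying code.
# ASSUMPTION: ~4 characters per token. This is a rule of thumb, not a real tokenizer.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Approximate the token count of a string from its character length."""
    return max(1, len(text) // CHARS_PER_TOKEN)


chat_prompt = "How do I reverse a list in Python?"

# HYPOTHETICAL agent prompt: one instruction plus a 300-line source file.
fake_source_line = "    result = compute_total(order.items, tax_rate=0.07)\n"
agent_prompt = "Refactor the following file:\n" + fake_source_line * 300

print("chat prompt tokens (approx): ", estimate_tokens(chat_prompt))
print("agent prompt tokens (approx):", estimate_tokens(agent_prompt))
```

The exact numbers matter less than the ratio: the agent prompt is hundreds of times larger than the chat question, and the model pays for it on every request.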
2. VRAM Saturation and Offloading
With only 8GB of VRAM, loading Gemma 4 (8B) at 4-bit quantization already takes about 5.5GB. The context is not free either: every token in the prompt adds to the model's key-value cache, which also has to live in VRAM. Once you add a few thousand tokens of code context, your GPU’s memory hits 100%.
Instead of failing, the system starts offloading part of the work to system RAM and the CPU. Because system RAM is significantly slower than VRAM, your tokens-per-second rate drops from a smooth 40+ to a crawl of 0.2 to 0.5.
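The arithmetic behind that saturation can be sketched as follows. The formula for key-value cache size is the standard one for transformer models, but the architecture numbers (layer count, KV heads, head dimension) and the 1 GiB runtime overhead are hypothetical placeholders, not Gemma's actual configuration. Only the 5.5GB weight figure comes from the measurement above.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Size in bytes of the key-value cache for n_tokens of context.

    The leading 2 accounts for storing both keys and values in every layer.
    bytes_per_value=2 assumes a 16-bit cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens


# HYPOTHETICAL architecture, chosen only to make the arithmetic concrete.
ARCH = {"n_layers": 32, "n_kv_heads": 8, "head_dim": 128}

VRAM_GIB = 8.0
WEIGHTS_GIB = 5.5   # figure quoted in this post, treated as GiB for simplicity
OVERHEAD_GIB = 1.0  # ASSUMPTION: compute buffers, desktop, other GPU users

for n_tokens in (512, 4096, 8192, 16384):
    kv_gib = kv_cache_bytes(n_tokens, **ARCH) / GIB
    total_gib = WEIGHTS_GIB + OVERHEAD_GIB + kv_gib
    verdict = "fits in VRAM" if total_gib <= VRAM_GIB else "spills to system RAM"
    print(f"{n_tokens:>6} tokens: KV cache {kv_gib:.2f} GiB, "
          f"total {total_gib:.2f} GiB -> {verdict}")
```

With these placeholder values the cache costs 128 KiB per token, so the crossover point depends heavily on the real architecture and on how much overhead the runtime adds. To see what actually happened on your machine, `ollama ps` reports how a loaded model is split between CPU and GPU.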
My New Local LLM Research Checklist
To stop wasting time on setups that don’t scale, I’m moving to a more rigorous testing plan:
- Don’t Test with “Hello”: A simple greeting won’t tell you anything. Test by asking the model to refactor a 200-line file.
- Watch the “tok/s”: If your speed collapses during a multi-file task, you have most likely hit your VRAM ceiling.
- The “Agent Buffer”: Always assume an agentic workflow requires 2GB to 3GB more VRAM than the base model requirements just to handle the context window.
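One way to put the first two checklist items into practice is a small benchmark script. This sketch assumes an Ollama server on its default local port and uses the `eval_count` and `eval_duration` fields (the latter in nanoseconds) that Ollama's `/api/generate` endpoint returns when streaming is disabled. The model tag and file name in the usage comment are placeholders.

```python
import json
import urllib.request

# Ollama's default local endpoint; change this if your server runs elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation speed from Ollama's eval_count and eval_duration (nanoseconds)."""
    if eval_duration_ns <= 0:
        return 0.0
    return eval_count / (eval_duration_ns / 1e9)


def benchmark(model: str, prompt: str) -> float:
    """Send one non-streaming request and return the measured generation tok/s."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        OLLAMA_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        stats = json.load(response)
    return tokens_per_second(stats["eval_count"], stats["eval_duration"])


# Example usage (needs a running Ollama server and a pulled model).
# "your-model-tag" and "some_200_line_file.py" are placeholders:
#
#   source = open("some_200_line_file.py").read()
#   print(benchmark("your-model-tag", "Refactor this file:\n" + source))
```

Running it once with a one-line prompt and once with a real source file makes the gap visible before you spend time wiring the model into an agent.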
The Verdict
For those of us on 8GB cards, the reality check is clear: Gemma 4:E4B is great for a side-chat, but for integrated coding, it’s a bottleneck. For my Apex RFQ or Kombucha app builds, I’ll be sticking to smaller, highly efficient models that can stay entirely within the GPU memory.
Have you noticed a lag when moving from terminal chat to agentic tools? Let’s discuss in the comments.