The Local LLM Slowness Trap: Why 8GB VRAM isn't Enough for Claude Code


May 5, 2026


There is a common misconception in the local LLM space: “If it runs in the terminal, it works for development.”

I recently put this to the test on my RTX 3060 (8GB). I successfully installed Gemma 4:E4B via Ollama. In a standalone console chat, it was responsive and usable. But the moment I tried to bridge it into Claude Code for actual AI-assisted development, the experience fell apart.

It didn’t crash; it just became painfully, impossibly slow.

The Reality Check Table

Before you spend hours on setup, here is how the performance actually scales on mid-range hardware:

Setup | Hardware | Experience
Gemma 4 (Console Chat) | RTX 3060 8GB | Passable. Snappy responses for short, stateless queries.
Gemma 4 + Claude Code | RTX 3060 8GB | Unusable. Massive lag (0.2 tok/s) as context offloads to system RAM.
Qwen 2.5 3B (Agentic) | RTX 3060 8GB | Sweet Spot. Fast enough for real-time multi-file refactoring.

The “Terminal vs. Agent” Performance Gap

Why does a model that feels “fine” in a chat console suddenly lag when you’re coding?

1. The Context Inflation

In a standard Ollama chat, you are sending short strings of text. In a coding-agent scenario (like Claude Code), the agent often injects hundreds of lines of code into each prompt to provide “context,” so every request is far larger than a chat message.
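To get a feel for the size of that gap, here is a rough sketch. It assumes a crude rule of thumb of about 4 characters per token, which is not a real tokenizer, and the 300-line "file" is made-up filler rather than anything Claude Code actually sends.

```python
# Rough prompt-size comparison.
# CHARS_PER_TOKEN is a rule-of-thumb assumption, not a real tokenizer;
# actual counts depend on the model's vocabulary.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Approximate the token count of a string from its character length."""
    return max(1, len(text) // CHARS_PER_TOKEN)


# A typical short, stateless chat query.
chat_prompt = "How do I reverse a list in Python?"

# Hypothetical agent prompt: an instruction plus 300 lines of injected code.
injected_code = "\n".join(
    f"    result_{i} = compute(value_{i})" for i in range(300)
)
agent_prompt = "Refactor the following file:\n" + injected_code

chat_tokens = estimate_tokens(chat_prompt)
agent_tokens = estimate_tokens(agent_prompt)

print(f"chat prompt:  ~{chat_tokens} tokens")
print(f"agent prompt: ~{agent_tokens} tokens")
print(f"ratio:        ~{agent_tokens // chat_tokens}x")
```

Even with filler this small, the agent-style prompt comes out orders of magnitude larger than the chat message, and a real agent session keeps adding to it.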

2. VRAM Saturation and Offloading

With only 8GB of VRAM, loading Gemma 4 (8B) at 4-bit quantization already eats up about 5.5GB. Once you add a few thousand tokens of code context, your GPU’s memory hits 100%.

Instead of failing with an out-of-memory error, the system starts offloading whatever no longer fits to your system RAM. Because system RAM has far less bandwidth than VRAM, your tokens-per-second rate drops from a smooth 40+ tok/s to a crawl of 0.2 to 0.5 tok/s.
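The arithmetic behind that saturation can be sketched as follows. The 5.5GB weight figure is the one from my setup above; the layer count, KV-head count, head size, fp16 cache, and 1GB of overhead are placeholder assumptions for illustration, not Gemma's published configuration. Swap in the numbers for whichever model you actually run.

```python
# Back-of-the-envelope VRAM budget: weights + overhead + KV cache.
# For simplicity this treats "GB" as GiB (1024**3 bytes).
BYTES_PER_GB = 1024**3


def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """KV-cache bytes for one token: one key and one value vector per layer.

    bytes_per_value=2 assumes an fp16 cache (an assumption, not a given).
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def kv_cache_gb(n_tokens: int, per_token_bytes: int) -> float:
    """Total KV-cache size in GB for a context of n_tokens."""
    return n_tokens * per_token_bytes / BYTES_PER_GB


def fits_in_vram(vram_gb: float, weights_gb: float, overhead_gb: float,
                 n_tokens: int, per_token_bytes: int) -> bool:
    """True if weights, overhead, and KV cache all fit on the GPU."""
    needed = weights_gb + overhead_gb + kv_cache_gb(n_tokens, per_token_bytes)
    return needed <= vram_gb


# Hypothetical architecture values, for illustration only.
per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)

for context in (512, 8192, 16384):
    ok = fits_in_vram(vram_gb=8.0, weights_gb=5.5, overhead_gb=1.0,
                      n_tokens=context, per_token_bytes=per_token)
    print(f"{context:>6} tokens -> KV cache "
          f"{kv_cache_gb(context, per_token):.2f} GB, fits: {ok}")
```

The exact tipping point depends on the real architecture, the cache precision, and what else is using the GPU, so treat the result as a sanity check rather than a prediction.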

My New Local LLM Research Checklist

To stop wasting time on setups that don’t scale, I’m moving to a more rigorous testing plan:

  • Don’t Test with “Hello”: A simple greeting won’t tell you anything. Test by asking the model to refactor a 200-line file.
  • Watch the “tok/s”: If your speed collapses during a multi-file task, you have most likely hit your VRAM ceiling.
  • The “Agent Buffer”: Always assume an agentic workflow requires 2GB to 3GB more VRAM than the base model requirements just to handle the context window.
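The first two checks can be scripted. This is a minimal sketch that relies on the `eval_count` and `eval_duration` (nanoseconds) fields Ollama returns from a non-streaming `/api/generate` call. The model tag, the default local URL, and the 4x slowdown threshold are my own placeholder assumptions, so adjust them to your setup.

```python
import json
import urllib.request


def tokens_per_second(response: dict) -> float:
    """Generation speed from an Ollama response (eval_duration is in ns)."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def run_prompt(model: str, prompt: str,
               url: str = "http://localhost:11434/api/generate") -> dict:
    """Send one non-streaming generate request to a local Ollama server."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)


def hit_vram_ceiling(short_tps: float, long_tps: float,
                     max_slowdown: float = 4.0) -> bool:
    """Flag a collapse in speed between a short and a long prompt.

    max_slowdown is an arbitrary threshold chosen for illustration.
    """
    return short_tps / long_tps > max_slowdown


# Example usage (requires a running Ollama server; "your-model-tag" and
# "big_file.py" are placeholders):
#
#   short = run_prompt("your-model-tag", "Hello")
#   code = open("big_file.py").read()
#   long_ = run_prompt("your-model-tag", "Refactor this file:\n" + code)
#   print(hit_vram_ceiling(tokens_per_second(short), tokens_per_second(long_)))
```

Comparing a “Hello” run against a 200-line refactor run makes the drop visible before you invest time wiring the model into an agent.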

The Verdict

For those of us on 8GB cards, the reality check is clear: Gemma 4:E4B is great for a side-chat, but for integrated coding, it’s a bottleneck. For my Apex RFQ or Kombucha app builds, I’ll be sticking to smaller, highly efficient models that can stay entirely within the GPU memory.


Have you noticed a lag when moving from terminal chat to agentic tools? Let’s discuss in the comments.
