The Problem
I use Claude as my main AI assistant for writing, scripting, and general IT work. It is genuinely useful, but API credits are not free. When you are doing something repetitive like reviewing drafts, critiquing code, or generating multiple versions of the same thing, those credits add up fast.
I wanted a way to keep Claude for the tasks it is best at while offloading the heavy, repetitive work to something that costs nothing to run.
The answer was running AI models locally on my own machine.
What Is Ollama
Ollama is a tool that lets you download and run AI language models directly on your computer, completely offline, at no cost per use. You install it once, pull whatever models you want, and from that point on every query you send to those models is free.
The tradeoff is that local models are generally not as capable as Claude. They can make mistakes, miss context, or produce weaker results on complex tasks. But for structured, well-defined jobs like generating a first draft or picking apart a piece of writing, they do the job well enough.
The idea is not to replace Claude. It is to use local models for the grunt work and save Claude for the final judgment call.
The Setup
I am running this on a machine with an NVIDIA RTX 3080 (16GB VRAM) and 61GB of RAM. The GPU is what makes local models fast. Without a decent GPU you can still run them on CPU, but expect slower response times.
Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
Verify it installed correctly:
ollama --version
Pull the models
I settled on four models, each with a specific job:
ollama pull llama3.1:8b
ollama pull qwen2.5-coder:14b
ollama pull deepseek-r1:8b
ollama pull gemma3:12b
This will take a few minutes depending on your internet connection. The files range from around 5GB to 9GB each.
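If you want to confirm the pulls completed, you can check them against the output of ollama list, which prints every locally available model and its size on disk. Here is a small, optional Python check; it just looks for the four model names pulled above in that text output.

import subprocess

# Optional sanity check: confirm the four pipeline models are available locally.
# "ollama list" prints the pulled models along with their size on disk.
required = ["llama3.1:8b", "gemma3:12b", "deepseek-r1:8b", "qwen2.5-coder:14b"]
installed = subprocess.run(["ollama", "list"], capture_output=True, text=True).stdout

for model in required:
    print(f"{model}: {'ok' if model in installed else 'MISSING'}")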
Why These Four Models
Each model has a different strength, which is why the pipeline uses all four rather than just picking one.
llama3.1:8b is Meta’s Llama 3.1 at 8 billion parameters. It is fast and solid for general tasks. It handles the first draft.
gemma3:12b is Google’s Gemma 3 at 12 billion parameters. It approaches problems differently than Llama, which is the point. Having two models generate independent responses means you get two different perspectives on the same question.
deepseek-r1:8b is a reasoning model. Unlike the others, it thinks through problems step by step before answering. This makes it well suited for critiquing the other two responses. It will catch things a straightforward model skips over.
qwen2.5-coder:14b is Alibaba’s code-specialized model at 14 billion parameters. It handles the final refinement step. Because it is the largest and most specialized model in the group, it produces the most polished final output.
How the Pipeline Works
Instead of sending a single prompt to a single model, the pipeline runs the query through all four models in a specific sequence.
Step 1: Generate (parallel)
Llama and Gemma both receive the prompt at the same time and produce independent responses. Running them in parallel saves time.
Step 2: Critique
DeepSeek-R1 receives both responses along with the original prompt. Its job is to reason through what each response got right, what it got wrong, and what an ideal combined answer would look like.
Step 3: Refine
Qwen receives everything: the original prompt, both drafts, and the critique. It produces the final response, incorporating the best parts of both drafts and addressing the issues DeepSeek flagged.
In my experience, the result is consistently better than what any one of these models produces on its own.
The Script
Save this as pipeline.py. It needs only Python 3 and its standard library, with Ollama running in the background.
import argparse
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Model roles: two independent generators, a reasoning critic, and a refiner.
MODELS = {
    "generator_a": "llama3.1:8b",
    "generator_b": "gemma3:12b",
    "critic": "deepseek-r1:8b",
    "refiner": "qwen2.5-coder:14b",
}


def ollama_run(model, prompt):
    """Send a prompt to a local model via the Ollama CLI and return its reply."""
    result = subprocess.run(
        ["ollama", "run", model],
        input=prompt,
        capture_output=True,
        text=True,
        timeout=300,
    )
    if result.returncode != 0:
        raise RuntimeError(f"{model} failed: {result.stderr}")
    return result.stdout.strip()


def run_pipeline(prompt):
    # Step 1: both generators draft answers in parallel.
    print("\n[1/3] Generating drafts in parallel...")
    with ThreadPoolExecutor(max_workers=2) as executor:
        fa = executor.submit(ollama_run, MODELS["generator_a"], prompt)
        fb = executor.submit(ollama_run, MODELS["generator_b"], prompt)
        draft_a, draft_b = fa.result(), fb.result()

    # Step 2: the reasoning model critiques both drafts.
    print("[2/3] Critiquing both drafts...")
    critic_prompt = f"""You are a rigorous critic. Analyze both responses to this prompt.
ORIGINAL PROMPT:
{prompt}
RESPONSE A:
{draft_a}
RESPONSE B:
{draft_b}
Evaluate accuracy, completeness, and clarity. Summarize what the ideal response should contain."""
    criticism = ollama_run(MODELS["critic"], critic_prompt)

    # Step 3: the refiner merges the drafts and addresses the critique.
    print("[3/3] Refining final response...")
    refiner_prompt = f"""You are an expert synthesizer. Using the two drafts and the critique below, write the best possible final response.
ORIGINAL PROMPT:
{prompt}
DRAFT A:
{draft_a}
DRAFT B:
{draft_b}
CRITIQUE:
{criticism}
Write the final improved response."""
    final = ollama_run(MODELS["refiner"], refiner_prompt)

    print("\n--- Final Response ---")
    print(final)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("prompt", nargs="?")
    args = parser.parse_args()
    if not args.prompt:
        print("Enter your prompt (Ctrl+D when done):")
        args.prompt = sys.stdin.read().strip()
    run_pipeline(args.prompt)
Run it
python3 pipeline.py "explain how DNS works"
Or for longer prompts:
python3 pipeline.py "write a PowerShell script that checks disk usage on all drives and emails an alert if any drive is above 80 percent"
When to Use This vs Claude
This pipeline is not meant to replace Claude for everything. Here is how I actually split the work:
| Task | Use |
|---|---|
| Writing first drafts | Pipeline |
| Reviewing and critiquing documents | Pipeline |
| Code generation with multiple approaches | Pipeline |
| Complex reasoning or planning | Claude |
| Anything where accuracy really matters | Claude |
| Final polish on important work | Claude |
The pipeline handles volume. Claude handles quality-critical work.
Worth Knowing
The pipeline takes longer than a single model query. Expect anywhere from two to four minutes per run depending on your hardware, since it runs four models across three stages, with only the first stage running two models in parallel.
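If you want to see where the time goes on your own hardware, a small timing wrapper around ollama_run shows how long each model call takes. This is an optional addition meant to be dropped into pipeline.py, not part of the script above.

import time

# Optional: wraps ollama_run (from pipeline.py) to log how long each model call takes.
# Swap the pipeline's ollama_run calls for this one to get per-stage timings.
def timed_ollama_run(model, prompt):
    start = time.perf_counter()
    output = ollama_run(model, prompt)
    print(f"    {model} took {time.perf_counter() - start:.1f}s")
    return output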
Each model file sits between 5GB and 9GB on disk, so the full set takes around 28GB of storage. Make sure you have space before pulling all four.
The models run entirely on your machine. Nothing is sent to an external server. For work involving internal documentation or anything sensitive, this matters.
If you only have a CPU and no GPU, the pipeline will still work but will be significantly slower. A single query could take 10 to 15 minutes. It is still free, just slower.