The Problem
I use Claude as my main AI assistant for writing, scripting, and general IT work. It is genuinely useful, but API credits are not free. When you are doing something repetitive like reviewing drafts, critiquing code, or generating multiple versions of the same thing, those credits add up fast.
I wanted a way to keep Claude for the tasks it is best at while offloading the heavy, repetitive work to something that costs nothing to run.
The answer was running AI models locally on my own machine.
What Is Ollama
Ollama is a tool that lets you download and run AI language models directly on your computer, completely offline, at no cost per use. You install it once, pull whatever models you want, and from that point on every query you send to those models is free.
The tradeoff is that local models are generally not as capable as Claude. They can make mistakes, miss context, or produce weaker results on complex tasks. But for structured, well-defined jobs like generating a first draft or picking apart a piece of writing, they do the job well enough.
The idea is not to replace Claude. It is to use local models for the grunt work and save Claude for the final judgment call.
The Setup
I am running this on a machine with an NVIDIA RTX 3080 (16GB VRAM) and 61GB of RAM. The GPU is what makes local models fast. Without a decent GPU you can still run them on CPU, but expect slower response times.
Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
Verify it installed correctly:
ollama --version
Pull the models
I settled on four models, each with a specific job:
ollama pull llama3.1:8b
ollama pull qwen2.5-coder:14b
ollama pull deepseek-r1:8b
ollama pull gemma3:12b
This will take a few minutes depending on your internet connection. The files range from around 5GB to 9GB each.
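If you want to confirm the pulls completed, you can check them against the output of ollama list, which prints every locally available model and its size on disk. Here is a small, optional Python check; it just looks for the four model names pulled above in that text output.

import subprocess

# Optional sanity check: confirm the four pipeline models are available locally.
# "ollama list" prints the pulled models along with their size on disk.
required = ["llama3.1:8b", "gemma3:12b", "deepseek-r1:8b", "qwen2.5-coder:14b"]
installed = subprocess.run(["ollama", "list"], capture_output=True, text=True).stdout

for model in required:
    print(f"{model}: {'ok' if model in installed else 'MISSING'}")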
Why These Four Models
Each model has a different strength, which is why the pipeline uses all four rather than just picking one.
llama3.1:8b is Meta’s Llama 3.1 at 8 billion parameters. It is fast and solid for general tasks. It handles the first draft.
gemma3:12b is Google’s Gemma 3 at 12 billion parameters. It approaches problems differently than Llama, which is the point. Having two models generate independent responses means you get two different perspectives on the same question.
deepseek-r1:8b is a reasoning model. Unlike the others, it thinks through problems step by step before answering. This makes it well suited for critiquing the other two responses. It will catch things a straightforward model skips over.
qwen2.5-coder:14b is Alibaba’s code-specialized model at 14 billion parameters. It handles the final refinement step. Because it is the largest and most specialized model in the group, it produces the most polished final output.
How the Pipeline Works
Instead of sending a single prompt to a single model, the pipeline runs the query through all four models in a specific sequence.
Step 1: Generate (parallel)
Llama and Gemma both receive the prompt at the same time and produce independent responses. Running them in parallel saves time.
Step 2: Critique
DeepSeek-R1 receives both responses along with the original prompt. Its job is to reason through what each response got right, what it got wrong, and what an ideal combined answer would look like.
Step 3: Refine
Qwen receives everything: the original prompt, both drafts, and the critique. It produces the final response, incorporating the best parts of both drafts and addressing the issues DeepSeek flagged.
In my experience, the result is consistently better than what any one of these models produces on its own.
The Script
Save this as pipeline.py. It needs only Python 3 and its standard library, with Ollama running in the background.
import argparse
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Model roles: two independent generators, a reasoning critic, and a refiner.
MODELS = {
    "generator_a": "llama3.1:8b",
    "generator_b": "gemma3:12b",
    "critic": "deepseek-r1:8b",
    "refiner": "qwen2.5-coder:14b",
}


def ollama_run(model, prompt):
    """Send a prompt to a local model via the Ollama CLI and return its reply."""
    result = subprocess.run(
        ["ollama", "run", model],
        input=prompt,
        capture_output=True,
        text=True,
        timeout=300,
    )
    if result.returncode != 0:
        raise RuntimeError(f"{model} failed: {result.stderr}")
    return result.stdout.strip()


def run_pipeline(prompt):
    # Step 1: both generators draft answers in parallel.
    print("\n[1/3] Generating drafts in parallel...")
    with ThreadPoolExecutor(max_workers=2) as executor:
        fa = executor.submit(ollama_run, MODELS["generator_a"], prompt)
        fb = executor.submit(ollama_run, MODELS["generator_b"], prompt)
        draft_a, draft_b = fa.result(), fb.result()

    # Step 2: the reasoning model critiques both drafts.
    print("[2/3] Critiquing both drafts...")
    critic_prompt = f"""You are a rigorous critic. Analyze both responses to this prompt.
ORIGINAL PROMPT:
{prompt}
RESPONSE A:
{draft_a}
RESPONSE B:
{draft_b}
Evaluate accuracy, completeness, and clarity. Summarize what the ideal response should contain."""
    criticism = ollama_run(MODELS["critic"], critic_prompt)

    # Step 3: the refiner merges the drafts and addresses the critique.
    print("[3/3] Refining final response...")
    refiner_prompt = f"""You are an expert synthesizer. Using the two drafts and the critique below, write the best possible final response.
ORIGINAL PROMPT:
{prompt}
DRAFT A:
{draft_a}
DRAFT B:
{draft_b}
CRITIQUE:
{criticism}
Write the final improved response."""
    final = ollama_run(MODELS["refiner"], refiner_prompt)

    print("\n--- Final Response ---")
    print(final)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("prompt", nargs="?")
    args = parser.parse_args()
    if not args.prompt:
        print("Enter your prompt (Ctrl+D when done):")
        args.prompt = sys.stdin.read().strip()
    run_pipeline(args.prompt)
Run it
python3 pipeline.py "explain how DNS works"
Or for longer prompts:
python3 pipeline.py "write a PowerShell script that checks disk usage on all drives and emails an alert if any drive is above 80 percent"
When to Use This vs Claude
This pipeline is not meant to replace Claude for everything. Here is how I actually split the work:
| Task | Use |
|---|---|
| Writing first drafts | Pipeline |
| Reviewing and critiquing documents | Pipeline |
| Code generation with multiple approaches | Pipeline |
| Complex reasoning or planning | Claude |
| Anything where accuracy really matters | Claude |
| Final polish on important work | Claude |
The pipeline handles volume. Claude handles quality-critical work.
Worth Knowing
The pipeline takes longer than a single model query. Expect anywhere from two to four minutes per run depending on your hardware, since it runs four models across three stages, with only the first stage running two models in parallel.
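If you want to see where the time goes on your own hardware, a small timing wrapper around ollama_run shows how long each model call takes. This is an optional addition meant to be dropped into pipeline.py, not part of the script above.

import time

# Optional: wraps ollama_run (from pipeline.py) to log how long each model call takes.
# Swap the pipeline's ollama_run calls for this one to get per-stage timings.
def timed_ollama_run(model, prompt):
    start = time.perf_counter()
    output = ollama_run(model, prompt)
    print(f"    {model} took {time.perf_counter() - start:.1f}s")
    return output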
Each model file sits between 5GB and 9GB on disk, so the full set takes around 28GB of storage. Make sure you have space before pulling all four.
The models run entirely on your machine. Nothing is sent to an external server. For work involving internal documentation or anything sensitive, this matters.
If you only have a CPU and no GPU, the pipeline will still work but will be significantly slower. A single query could take 10 to 15 minutes. It is still free, just slower.