Local LLMs vs Cloud APIs: When Privacy Actually Matters

Every time I paste code into ChatGPT, a small voice in my head asks: "Is this going to end up in someone's training data?"

The answer is probably no (if you've opted out). But "probably" isn't good enough when you're working with client code, proprietary algorithms, or that embarrassing regex you wrote at 2 AM.

So I started running LLMs locally. Here's the honest breakdown of when it makes sense, when it doesn't, and what it actually costs.

The Privacy Argument (And Why Most People Get It Wrong)

The discourse around AI privacy is polarized. One camp says "OpenAI sees everything you type." The other says "Who cares, my code isn't special."

Neither is completely right.

Here's what actually happens with cloud AI services:

OpenAI/Anthropic (as of early 2025):

API calls: Not used for training by default
ChatGPT/Claude chat: May be used for training depending on the provider's current terms and your settings, so check them and opt out where you can
Enterprise tiers: Contractual guarantees against training

The real risks:

Data breaches (they happen to everyone)
Subpoenas and legal requests
Employee access to conversation logs
Policy changes you won't notice

For most personal projects? Probably fine. For client work with NDAs? For healthcare/finance/legal? For your company's secret sauce? That "probably" starts to matter.

What You Can Actually Run Locally

Let me save you some time. Here's what works in 2025:

The Sweet Spot: 7B-13B Parameter Models

These run on consumer hardware and are actually useful:

Llama 3.1 8B - Meta's general-purpose model, great all-rounder
Mistral 7B - Excellent code understanding
CodeLlama 13B - Specifically tuned for programming
Phi-3 Mini - Microsoft's efficient small model (3.8B parameters, smaller than the range above)

bash

# With Ollama (easiest option)
ollama pull llama3.1
ollama run llama3.1 "Explain this regex: ^(?=.*[A-Z])(?=.*[0-9]).{8,}$"

The Reality Check: What You Need

Here's the hardware truth:

Model Size	VRAM Needed	Actual Performance
7B	8GB	Usable, some waiting
13B	16GB	Good, noticeable latency
34B	24GB+	Great, needs beefy GPU
70B	48GB+	You need multiple GPUs

Running on CPU is possible but painful. A 7B model on an M1 Mac is tolerable, since Apple Silicon can run it on the integrated GPU with unified memory. On an Intel machine without a GPU? Make coffee between responses.
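The VRAM column roughly follows a simple rule of thumb: weight memory is parameter count times bytes per parameter, plus some headroom. Here is a rough sketch of that arithmetic. The function name and the flat 20% overhead factor are illustrative assumptions, not measurements.

```typescript
// Rough estimate of the memory needed to hold a model's weights.
// Assumption: a flat 20% overhead for KV cache and runtime buffers (illustrative, not measured).
function estimateMemoryGB(paramsBillions: number, bitsPerWeight: number, overhead = 1.2): number {
  const bytesPerWeight = bitsPerWeight / 8;
  // 1 billion parameters at 1 byte each is about 1 GB (10^9 bytes)
  return paramsBillions * bytesPerWeight * overhead;
}

for (const size of [7, 13, 34, 70]) {
  const q4 = estimateMemoryGB(size, 4).toFixed(1);
  const fp16 = estimateMemoryGB(size, 16).toFixed(1);
  console.log(`${size}B: ~${q4} GB at 4-bit, ~${fp16} GB at 16-bit`);
}
```

Under those assumptions, the table's numbers fit 4-bit quantized models with room to spare. Unquantized 16-bit weights need roughly four times as much.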

The Tools That Actually Work

Ollama (My Daily Driver)

Dead simple. Install, pull model, run. Works on Mac, Linux, Windows.

bash

# Install (Mac)
brew install ollama

# Start server
ollama serve

# In another terminal
ollama run codellama "Write a TypeScript function to debounce any function"

Ollama also exposes an OpenAI-compatible API, so you can swap it into existing tools:

typescript

import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'ollama' // Required but ignored
});

const response = await client.chat.completions.create({
  model: 'llama3.1',
  messages: [{ role: 'user', content: 'Hello!' }]
});
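If you switch between local and cloud often, it can help to keep the endpoint choice in one place. Here is a minimal sketch: `clientConfig` and its types are hypothetical names, and the cloud URL is OpenAI's standard API base.

```typescript
// Hypothetical helper: choose where an OpenAI-compatible client should point.
type Target = 'local' | 'cloud';

interface ClientConfig {
  baseURL: string;
  apiKey: string;
}

function clientConfig(target: Target, cloudApiKey = ''): ClientConfig {
  if (target === 'local') {
    // Ollama's OpenAI-compatible endpoint; the SDK requires a key but Ollama ignores it
    return { baseURL: 'http://localhost:11434/v1', apiKey: 'ollama' };
  }
  if (!cloudApiKey) {
    throw new Error('A cloud API key is required for the cloud target');
  }
  return { baseURL: 'https://api.openai.com/v1', apiKey: cloudApiKey };
}

console.log(clientConfig('local').baseURL);
```

The result can be passed straight to the constructor, e.g. `new OpenAI(clientConfig('local'))`, so the rest of the calling code stays the same.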

llama.cpp (For The Tinkerers)

More control, more complexity. You're compiling from source, quantizing models, tuning parameters.

Worth it if you need:

Maximum performance
Custom quantization
Specific hardware optimization

Not worth it if you just want answers.

LM Studio (GUI Option)

If terminals scare you (no judgment), LM Studio provides a ChatGPT-like interface for local models. Download from their site, point at a model, chat away.

The Cost Analysis Nobody Does

Let's do actual math instead of vibes.

Cloud API Costs

Based on my usage patterns (roughly 50,000 tokens/day):

Service	Model	Daily Cost	Monthly
OpenAI	GPT-4 Turbo	$1.50	$45
Anthropic	Claude 3 Sonnet	$0.90	$27
OpenAI	GPT-3.5 Turbo	$0.10	$3
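The table mixes input and output tokens into one number, so it helps to see what blended per-token price it implies. This sketch only rearranges the figures above; it assumes a 30-day month and uses no published price list.

```typescript
// Daily costs from the table above, at roughly 50,000 tokens/day.
const TOKENS_PER_DAY = 50000;
const cloudRows = [
  { model: 'GPT-4 Turbo', dailyUSD: 1.5 },
  { model: 'Claude 3 Sonnet', dailyUSD: 0.9 },
  { model: 'GPT-3.5 Turbo', dailyUSD: 0.1 },
];

// Blended price implied by a daily cost, in USD per million tokens.
function impliedPricePerMillion(dailyUSD: number, tokensPerDay: number): number {
  return (dailyUSD / tokensPerDay) * 1e6;
}

for (const row of cloudRows) {
  const perMillion = impliedPricePerMillion(row.dailyUSD, TOKENS_PER_DAY);
  const monthly = row.dailyUSD * 30; // assumes a 30-day month
  console.log(`${row.model}: ~$${perMillion.toFixed(2)}/M tokens, ~$${monthly.toFixed(2)}/month`);
}
```

That works out to blended rates of about $30, $18, and $2 per million tokens.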

Local Running Costs

Hardware depreciation (assuming 3-year lifespan):

RTX 4080 ($1,200): ~$1.10/day
Mac M2 Pro ($2,000): ~$1.80/day
Electricity: ~$0.20/day for GPU under load

Daily cost: ~$1.30-2.00
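The daily figures come from straight-line depreciation plus electricity. Here is a small sketch of that arithmetic, using the prices, 3-year lifespan, and electricity estimate from the list above (the function name is illustrative).

```typescript
// Straight-line depreciation over a fixed lifespan (3 years, as assumed in the text).
function dailyDepreciationUSD(priceUSD: number, lifespanYears = 3): number {
  return priceUSD / (lifespanYears * 365);
}

const ELECTRICITY_PER_DAY_USD = 0.2; // estimate from the text, GPU under load

const hardware = [
  { label: 'RTX 4080', priceUSD: 1200 },
  { label: 'Mac M2 Pro', priceUSD: 2000 },
];

for (const hw of hardware) {
  const depreciation = dailyDepreciationUSD(hw.priceUSD);
  const total = depreciation + ELECTRICITY_PER_DAY_USD;
  console.log(`${hw.label}: ~$${depreciation.toFixed(2)}/day hardware, ~$${total.toFixed(2)}/day with electricity`);
}
```

If you already own the hardware, set the price to 0 and only the electricity remains.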

The Surprising Conclusion

Local is more expensive for most usage patterns. But:

You already own the hardware (sunk cost)
No per-token anxiety
Works offline
Unlimited usage at fixed cost

If you're hitting 100k+ tokens daily at GPT-4-class prices, local wins. If you use AI occasionally, cloud is cheaper.
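To check where the crossover sits, divide the fixed local daily cost by the cloud price per token. This is a rough sketch: the per-million rates are the blended prices implied by the cloud table (daily cost divided by 50,000 tokens), not official price lists.

```typescript
// Tokens per day at which a fixed local daily cost equals per-token cloud spend.
function breakEvenTokensPerDay(localDailyUSD: number, cloudUSDPerMillion: number): number {
  return (localDailyUSD / cloudUSDPerMillion) * 1e6;
}

// Blended rates implied by the cloud table, in USD per million tokens.
const impliedRates = [
  { model: 'GPT-4 Turbo', usdPerMillion: 30 },
  { model: 'Claude 3 Sonnet', usdPerMillion: 18 },
  { model: 'GPT-3.5 Turbo', usdPerMillion: 2 },
];

// Low and high ends of the local daily cost range from the text.
const LOCAL_DAILY_RANGE_USD = [1.3, 2.0];

for (const rate of impliedRates) {
  const low = Math.round(breakEvenTokensPerDay(LOCAL_DAILY_RANGE_USD[0], rate.usdPerMillion));
  const high = Math.round(breakEvenTokensPerDay(LOCAL_DAILY_RANGE_USD[1], rate.usdPerMillion));
  console.log(`${rate.model}: local breaks even at roughly ${low} to ${high} tokens/day`);
}
```

Under these assumptions, break-even is roughly 43k to 67k tokens/day against GPT-4 Turbo, 72k to 111k against Claude 3 Sonnet, and 650k to 1M against GPT-3.5 Turbo. So the 100k rule of thumb holds for the expensive models but not the cheap one.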

When Local Actually Makes Sense

1. You're Under NDA

Client code, proprietary algorithms, business logic. Local means no third party ever sees it.

bash

# Review confidential code locally
cat sensitive_algorithm.py | ollama run codellama "Review this for bugs"

2. You Need Offline Access

Flights, spotty internet, coffee shops with terrible WiFi. Local models don't care about your connection.

3. You're Building AI Features

If you're shipping AI features, paying per-token gets expensive fast. Local models for development, cloud for production (or local for production if you've got the infrastructure).

4. You're Learning/Experimenting

No rate limits, no per-token costs, no judgment. Run the same prompt 500 times while you refine your approach.

5. Compliance Requirements

HIPAA, GDPR, SOC 2—sometimes the data simply cannot leave your infrastructure. Local is the only option.
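One cheap safeguard is to check in code that the endpoint you are about to call is on this machine. Here is a minimal sketch. Both function names are hypothetical, and a loopback check is not a compliance control on its own. If "your infrastructure" includes an internal server, you would extend the allowlist.

```typescript
// Hypothetical guard: true only if the URL points at this machine.
function isLoopbackUrl(rawUrl: string): boolean {
  const host = new URL(rawUrl).hostname;
  return host === 'localhost' || host === '127.0.0.1' || host === '[::1]';
}

// Returns the URL unchanged if it is local, otherwise throws before anything is sent.
function assertLocalOnly(baseURL: string): string {
  if (!isLoopbackUrl(baseURL)) {
    throw new Error(`Refusing to send confidential data to non-local endpoint: ${baseURL}`);
  }
  return baseURL;
}

console.log(assertLocalOnly('http://localhost:11434/v1'));
```

Wrapping the base URL in `assertLocalOnly` when constructing a client means a misconfigured environment variable fails loudly instead of quietly sending data out.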

When Cloud Is Just Better

Be honest with yourself:

Quality matters more than privacy: GPT-4 and Claude 3 Opus are still better than local models for complex reasoning
You need it occasionally: $20/month for ChatGPT Plus is fine
You don't want to maintain infrastructure: Cloud just works
You need latest capabilities: Vision, function calling, etc. come to cloud first

My Actual Setup

Here's what I actually use day-to-day:

For sensitive work:

bash

# Ollama running llama3.1 8B (a plain shell function is simpler than an alias wrapping one)
ask() { ollama run llama3.1 "$1"; }

# Usage
ask "Refactor this authentication logic: [paste code]"

For quality-critical work:

Claude API (via Cursor/Continue)
Opted out of training, API-only

For casual exploration:

ChatGPT Plus
Don't paste anything confidential

The Hybrid Approach (What Actually Works)

Here's my recommendation:

Install Ollama with llama3.1 and codellama
Use local for anything with confidential code
Use cloud for general questions, learning, complex reasoning
Route automatically based on content type

You can even automate this:

typescript

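import OpenAI from 'openai';

const local = new OpenAI({ baseURL: 'http://localhost:11434/v1', apiKey: 'ollama' });
const cloud = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Confidential prompts stay on the local model; everything else goes to the cloud.
async function smartComplete(prompt: string, isConfidential: boolean) {
  const client = isConfidential ? local : cloud;
  const model = isConfidential ? 'codellama' : 'gpt-4-turbo';
  const response = await client.chat.completions.create({ model, messages: [{ role: 'user', content: prompt }] });
  return response.choices[0].message.content;
}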
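The `isConfidential` flag still has to come from somewhere. One option is a crude content heuristic that fails closed, sketched below. The patterns and function names are illustrative assumptions, and a real project would tune them to its own code and secrets.

```typescript
type Backend = 'local' | 'cloud';

// Illustrative patterns only, not an exhaustive secret detector.
const CONFIDENTIAL_PATTERNS: RegExp[] = [
  /-----BEGIN [A-Z ]*PRIVATE KEY-----/,
  /\b(api[_-]?key|secret|password|token)\b\s*[:=]/i,
  /\b(confidential|proprietary|internal use only)\b/i,
];

function looksConfidential(prompt: string): boolean {
  return CONFIDENTIAL_PATTERNS.some((pattern) => pattern.test(prompt));
}

// Fail closed: anything flagged by the caller or by the heuristic stays local.
function chooseBackend(prompt: string, markedConfidential = false): Backend {
  return markedConfidential || looksConfidential(prompt) ? 'local' : 'cloud';
}

console.log(chooseBackend("const API_KEY = 'abc123'"));
console.log(chooseBackend('Explain how debounce works'));
```

A heuristic like this will miss things, so treat it as a safety net behind an explicit flag. You could call `smartComplete(prompt, chooseBackend(prompt, userFlag) === 'local')` so that either signal keeps the prompt on your machine.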

Performance Reality Check

Local models are good. They're not GPT-4 good. Here's my honest assessment (subjective ratings from my own use, not benchmark scores):

Task	Local (7B-13B)	Cloud (GPT-4)
Code completion	85%	95%
Explaining code	80%	95%
Complex refactoring	60%	90%
Architecture advice	40%	80%
Bug identification	70%	85%

For "write me a function," local is fine. For "design a system," you want the big models.

The Bottom Line

Running LLMs locally isn't paranoid—it's practical for specific use cases. But it's not a replacement for cloud AI. It's a complement.

Run local when:

Working with confidential code
You need offline access
Building AI-powered features
Experimenting without limits

Use cloud when:

Quality is priority
You need latest features
Infrastructure isn't your thing
Usage is low-to-moderate

The "privacy vs capability" tradeoff is real, but it's not binary. Use both. Be thoughtful about what goes where.

And maybe stop pasting your company's authentication code into ChatGPT at 2 AM. Get Ollama running and sleep better.

---

The best setup is the one that matches your actual needs. For most developers, that's a hybrid—local for sensitive work, cloud for everything else.

Written by Jose Viscasillas

Senior Software Engineer building video platforms at ON24. 21 years of coding experience. I write about React, TypeScript, AI, and developer tools.
