Local LLMs vs Cloud APIs: When Privacy Actually Matters
Every time I paste code into ChatGPT, a small voice in my head asks: "Is this going to end up in someone's training data?"
The answer is probably no (if you've opted out). But "probably" isn't good enough when you're working with client code, proprietary algorithms, or that embarrassing regex you wrote at 2 AM.
So I started running LLMs locally. Here's the honest breakdown of when it makes sense, when it doesn't, and what it actually costs.
The Privacy Argument (And Why Most People Get It Wrong)
The discourse around AI privacy is polarized. One camp says "OpenAI sees everything you type." The other says "Who cares, my code isn't special."
Neither is completely right.
Here's what actually happens with cloud AI services:
OpenAI/Anthropic (as of early 2025):
- API calls: Not used for training by default
- ChatGPT/Claude chat: Used for training unless you opt out
- Enterprise tiers: Contractual guarantees against training
The real risks:
- Data breaches (they happen to everyone)
- Subpoenas and legal requests
- Employee access to conversation logs
- Policy changes you won't notice
For most personal projects? Probably fine. For client work with NDAs? For healthcare/finance/legal? For your company's secret sauce? That "probably" starts to matter.
What You Can Actually Run Locally
Let me save you some time. Here's what works in 2025:
The Sweet Spot: 7B-13B Parameter Models
These run on consumer hardware and are actually useful:
- Llama 3.1 8B - Meta's latest, great all-rounder
- Mistral 7B - Excellent code understanding
- CodeLlama 13B - Specifically tuned for programming
- Phi-3 Mini - Microsoft's efficient small model
# With Ollama (easiest option)
ollama pull llama3.1
ollama run llama3.1 "Explain this regex: ^(?=.*[A-Z])(?=.*[0-9]).{8,}$"The Reality Check: What You Need
Here's the hardware truth:
| Model Size | VRAM Needed | Actual Performance |
|---|---|---|
| 7B | 8GB | Usable, some waiting |
| 13B | 16GB | Good, noticeable latency |
| 34B | 24GB+ | Great, needs beefy GPU |
| 70B | 48GB+ | You need multiple GPUs |
Running on CPU is possible but painful. A 7B model on M1 Mac is tolerable. On Intel without GPU? Make coffee between responses.
The Tools That Actually Work
Ollama (My Daily Driver)
Dead simple. Install, pull model, run. Works on Mac, Linux, Windows.
# Install (Mac)
brew install ollama
# Start server
ollama serve
# In another terminal
ollama run codellama "Write a TypeScript function to debounce any function"The API is OpenAI-compatible, so you can swap it into existing tools:
import OpenAI from 'openai';
const client = new OpenAI({
baseURL: 'http://localhost:11434/v1',
apiKey: 'ollama' // Required but ignored
});
const response = await client.chat.completions.create({
model: 'llama3.1',
messages: [{ role: 'user', content: 'Hello!' }]
});llama.cpp (For The Tinkerers)
More control, more complexity. You're compiling from source, quantizing models, tuning parameters.
Worth it if you need:
- Maximum performance
- Custom quantization
- Specific hardware optimization
Not worth it if you just want answers.
LM Studio (GUI Option)
If terminals scare you (no judgment), LM Studio provides a ChatGPT-like interface for local models. Download from their site, point at a model, chat away.
The Cost Analysis Nobody Does
Let's do actual math instead of vibes.
Cloud API Costs
Based on my usage patterns (roughly 50,000 tokens/day):
| Service | Model | Daily Cost | Monthly |
|---|---|---|---|
| OpenAI | GPT-4 Turbo | $1.50 | $45 |
| Anthropic | Claude 3 Sonnet | $0.90 | $27 |
| OpenAI | GPT-3.5 Turbo | $0.10 | $3 |
Local Running Costs
Hardware depreciation (assuming 3-year lifespan):
- RTX 4080 ($1,200): ~$1.10/day
- Mac M2 Pro ($2,000): ~$1.80/day
- Electricity: ~$0.20/day for GPU under load
Daily cost: ~$1.30-2.00
The Surprising Conclusion
Local is more expensive for most usage patterns. But:
- You already own the hardware (sunk cost)
- No per-token anxiety
- Works offline
- Unlimited usage at fixed cost
If you're hitting 100k+ tokens daily, local wins. If you use AI occasionally, cloud is cheaper.
When Local Actually Makes Sense
1. You're Under NDA
Client code, proprietary algorithms, business logic. Local means no third party ever sees it.
# Review confidential code locally
cat sensitive_algorithm.py | ollama run codellama "Review this for bugs"2. You Need Offline Access
Flights, spotty internet, coffee shops with terrible WiFi. Local models don't care about your connection.
3. You're Building AI Features
If you're shipping AI features, paying per-token gets expensive fast. Local models for development, cloud for production (or local for production if you've got the infrastructure).
4. You're Learning/Experimenting
No rate limits, no costs, no judgment. Run the same prompt 500 times fine-tuning your approach.
5. Compliance Requirements
HIPAA, GDPR, SOC 2—sometimes the data simply cannot leave your infrastructure. Local is the only option.
When Cloud Is Just Better
Be honest with yourself:
- Quality matters more than privacy: GPT-4 and Claude 3 Opus are still better than local models for complex reasoning
- You need it occasionally: $20/month for ChatGPT Plus is fine
- You don't want to maintain infrastructure: Cloud just works
- You need latest capabilities: Vision, function calling, etc. come to cloud first
My Actual Setup
Here's what I actually use day-to-day:
For sensitive work:
# Ollama running llama3.1 8B
alias ask='function _ask(){ echo "$1" | ollama run llama3.1; }; _ask'
# Usage
ask "Refactor this authentication logic: [paste code]"For quality-critical work:
- Claude API (via Cursor/Continue)
- Opted out of training, API-only
For casual exploration:
- ChatGPT Plus
- Don't paste anything confidential
The Hybrid Approach (What Actually Works)
Here's my recommendation:
- Install Ollama with llama3.1 and codellama
- Use local for anything with confidential code
- Use cloud for general questions, learning, complex reasoning
- Route automatically based on content type
You can even automate this:
async function smartComplete(prompt: string, isConfidential: boolean) {
if (isConfidential) {
return await ollamaComplete(prompt, 'codellama');
}
return await openaiComplete(prompt, 'gpt-4-turbo');
}Performance Reality Check
Local models are good. They're not GPT-4 good. Here's my honest assessment:
| Task | Local (7B-13B) | Cloud (GPT-4) |
|---|---|---|
| Code completion | 85% | 95% |
| Explaining code | 80% | 95% |
| Complex refactoring | 60% | 90% |
| Architecture advice | 40% | 80% |
| Bug identification | 70% | 85% |
For "write me a function," local is fine. For "design a system," you want the big models.
The Bottom Line
Running LLMs locally isn't paranoid—it's practical for specific use cases. But it's not a replacement for cloud AI. It's a complement.
Run local when:
- Working with confidential code
- You need offline access
- Building AI-powered features
- Experimenting without limits
Use cloud when:
- Quality is priority
- You need latest features
- Infrastructure isn't your thing
- Usage is low-to-moderate
The "privacy vs capability" tradeoff is real, but it's not binary. Use both. Be thoughtful about what goes where.
And maybe stop pasting your company's authentication code into ChatGPT at 2 AM. Get Ollama running and sleep better.
---
The best setup is the one that matches your actual needs. For most developers, that's a hybrid—local for sensitive work, cloud for everything else.