I Built a Fully Offline AI Butler. His Name is Alfred.
A few weeks ago, I didn't have a single local AI model running on my machine. Today I can say "Hey Alfred" out loud and a British butler answers me — summarising news, opening apps, telling me the time, and doing it all completely offline, on my own hardware, with zero subscriptions and zero data leaving my house.
Here's exactly how I got there, and why you should probably do the same.
Why run a local LLM at all?
The honest answer: privacy and resilience.
Every time you type something into ChatGPT or Claude, that conversation goes to a server somewhere. For most things, that's fine. But there's something fundamentally different about an AI that runs entirely on your own machine — one that works when your internet goes down, one that doesn't bill you per query, one that has no idea what you asked it at 2am.
I've also been retrenched before because of AI automation. That experience made me want to understand this technology, not just consume it. Running it locally felt like the right way to do that.
The stack — what you actually need
Before I get into the Alfred stuff, let me break down what a local AI setup actually looks like, because it's simpler than most people think.
Ollama is the engine. It's a free, open-source tool that downloads and runs AI models locally. You install it, pull a model, and you're running an LLM on your own machine. That's genuinely it. It runs a local API on your machine that other tools can connect to.
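That local API is plain HTTP on port 11434, which is what makes the whole stack composable. Here's a minimal sketch of talking to it with nothing but the standard library — the endpoint and JSON fields follow Ollama's /api/generate API, and the default model name assumes you've already pulled llama3.2:3b:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_payload(prompt, model="llama3.2:3b"):
    """Assemble the JSON body Ollama's /api/generate endpoint expects."""
    return {"model": model, "prompt": prompt, "stream": False}

def ask_ollama(prompt, model="llama3.2:3b"):
    """Send a prompt to the local Ollama server and return the response text."""
    data = json.dumps(build_payload(prompt, model)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example (requires Ollama running locally):
# print(ask_ollama("In one sentence, what is a local LLM?"))
```

Anything that can make an HTTP request — a script, Msty, Alfred — can use the same model this way.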
Msty is the interface. Think of it as a ChatGPT-style chat UI that connects to your local Ollama instance. You don't have to use Msty specifically — Open WebUI is a solid free alternative — but Msty auto-detects Ollama and gets you chatting in about five minutes.
The models — this is where it gets interesting. Ollama supports dozens of models. The one that makes the most sense depends entirely on your hardware. If you're running a mid-range GPU like my RTX 4060 with 4GB VRAM, you're looking at 3B-4B parameter models. That sounds small, but modern small models are genuinely capable. I'm using llama3.2:3b for most things, with Qwen2.5 as an upgrade target.
Hardware reality check
My setup:
- RTX 4060 (4GB VRAM)
- Ryzen 5 5500
- 16GB RAM
- Windows 11
This is a solid mid-range gaming PC — nothing exotic. If you're running anything similar, you can run local models comfortably. The VRAM is your main constraint. 4GB gets you clean 3B models with headroom to spare. 6-8GB opens up the 7B models, which are meaningfully smarter.
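If you want to sanity-check whether a model fits your card, the back-of-envelope maths is simple: a quantised model stores each parameter in roughly 4 bits, plus some headroom for the KV cache and runtime buffers. The 20% overhead factor here is my own rough assumption — treat this as a rule of thumb, not a guarantee:

```python
def model_vram_gb(params_billions, bits_per_param=4, overhead=1.2):
    """Rough VRAM estimate: parameter count times quantised width,
    with ~20% headroom for the KV cache and runtime buffers."""
    bytes_total = params_billions * 1e9 * (bits_per_param / 8) * overhead
    return bytes_total / 1e9

print(round(model_vram_gb(3), 1))  # a 4-bit 3B model: roughly 1.8 GB
print(round(model_vram_gb(7), 1))  # a 4-bit 7B model: roughly 4.2 GB
```

Which is why a 3B model sits comfortably in 4GB of VRAM, and why 7B models want a 6-8GB card.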
Getting it running — the short version
- Download Ollama from ollama.com
- Open a terminal and run `ollama run llama3.2:3b`. That's it. You're running a local LLM.
- Install Msty from msty.app for a proper chat interface
- Msty auto-detects Ollama — you'll see your models ready to go
The whole process takes maybe 30 minutes, including the model download. After that, everything is offline. No internet required, ever, for normal use.
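Once it's installed, you can check which models are pulled without opening a chat UI at all, by hitting Ollama's /api/tags endpoint. A small sketch, again stdlib-only:

```python
import json
import urllib.request

def parse_models(tags_json):
    """Extract model names from an /api/tags response body."""
    return [m["name"] for m in json.loads(tags_json)["models"]]

def installed_models(base_url="http://localhost:11434"):
    """List the models the local Ollama server has already pulled."""
    with urllib.request.urlopen(f"{base_url}/api/tags") as resp:
        return parse_models(resp.read())

# Example (requires Ollama running):
# print(installed_models())
```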
Now for the fun part — Project Alfred
Once I had a working local model, I wanted to push it further. A chat interface is useful, but I wanted something more interesting — something that showed what local AI can actually do when you give it a bit of work.
The idea: a voice-activated AI assistant that runs completely offline. Listens for a wake word, hears what I say, thinks about it, and talks back. And because I could give it whatever personality I wanted, I went with Alfred Pennyworth — Bruce Wayne's butler — except he works for me now.
Here's the full stack for Alfred:
openWakeWord — detects "Hey Alfred" and nothing else. I trained a custom wake word model for about R60 on openwakeword.com. The trained model lives locally and runs completely offline.
Faster Whisper — converts my speech to text. This is an optimised version of OpenAI's Whisper model. Runs on CPU, dead accurate, no internet needed.
Ollama + llama3.2:3b — the brain. Once Whisper has text, it goes to Ollama with a system prompt that tells the model who Alfred is and how he should behave.
Piper TTS — converts Alfred's response back to speech, using a British male voice. There's a pyttsx3 fallback in case Piper isn't available.
Python — about 500 lines of clean, single-file code that connects everything together. Thank you, CLAUDE.
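The four components wire together in a single loop: wait for the wake word, capture audio, transcribe it, send the text to the LLM, speak the reply. Here's a structural sketch with the heavy pieces stubbed out as plain callables — in the real code those slots are filled by openWakeWord, Faster Whisper, Ollama, and Piper, and the function names here are hypothetical:

```python
def assistant_loop(wake_detected, record_audio, transcribe, think, speak, turns=1):
    """One wake-word-to-reply cycle per turn: wait for the wake word,
    capture audio, turn it into text, ask the LLM, speak the answer."""
    replies = []
    for _ in range(turns):
        if not wake_detected():     # openWakeWord in the real stack
            continue
        audio = record_audio()      # microphone capture
        text = transcribe(audio)    # Faster Whisper, speech-to-text
        reply = think(text)         # Ollama + the Alfred system prompt
        speak(reply)                # Piper TTS (pyttsx3 fallback)
        replies.append(reply)
    return replies

# Stubbed demo of the flow:
out = assistant_loop(
    wake_detected=lambda: True,
    record_audio=lambda: b"raw-pcm",
    transcribe=lambda audio: "what time is it",
    think=lambda text: "It is 3:42 in the afternoon, sir.",
    speak=print,
)
```

Because each stage is just a function taking the previous stage's output, swapping any one component — a better STT model, a different voice — doesn't touch the rest of the loop.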
What Alfred can actually do
Here's what works right now:
- "Hey Alfred, open YouTube" — opens YouTube in my browser
- "Hey Alfred, what time is it?" — "It is 3:42 in the afternoon, sir."
- "Hey Alfred, what's the date?" — "Today is Wednesday, the 9th of April, sir."
- "Hey Alfred, search for local welding courses near Cape Town" — searches DuckDuckGo, pulls the results, summarises them via Ollama, and reads the summary back to me
- "Hey Alfred, open VS Code" — launches VS Code
- "Hey Alfred, open my projects folder" — opens my projects directory in Explorer
- Anything else just goes straight to the LLM, and Alfred answers in character
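Under the hood, that routing doesn't need anything clever — a few keyword checks before falling through to the LLM covers it. A simplified sketch of the idea (the intent tags and matching rules here are my own illustration, not Alfred's actual code):

```python
import datetime

def spoken_time(t):
    """Render a time the way a butler would say it."""
    period = "morning" if t.hour < 12 else "afternoon" if t.hour < 18 else "evening"
    hour12 = t.hour % 12 or 12
    return f"It is {hour12}:{t.minute:02d} in the {period}, sir."

def route(command):
    """Map a transcribed command to an intent; anything unmatched
    falls through to the LLM so Alfred can answer in character."""
    c = command.lower()
    if "open youtube" in c:
        return ("open_url", "https://youtube.com")
    if "what time" in c:
        return ("say", spoken_time(datetime.datetime.now().time()))
    if c.startswith("search for"):
        return ("web_search", c.removeprefix("search for").strip())
    return ("llm", command)

# route("Hey Alfred, open YouTube") -> ("open_url", "https://youtube.com")
```

The catch-all at the bottom is what makes the assistant feel complete: every command gets some answer, even if no rule matched.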
And if my internet is down — which in Cape Town happens more than I'd like — Alfred says, "I'm afraid the manor's connection is down, sir", and answers from his own knowledge where he can.
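Detecting that gracefully is a one-liner trick: try to open a TCP socket to a public DNS resolver and treat failure as "offline". A hedged sketch — the helper names are hypothetical, and probing 1.1.1.1 on port 53 is just one common approach:

```python
import socket

def manor_connection_up(host="1.1.1.1", port=53, timeout=1.5):
    """Cheap connectivity probe: try a TCP connection to a public
    DNS resolver. Nothing is sent beyond the handshake."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def answer(question, llm, web_search):
    """Prefer live search when online; otherwise apologise in
    character and fall back to the model's own knowledge."""
    if manor_connection_up():
        return web_search(question)
    return "I'm afraid the manor's connection is down, sir. " + llm(question)
```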
The bit people always ask about
Is it as smart as ChatGPT?
Honestly, no — not on 3B parameters. For complex reasoning or long document analysis, a hosted API still wins. But for a voice assistant that opens apps, answers questions, tells me the time, and searches the web? It's more than capable. And it responds in 5-10 seconds from wake word to spoken answer, which is perfectly usable.
The more interesting point: as better models are released, I change one line in my Python script and Alfred instantly gets smarter. The whole stack is model-agnostic. When I'm ready to upgrade my GPU, bigger models are one ollama pull away.
Why this matters
What I've built is genuinely useful, but it's also a proof of concept for something bigger: local AI is real and accessible right now.
You don't need a server room. You don't need a computer science degree. You need a mid-range PC, an afternoon, and the willingness to run a few terminal commands.
The models are open source. The tools are free. The hardware is stuff most people already own.
Alfred cost me R60 to train a wake word. Everything else was free. He runs offline, he's private, and he's mine.
That feels important.
I'm Brendan Glover — maker, welder, and vibe coder based in Cape Town. This is what I'm building. Follow along.