Reading time: ~20 minutes
Audience: Homelab owners who want a private, offline AI assistant without monthly subscriptions
Last tested: Proxmox VE 8.2, Ollama 0.5, OpenWebUI 0.4, June 2026
Why Self-Host an LLM?
Cloud AI services come with three problems for homelab users: privacy (your data leaves your network), cost (ChatGPT Plus is $20/month, Claude Pro is $20/month, and API bills add up fast), and availability (outages, rate limits, and terms-of-service changes).
Running Ollama on your Proxmox server solves all three. You get a private, always-on LLM that works even when your internet is down — and the only recurring cost is electricity.
What You’ll Build
By the end of this guide, you will have:
- Ollama serving multiple open-weight models (Llama 3.3, Mistral, Phi-4, Qwen 2.5)
- OpenWebUI providing a ChatGPT-style chat interface accessible from any device on your LAN
- Optional GPU acceleration for 3–10× faster token generation
- A Docker Compose stack you can version-control and redeploy in minutes
Hardware Requirements
Minimum (CPU-Only Inference)
| Component | Requirement | Real-World Example |
|---|---|---|
| CPU | 4 cores, x86_64 | Intel N100, old i5-6500 |
| RAM | 16 GB (8 GB for OS + 8 GB for a 7B model) | Any DDR4 system |
| Storage | 30 GB free (models are 4–15 GB each) | 256 GB SSD minimum |
| GPU | None required | CPU inference works |
Recommended (GPU-Accelerated)
| Component | Requirement | Real-World Example |
|---|---|---|
| CPU | 6+ cores | Intel i5-12400, Ryzen 5 5600 |
| RAM | 32 GB | DDR4 3200 MHz |
| GPU | NVIDIA RTX 3060 12 GB, RTX 4060 Ti 16 GB, or Intel Arc A770 16 GB | Buy used on eBay |
| Storage | 100 GB NVMe | WD Black SN770 |
VRAM rule of thumb: A 7B parameter model at Q4_K_M quantization needs ~4.5 GB VRAM. A 13B model needs ~8 GB. A 34B model needs ~20 GB. Match your GPU VRAM to the model size you want to run.
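As a quick way to apply that rule of thumb, the sketch below maps a VRAM figure to the largest model class that fits. The function name `largest_model_for_vram` is hypothetical, and the thresholds are only the Q4_K_M figures quoted above, not measured values.

```shell
# Largest model class that fits a given amount of VRAM, using only the
# rule-of-thumb figures from this guide (Q4_K_M quantization).
largest_model_for_vram() {
    vram_gb="$1"   # GPU memory in GB, e.g. 12
    awk -v v="$vram_gb" 'BEGIN {
        if      (v >= 20)  print "34B"
        else if (v >= 8)   print "13B"
        else if (v >= 4.5) print "7B"
        else               print "none"
    }'
}

largest_model_for_vram 12   # RTX 3060 12 GB     -> prints 13B
largest_model_for_vram 16   # RTX 4060 Ti 16 GB  -> prints 13B
```

Long context windows and parallel requests need extra memory on top of the weights, so treat the result as an upper bound.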
Architecture Decision: LXC vs VM
Proxmox gives you two virtualization options: LXC containers and full VMs. Here is the honest trade-off:
| Factor | LXC Container | Full VM |
|---|---|---|
| GPU passthrough | Easier: bind-mount /dev/dri for Intel iGPU or /dev/nvidia* for NVIDIA | Requires PCIe passthrough (locks GPU to one VM) |
| Performance | Near-native CPU speed | ~2–5% virtualization overhead |
| Docker compatibility | Works if LXC is privileged + nesting=1 | Works natively |
| Isolation | Shares host kernel | Fully isolated |
| Snapshot/backup | Proxmox backup works | Proxmox backup works |
| Best for | Intel iGPU (Quick Sync), CPU-only inference | Dedicated NVIDIA GPU, multi-tenant setups |
Our recommendation: Start with an LXC container if you only need CPU inference or Intel iGPU acceleration. Use a VM if you have a dedicated NVIDIA GPU you want to pass through.
This guide covers both paths.
Path A: LXC Container Setup (CPU or Intel iGPU)
Step A1: Create the LXC Container
- In the Proxmox web UI, click Create CT.
- Template: Choose a Debian 12 or Ubuntu 24.04 template.
- Root Disk: Allocate 40 GB (you will need space for Docker images and LLM models).
- CPU: Assign 4–8 cores.
- Memory: Assign 16–32 GB (at least 12 GB for a 7B model).
- Network: DHCP or static IP on your LAN bridge.
Step A2: Enable Docker in LXC
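Uncheck Unprivileged container on the General tab of the Create CT wizard; it cannot be toggled afterwards without a backup and restore. After creation, select the container → Options → Features and set: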
| Option | Value | Why |
|---|---|---|
| Unprivileged container | No (uncheck) | Docker requires privileges to manage cgroups and networks |
| Features: nesting | Yes (check) | Allows Docker-in-LXC |
| Features: keyctl | Yes (check) | Required by some container runtimes |
Security note: An unprivileged LXC is safer, but Docker inside unprivileged LXC containers is unreliable. For a single-user homelab this trade-off is acceptable. If you need stronger isolation, use Path B (VM).
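To confirm which kind of container you ended up with, you can look at the user-ID mapping from inside it. The sketch below is a heuristic, not an official Proxmox check: it assumes that a privileged container maps container root straight to host root (first line of `/proc/self/uid_map` starting with `0 0`). The function name `lxc_privilege_from_uid_map` is hypothetical.

```shell
# Classify a container from one uid_map line: "<inside-id> <outside-id> <count>".
# Heuristic: if root inside maps to root outside, the container is privileged.
lxc_privilege_from_uid_map() {
    echo "$1" | awk '{
        if ($1 == 0 && $2 == 0) print "privileged"
        else                    print "unprivileged"
    }'
}

# Run inside the LXC container:
if [ -r /proc/self/uid_map ]; then
    lxc_privilege_from_uid_map "$(head -n 1 /proc/self/uid_map)"
fi
```

The same command run on the Proxmox host itself also reports "privileged", so only the result from inside the container is meaningful.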
Step A3: Install Docker Engine
SSH into the container or use the Proxmox console:
# Update and install prerequisites
apt update && apt upgrade -y
apt install -y ca-certificates curl
# Add Docker's official GPG key and repository
install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/debian/gpg -o /etc/apt/keyrings/docker.asc
chmod a+r /etc/apt/keyrings/docker.asc
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/debian $(. /etc/os-release && echo "$VERSION_CODENAME") stable" > /etc/apt/sources.list.d/docker.list
apt update
apt install -y docker-ce docker-ce-cli containerd.io docker-compose-plugin
Verify:
docker run --rm hello-world
Step A4: (Optional) Pass Through Intel iGPU
If your Proxmox host has an Intel iGPU (UHD 630, UHD 730, Iris Xe), you can give the LXC access to it without full PCIe passthrough:
On the Proxmox host, find the render device:
ls -la /dev/dri/
# You should see: renderD128
Edit the LXC config file on the host (/etc/pve/lxc/<CTID>.conf) and add:
lxc.cgroup2.devices.allow: c 226:* rwm
lxc.mount.entry: /dev/dri/renderD128 dev/dri/renderD128 none bind,optional,create=file
lxc.mount.entry: /dev/dri/card0 dev/dri/card0 none bind,optional,create=file
Inside the LXC, install the Intel compute runtime:
apt install -y intel-opencl-icd
Verify GPU visibility:
ls -la /dev/dri/
# Should show renderD128 and card0
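If you want the check above in script form, for example to run after every container restart, a small helper can test for a render node. This is a sketch; `has_render_node` is a hypothetical name, and it only proves that the device file is visible, not that any application will use the GPU.

```shell
# Succeeds if at least one DRM render node (renderD*) exists in the directory.
has_render_node() {
    dri_dir="${1:-/dev/dri}"
    for node in "$dri_dir"/renderD*; do
        # An unmatched glob stays literal, so test that the path really exists
        if [ -e "$node" ]; then
            return 0
        fi
    done
    return 1
}

if has_render_node; then
    echo "render node present"
else
    echo "no render node: re-check the lxc.mount.entry lines on the host"
fi
```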
Path B: Full VM Setup (Dedicated NVIDIA GPU)
Step B1: Create the VM
- In Proxmox, click Create VM.
- OS: Debian 12 or Ubuntu 24.04 ISO.
- System: Change BIOS to OVMF (UEFI) — required for GPU passthrough.
- Machine: Set to q35 for better PCIe compatibility.
- Disk: 60 GB, VirtIO SCSI.
- CPU: 6–12 cores, type host.
- Memory: 32 GB minimum.
Step B2: PCIe GPU Passthrough
This is the most technical step. You need to prevent the Proxmox host from claiming the GPU.
On the Proxmox host, identify your GPU:
lspci -nn | grep -i nvidia
# Example output: 01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA106 [GeForce RTX 3060] [10de:2504]
Note the PCI address (01:00.0) and the vendor:device IDs (10de:2504).
Edit /etc/default/grub and add to GRUB_CMDLINE_LINUX_DEFAULT:
intel_iommu=on iommu=pt vfio-pci.ids=10de:2504
(Replace 10de:2504 with your GPU’s actual IDs. For Intel CPUs use intel_iommu=on; for AMD use amd_iommu=on.)
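Copying the IDs by hand is error-prone, so here is a sketch that builds the `vfio-pci.ids=` argument from `lspci -nn` output. The function name `vfio_ids_from_lspci` is hypothetical. It relies on the bracketed `[vvvv:dddd]` pair shown in the example output above and skips the four-digit class code such as `[0300]`.

```shell
# Reads `lspci -nn` lines on stdin and prints vfio-pci.ids=<id>[,<id>...]
vfio_ids_from_lspci() {
    # Match only [xxxx:xxxx] vendor:device pairs, strip the brackets,
    # de-duplicate, then join the IDs with commas.
    ids=$(grep -oE '\[[0-9a-f]{4}:[0-9a-f]{4}\]' | tr -d '[]' | sort -u | paste -sd, -)
    if [ -n "$ids" ]; then
        echo "vfio-pci.ids=$ids"
    fi
}

# On the Proxmox host, for every function of the device at bus 01, slot 00:
#   lspci -nn -s 01:00 | vfio_ids_from_lspci
```

Many NVIDIA cards expose an audio function next to the VGA function; selecting the whole slot with `-s 01:00` picks up both IDs if that is the case on your card.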
Update GRUB and reboot:
update-grub
reboot
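Once the host is back up, it is worth confirming that the IOMMU is active before attaching the GPU. The sketch below lists devices per IOMMU group by walking `/sys/kernel/iommu_groups`; the function name `list_iommu_groups` is hypothetical, and the optional directory argument exists only so the function can be tried against a test directory.

```shell
# Print "group <n>: <pci-address>" for every device in every IOMMU group.
list_iommu_groups() {
    root="${1:-/sys/kernel/iommu_groups}"
    for dev in "$root"/*/devices/*; do
        # Skip the literal pattern when nothing matches
        [ -e "$dev" ] || continue
        # Path layout is <root>/<group>/devices/<address>
        group=$(basename "$(dirname "$(dirname "$dev")")")
        echo "group $group: $(basename "$dev")"
    done
}

# On the Proxmox host:
#   list_iommu_groups | grep 01:00
```

No output at all suggests the IOMMU is not enabled. Ideally the GPU's functions sit in a group with no unrelated devices.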
After reboot, add the GPU to the VM: VM → Hardware → Add → PCI Device → select your GPU. Check All Functions, ROM-Bar, and PCI-Express.
Step B3: Install NVIDIA Drivers Inside the VM
# Inside the VM (Debian 12 example)
# Kernel headers are required so the installer can build the driver module
apt update && apt install -y build-essential dkms linux-headers-$(uname -r)
wget https://us.download.nvidia.com/XFree86/Linux-x86_64/550.120/NVIDIA-Linux-x86_64-550.120.run
chmod +x NVIDIA-Linux-x86_64-550.120.run
./NVIDIA-Linux-x86_64-550.120.run --no-opengl-files
Verify:
nvidia-smi
# Should show GPU name, driver version, and CUDA version
Then install Docker as in Step A3 and the NVIDIA Container Toolkit:
# gpg is needed to convert NVIDIA's signing key
apt install -y gnupg
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
# NVIDIA publishes a single generic deb repository for Debian and Ubuntu
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
apt update && apt install -y nvidia-container-toolkit
nvidia-ctk runtime configure --runtime=docker
systemctl restart docker
Deploying Ollama + OpenWebUI with Docker Compose
Both paths converge here. Create your stack directory:
mkdir -p ~/ollama-stack && cd ~/ollama-stack
Create docker-compose.yml:
services:
  ollama:
    image: ollama/ollama:0.5
    container_name: ollama
    restart: unless-stopped
    volumes:
      - ollama_data:/root/.ollama
    ports:
      - "11434:11434"
    environment:
      - OLLAMA_KEEP_ALIVE=24h    # Keep model in memory between requests
      - OLLAMA_HOST=0.0.0.0      # Listen on all interfaces
      - OLLAMA_NUM_PARALLEL=2    # Max concurrent requests
    # For an NVIDIA GPU (Path B), uncomment this block:
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           count: 1
    #           capabilities: [gpu]
    # For an Intel iGPU (Path A), uncomment this block:
    # devices:
    #   - /dev/dri:/dev/dri
  openwebui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: openwebui
    restart: unless-stopped
    volumes:
      - openwebui_data:/app/backend/data
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - WEBUI_SECRET_KEY=${WEBUI_SECRET_KEY}    # Read from the .env file next to this file
    depends_on:
      - ollama
volumes:
  ollama_data:
  openwebui_data:
Generate a secret key and start:
echo "WEBUI_SECRET_KEY=$(openssl rand -hex 32)" >> .env
docker compose up -d
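The containers can take a few seconds to accept connections after `docker compose up -d` returns. A small polling helper avoids guessing; this is a sketch, `wait_for_http` is a hypothetical name, and it assumes `curl` is installed (it was added in Step A3). The `/api/tags` endpoint is the same one used in the troubleshooting section.

```shell
# Poll a URL until it answers with a success status or the attempts run out.
# Usage: wait_for_http <url> [tries] [delay-seconds]
wait_for_http() {
    url="$1"
    tries="${2:-30}"
    delay="${3:-2}"
    i=0
    while [ "$i" -lt "$tries" ]; do
        # -f: treat HTTP errors as failure; --max-time: do not hang forever
        if curl -fsS -o /dev/null --max-time 5 "$url" 2>/dev/null; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Example, run on the Docker host:
#   wait_for_http http://localhost:11434/api/tags && echo "Ollama is up"
#   wait_for_http http://localhost:3000 && echo "OpenWebUI is up"
```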
Pulling Models and First Chat
Pull Your First Model
The easiest way: use OpenWebUI’s built-in model manager at http://your-server-ip:3000 → Settings → Models → Pull a model.
Or via CLI:
# Pull a compact model good for CPU inference (about 2 GB)
docker exec -it ollama ollama pull llama3.2:3b
# Pull a strong general-purpose 8B model (about 5 GB, needs GPU for speed)
# Note: llama3.3 is published only as a 70B model (a download of roughly 43 GB)
docker exec -it ollama ollama pull llama3.1:8b
# Pull a coding specialist (9 GB)
docker exec -it ollama ollama pull qwen2.5-coder:14b
# List installed models
docker exec -it ollama ollama list
Model Picks for Your Hardware
| Hardware | Recommended Models | Speed (tokens/sec) |
|---|---|---|
| Intel N100 (CPU only) | llama3.2:3b, phi3:mini, gemma2:2b | 5–10 t/s |
| i5-12400 (CPU only) | llama3.1:8b, mistral:7b, qwen2.5:7b | 8–15 t/s |
| RTX 3060 12 GB | llama3.1:8b, mistral-nemo:12b, qwen2.5:14b | 40–80 t/s |
| RTX 4060 Ti 16 GB | llama3.1:8b, qwen2.5:14b, codestral:22b | 50–90 t/s |
| Dual RTX 3060 (24 GB) | mixtral:8x7b, command-r:35b | 25–45 t/s |
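The speeds in this table are rough ranges, so it helps to measure your own. A non-streaming `/api/generate` response from Ollama includes `eval_count` (tokens generated) and `eval_duration` (nanoseconds); dividing one by the other gives tokens per second. The sketch below does that arithmetic. `tokens_per_sec` is a hypothetical helper name, and the usage lines assume `jq` is installed, which this guide does not otherwise require.

```shell
# Convert Ollama's eval_count and eval_duration (ns) into tokens per second.
tokens_per_sec() {
    awk -v n="$1" -v ns="$2" 'BEGIN {
        if (ns > 0) printf "%.1f\n", n / (ns / 1e9)
        else        print "n/a"
    }'
}

tokens_per_sec 120 3000000000   # 120 tokens in 3 s -> prints 40.0

# Against a running stack (assumes jq):
#   resp=$(curl -s http://localhost:11434/api/generate \
#     -d '{"model": "llama3.2:3b", "prompt": "Hello", "stream": false}')
#   tokens_per_sec "$(echo "$resp" | jq -r .eval_count)" \
#                  "$(echo "$resp" | jq -r .eval_duration)"
```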
First Chat via OpenWebUI
- Open http://your-server-ip:3000
- Create an admin account (first user becomes admin)
- Select your model from the dropdown (top-left)
- Type: “Explain VLANs like I’m setting up my first homelab network”
- Watch the tokens stream back
OpenWebUI also supports:
- RAG (Retrieval-Augmented Generation): Upload PDFs, markdown files, or code repos and chat with them
- Multiple models simultaneously: Compare answers side-by-side
- Web search integration: Configure a search API (SearXNG, Google) for live data
- Voice input/output: Using the browser Web Speech API
Performance Optimization
1. Keep Models in Memory
The environment variable OLLAMA_KEEP_ALIVE=24h keeps the last-used model loaded in RAM/VRAM for 24 hours. Without it, Ollama unloads models after 5 minutes of inactivity, causing a 3–10 second cold-start delay on the next request.
2. Adjust Context Window
Default context is 2048 tokens. Increase it for longer conversations, at the cost of more RAM/VRAM:
docker exec -it ollama ollama run llama3.1:8b
# Inside the REPL:
/set parameter num_ctx 8192
/save llama3.1-8k
Or via the API:
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Hello",
  "options": { "num_ctx": 8192 }
}'
3. Quantization Level
When pulling models, you can specify quantization through the tag:
# Smaller, faster, slight quality loss
docker exec -it ollama ollama pull llama3.2:3b-instruct-q4_K_M
# Larger, slower, best quality
docker exec -it ollama ollama pull llama3.2:3b-instruct-q8_0
Q4_K_M is the sweet spot for most homelab use: 4-bit quantization with medium quality preservation, fitting 7B models in ~4.5 GB VRAM. Tag names vary by model, so check the model's Tags page on ollama.com/library before pulling.
4. Concurrent Requests
Ollama can handle parallel requests. Set OLLAMA_NUM_PARALLEL=2 (or more with a powerful GPU) in the Compose file. This lets two family members chat simultaneously.
5. Disk I/O
Place Ollama data on an NVMe drive if possible. Model loading reads 4–15 GB from disk. On a SATA SSD, a 7B model loads in ~5 seconds; on NVMe, ~1 second.
Security Considerations
Do Not Expose to the Public Internet
Ollama's API has no built-in authentication, and OpenWebUI's account login is the only thing protecting the chat interface. Both are designed for LAN-only access. If you need remote access:
- Use Tailscale or WireGuard VPN to connect to your home network
- Put OpenWebUI behind a reverse proxy (NGINX Proxy Manager, Traefik, or Caddy) with HTTPS and HTTP Basic Auth
- Never port-forward port 3000 or 11434 directly
OpenWebUI Admin Account
The first user to register in OpenWebUI becomes the administrator. If you expose it to even your LAN, set a strong password immediately.
Model Provenance
Only pull models from Ollama’s official library (ollama.com/library) or from Hugging Face publishers you trust (via ollama pull). Treat GGUF files from unknown sources as untrusted input. Ollama’s library is curated, which makes it the lower-risk default.
Resource Limits
In Docker Compose, you can cap CPU and memory:
services:
  ollama:
    # ... other config ...
    deploy:
      resources:
        limits:
          cpus: '4'
          memory: 16G
This prevents a runaway LLM inference from starving other containers.
Troubleshooting
“Error: unable to load model — CUDA error: out of memory”
The model is too large for your GPU VRAM. Solutions:
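- Pull a smaller quantization: choose a tag ending in q4_K_M
- Use a smaller model: drop from 13B to 7B
- Offload only part of the model to the GPU: set the num_gpu parameter to the number of layers to keep on the GPU (for example, /set parameter num_gpu 12 in the REPL); the remaining layers run on the CPU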
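Before picking one of these fixes, check how much VRAM is actually free; another process or a model kept alive from an earlier request may be holding memory. The sketch below parses the CSV output of `nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits`. The helper name `vram_free_mib` is hypothetical.

```shell
# Compute free VRAM in MiB from one "<used>, <total>" line.
vram_free_mib() {
    echo "$1" | awk -F', *' '{ print $2 - $1 }'
}

vram_free_mib "3500, 12288"   # prints 8788

# On a machine with the NVIDIA driver installed (first GPU only):
if command -v nvidia-smi >/dev/null 2>&1; then
    line=$(nvidia-smi --query-gpu=memory.used,memory.total \
        --format=csv,noheader,nounits | head -n 1)
    echo "free VRAM: $(vram_free_mib "$line") MiB"
fi
```

Compare the result with the VRAM rule of thumb from the hardware section to decide whether a smaller quantization is enough or a smaller model is needed.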
OpenWebUI shows “Ollama connection refused”
- Verify the Ollama container is running: docker compose ps
- Check that OLLAMA_HOST=0.0.0.0 is set (default binds to 127.0.0.1 only)
- From inside the OpenWebUI container, test: curl http://ollama:11434/api/tags
Extremely slow on CPU (1–3 tokens/sec)
- CPU-only 7B models are inherently slow on consumer hardware
- Switch to a 3B model (e.g., llama3.2:3b) for usable 5–10 t/s on CPU
- Consider a used GPU: a GTX 1070 8 GB costs ~$80 on eBay and handles 7B models at 20+ t/s
LXC container won’t start after enabling features
Ensure nesting=1 and keyctl=1 are both enabled. If the issue persists, convert to a VM — Docker-in-LXC is convenient but fragile.
Monitoring and Maintenance
Check Ollama Logs
docker compose logs -f ollama
Update Models
Ollama models don’t auto-update. Periodically check:
docker exec -it ollama ollama list
docker exec -it ollama ollama pull llama3.1:8b # Upgrades if newer version exists
Update the Stack
cd ~/ollama-stack
docker compose pull # Pull latest images
docker compose up -d # Recreate with new images
docker image prune -f # Clean old images
Disk Usage
docker exec -it ollama du -sh /root/.ollama/models/
# Delete unused models:
docker exec -it ollama ollama rm phi3:mini
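To see the total at a glance, you can sum the SIZE column of `ollama list`. This sketch assumes the column layout `NAME  ID  SIZE  MODIFIED`, with sizes printed as a number followed by `GB` or `MB` in decimal units; if your Ollama version prints something different, adjust the field numbers. `sum_model_gb` is a hypothetical helper name.

```shell
# Sum the SIZE column of `ollama list` output (read from stdin), in GB.
sum_model_gb() {
    awk 'NR > 1 {                       # skip the header row
        size = $3; unit = $4
        if (unit == "MB") size = size / 1000
        total += size
    } END { printf "%.1f\n", total }'
}

# Example (no -t flag, so the output pipes cleanly):
#   docker exec ollama ollama list | sum_model_gb
```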
Beyond the Basics
Add SearXNG for Web Search
OpenWebUI supports web search via SearXNG. Add to your Compose file:
  searxng:
    image: searxng/searxng:latest
    container_name: searxng
    restart: unless-stopped
    ports:
      - "8080:8080"
    volumes:
      - searxng_data:/etc/searxng    # Also declare searxng_data under the top-level volumes: key
    environment:
      - SEARXNG_BASE_URL=http://searxng:8080
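Then in OpenWebUI → Admin Panel → Settings → Web Search → set the SearXNG Query URL to http://searxng:8080/search?q=<query>. SearXNG must also have the json output format enabled in its settings.yml, because OpenWebUI queries it as an API.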
Add Stable Diffusion for Image Generation
  automatic1111:
    image: ghcr.io/ai-dock/stable-diffusion-webui:latest
    container_name: sd-webui
    restart: unless-stopped
    ports:
      - "7860:7860"
    volumes:
      - sd_data:/opt/stable-diffusion-webui    # Also declare sd_data under the top-level volumes: key
    # Add GPU config as with Ollama
OpenWebUI can connect to it for image generation within chats.
Automate Model Pushes with Ansible
If you manage multiple Proxmox nodes, use Ansible to ensure Ollama is deployed everywhere with consistent models. See our Automating Your Homelab with Ansible guide.
Conclusion
You now have a fully private, self-hosted AI assistant running on your Proxmox homelab. No monthly fees, no data leaving your network, and full control over which models you run.
What You Achieved
- ✅ Ollama serving open-weight LLMs on your own hardware
- ✅ OpenWebUI providing a polished chat interface accessible from any device
- ✅ Optional GPU acceleration for fast inference
- ✅ A reproducible Docker Compose stack
Next Steps
- Set up NGINX Proxy Manager for local domains: access OpenWebUI at ai.yourlan.local
- Monitor your stack with Grafana and Prometheus: track GPU utilization and request latency
- Explore more self-hosted apps: build a complete privacy-first toolkit
- Proxmox Beginner Guide: if you’re new to Proxmox, start here
What models are you running in your homelab? Drop a comment below!