Reading time: ~20 minutes
Audience: Homelab owners who want a private, offline AI assistant without monthly subscriptions
Last tested: Proxmox VE 8.2, Ollama 0.5, OpenWebUI 0.4, June 2026


Why Self-Host an LLM?

Cloud AI services come with three problems for homelab users: privacy (your data leaves your network), cost (ChatGPT Plus is $20/month, Claude Pro is $20/month, and API bills add up fast), and availability (outages, rate limits, and terms-of-service changes).

Running Ollama on your Proxmox server solves all three. You get a private, always-on LLM that works even when your internet is down — and the only recurring cost is electricity.

What You’ll Build

By the end of this guide, you will have:

  • Ollama serving multiple open-weight models (Llama 3.3, Mistral, Phi-4, Qwen 2.5)
  • OpenWebUI providing a ChatGPT-style chat interface accessible from any device on your LAN
  • Optional GPU acceleration for 3–10× faster token generation
  • A Docker Compose stack you can version-control and redeploy in minutes

Hardware Requirements

Minimum (CPU-Only Inference)

| Component | Requirement | Real-World Example |
| --- | --- | --- |
| CPU | 4 cores, x86_64 | Intel N100, old i5-6500 |
| RAM | 16 GB (8 GB for OS + 8 GB for a 7B model) | Any DDR4 system |
| Storage | 30 GB free (models are 4–15 GB each) | 256 GB SSD minimum |
| GPU | None required | CPU inference works |

Recommended (GPU-Accelerated)

| Component | Requirement | Real-World Example |
| --- | --- | --- |
| CPU | 6+ cores | Intel i5-12400, Ryzen 5 5600 |
| RAM | 32 GB | DDR4 3200 MHz |
| GPU | NVIDIA RTX 3060 12 GB, RTX 4060 Ti 16 GB, or Intel Arc A770 16 GB | Buy used on eBay |
| Storage | 100 GB NVMe | WD Black SN770 |

VRAM rule of thumb: A 7B parameter model at Q4_K_M quantization needs ~4.5 GB VRAM. A 13B model needs ~8 GB. A 34B model needs ~20 GB. Match your GPU VRAM to the model size you want to run.
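
The rule of thumb above can be turned into a quick pre-purchase check. The sketch below is only a lookup of the three figures quoted in this guide (7B, 13B, 34B at Q4_K_M); the function names vram_needed_gb and fits_in_vram are made up for illustration, and any other model size is reported as unknown rather than guessed.

```shell
# Rule-of-thumb VRAM needs in GB at Q4_K_M, copied from the figures above.
# Sizes the guide does not cover return "unknown" instead of an estimate.
vram_needed_gb() {
  case "$1" in
    7)  echo 4.5 ;;
    13) echo 8 ;;
    34) echo 20 ;;
    *)  echo unknown ;;
  esac
}

# fits_in_vram <model_size_in_billions> <gpu_vram_gb>
fits_in_vram() {
  local need
  need=$(vram_needed_gb "$1")
  if [ "$need" = unknown ]; then
    echo unknown
    return
  fi
  # awk handles the fractional comparison that plain shell arithmetic cannot.
  awk -v need="$need" -v have="$2" \
    'BEGIN { if (have + 0 >= need + 0) print "fits"; else print "too large" }'
}

fits_in_vram 13 12   # 13B model on a 12 GB card: prints "fits"
fits_in_vram 34 16   # 34B model on a 16 GB card: prints "too large"
```

Treat the result as a first filter only: context length and parallel requests add memory on top of the model weights.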


Architecture Decision: LXC vs VM

Proxmox gives you two virtualization options: LXC containers and full VMs. Here is the honest trade-off:

| Factor | LXC Container | Full VM |
| --- | --- | --- |
| GPU passthrough | Easier: bind-mount /dev/dri for Intel iGPU or /dev/nvidia* for NVIDIA | Requires PCIe passthrough (locks GPU to one VM) |
| Performance | Near-native CPU speed | ~2–5% virtualization overhead |
| Docker compatibility | Works if LXC is privileged + nesting=1 | Works natively |
| Isolation | Shares host kernel | Fully isolated |
| Snapshot/backup | Proxmox backup works | Proxmox backup works |
| Best for | Intel iGPU (Quick Sync), CPU-only inference | Dedicated NVIDIA GPU, multi-tenant setups |

Our recommendation: Start with an LXC container if you only need CPU inference or Intel iGPU acceleration. Use a VM if you have a dedicated NVIDIA GPU you want to pass through.

This guide covers both paths.


Path A: LXC Container Setup (CPU or Intel iGPU)

Step A1: Create the LXC Container

  1. In the Proxmox web UI, click Create CT.
  2. Template: Choose a Debian 12 or Ubuntu 24.04 template.
  3. Root Disk: Allocate 40 GB (you will need space for Docker images and LLM models).
  4. CPU: Assign 4–8 cores.
  5. Memory: Assign 16–32 GB (at least 12 GB for a 7B model).
  6. Network: DHCP or static IP on your LAN bridge.

Step A2: Enable Docker in LXC

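The Unprivileged container checkbox is set on the General tab of the Create CT wizard and cannot be toggled from the web UI afterwards, so uncheck it when you create the container. After creation, select the container → Options → Features and enable nesting and keyctl:

| Option | Value | Why |
| --- | --- | --- |
| Unprivileged container | No (uncheck, at creation) | Docker requires privileges to manage cgroups and networks |
| Features: nesting | Yes (check) | Allows Docker-in-LXC |
| Features: keyctl | Yes (check) | Required by some container runtimes |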

Security note: An unprivileged LXC is safer, but Docker inside unprivileged LXC containers is unreliable. For a single-user homelab this trade-off is acceptable. If you need stronger isolation, use Path B (VM).

Step A3: Install Docker Engine

SSH into the container or use the Proxmox console:

# Update and install prerequisites
apt update && apt upgrade -y
apt install -y ca-certificates curl

# Pick the matching Docker repository: $ID is "debian" or "ubuntu", depending on the template
. /etc/os-release
DOCKER_REPO="https://download.docker.com/linux/$ID"

# Add Docker's official GPG key and repository
install -m 0755 -d /etc/apt/keyrings
curl -fsSL "$DOCKER_REPO/gpg" -o /etc/apt/keyrings/docker.asc
chmod a+r /etc/apt/keyrings/docker.asc

echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] $DOCKER_REPO $VERSION_CODENAME stable" > /etc/apt/sources.list.d/docker.list

apt update
apt install -y docker-ce docker-ce-cli containerd.io docker-compose-plugin

Verify:

docker run --rm hello-world

Step A4: (Optional) Pass Through Intel iGPU

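If your Proxmox host has an Intel iGPU (UHD 630, UHD 730, Iris Xe), you can give the LXC access to it without full PCIe passthrough. Note that the stock ollama/ollama image targets NVIDIA and AMD GPUs; whether an Intel iGPU actually accelerates inference depends on the Ollama build you run, so check the Ollama logs after startup to see which device it detected.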

On the Proxmox host, find the render device:

ls -la /dev/dri/
# You should see card0 and renderD128 (the card index can differ, e.g. card1)

Edit the LXC config file on the host (/etc/pve/lxc/<CTID>.conf) and add:

lxc.cgroup2.devices.allow: c 226:* rwm
lxc.mount.entry: /dev/dri/renderD128 dev/dri/renderD128 none bind,optional,create=file
lxc.mount.entry: /dev/dri/card0 dev/dri/card0 none bind,optional,create=file

Inside the LXC, install the Intel compute runtime:

apt install -y intel-opencl-icd

Verify GPU visibility:

ls -la /dev/dri/
# Should show renderD128 and card0
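
The visual check above can also be scripted, which is handy after host kernel upgrades. This is a minimal sketch: has_render_node is a hypothetical helper name, and it only confirms that a render node exists inside the container, not that Ollama actually uses it.

```shell
# has_render_node [dir]: succeed if at least one DRM render node (renderD*)
# exists under dir (default /dev/dri).
has_render_node() {
  local dir="${1:-/dev/dri}" node
  for node in "$dir"/renderD*; do
    # An unmatched glob stays literal, so test that the path really exists.
    [ -e "$node" ] && return 0
  done
  return 1
}

if has_render_node; then
  echo "render node found"
else
  echo "no render node: re-check the lxc.mount.entry lines and restart the container"
fi
```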

Path B: Full VM Setup (Dedicated NVIDIA GPU)

Step B1: Create the VM

  1. In Proxmox, click Create VM.
  2. OS: Debian 12 or Ubuntu 24.04 ISO.
  3. System: Change BIOS to OVMF (UEFI) — required for GPU passthrough.
  4. Machine: Set to q35 for better PCIe compatibility.
  5. Disk: 60 GB, VirtIO SCSI.
  6. CPU: 6–12 cores, type host.
  7. Memory: 32 GB minimum.

Step B2: PCIe GPU Passthrough

This is the most technical step. You need to prevent the Proxmox host from claiming the GPU.

On the Proxmox host, identify your GPU:

lspci -nn | grep -i nvidia
# Example output: 01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA106 [GeForce RTX 3060] [10de:2504]

Note the PCI address (01:00.0) and the vendor:device ID (10de:2504).

Edit /etc/default/grub and add to GRUB_CMDLINE_LINUX_DEFAULT:

intel_iommu=on iommu=pt vfio-pci.ids=10de:2504

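(Replace 10de:2504 with your GPU’s actual IDs. Many cards also expose an HDMI audio function at the same address, such as 01:00.1, with its own ID; list it too, comma-separated, so the whole device is handed to vfio-pci. For Intel CPUs use intel_iommu=on; for AMD use amd_iommu=on.)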
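
Copying IDs by hand is error-prone, especially when the card exposes more than one function. The sketch below builds the vfio-pci.ids argument from lspci -nn output; vfio_ids is a hypothetical helper name, and the sample input is the lspci line shown earlier in this step.

```shell
# vfio_ids: read `lspci -nn` output on stdin and print a vfio-pci.ids=...
# kernel argument covering every [vendor:device] pair found.
vfio_ids() {
  grep -oE '\[[0-9a-f]{4}:[0-9a-f]{4}\]' \
    | tr -d '[]' \
    | sort -u \
    | paste -sd, - \
    | sed 's/^/vfio-pci.ids=/'
}

# Sample line copied from the lspci example above.
vfio_ids <<'EOF'
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA106 [GeForce RTX 3060] [10de:2504]
EOF
# prints: vfio-pci.ids=10de:2504
```

On a real host you would feed it every function of the card, for example lspci -nn -s 01:00 | vfio_ids, and paste the result into GRUB_CMDLINE_LINUX_DEFAULT.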

Update GRUB and reboot:

update-grub
reboot

After reboot, add the GPU to the VM: VM → Hardware → Add → PCI Device → select your GPU. Check All Functions, ROM-Bar, and PCI-Express.

Step B3: Install NVIDIA Drivers Inside the VM

# Inside the VM (Debian 12 example)
# Kernel headers are needed so the installer can build the driver module
apt update && apt install -y build-essential dkms linux-headers-$(uname -r)
wget https://us.download.nvidia.com/XFree86/Linux-x86_64/550.120/NVIDIA-Linux-x86_64-550.120.run
chmod +x NVIDIA-Linux-x86_64-550.120.run
./NVIDIA-Linux-x86_64-550.120.run --no-opengl-files

Verify:

nvidia-smi
# Should show GPU name, driver version, and CUDA version

Then install Docker as in Step A3 and the NVIDIA Container Toolkit:

# Add NVIDIA's signing key and the distribution-independent apt repository
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

apt update && apt install -y nvidia-container-toolkit
nvidia-ctk runtime configure --runtime=docker
systemctl restart docker

Deploying Ollama + OpenWebUI with Docker Compose

Both paths converge here. Create your stack directory:

mkdir -p ~/ollama-stack && cd ~/ollama-stack

Create docker-compose.yml:

services:
  ollama:
    image: ollama/ollama:0.5          # If this tag is not published, pin an exact release tag from Docker Hub
    container_name: ollama
    restart: unless-stopped
    volumes:
      - ollama_data:/root/.ollama
    ports:
      - "11434:11434"
    environment:
      - OLLAMA_KEEP_ALIVE=24h        # Keep model in memory between requests
      - OLLAMA_HOST=0.0.0.0           # Listen on all interfaces
      - OLLAMA_NUM_PARALLEL=2         # Max concurrent requests
    # For NVIDIA GPU (uncomment the block below to enable):
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           count: 1
    #           capabilities: [gpu]
    # For Intel iGPU (uncomment the block below to enable):
    # devices:
    #   - /dev/dri:/dev/dri

  openwebui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: openwebui
    restart: unless-stopped
    volumes:
      - openwebui_data:/app/backend/data
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - WEBUI_SECRET_KEY=${WEBUI_SECRET_KEY}   # Read from .env; generate with: openssl rand -hex 32
    depends_on:
      - ollama

volumes:
  ollama_data:
  openwebui_data:

Generate a secret key and start:

echo "WEBUI_SECRET_KEY=$(openssl rand -hex 32)" >> .env
docker compose up -d
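
Because the echo command appends to .env, running it a second time leaves two WEBUI_SECRET_KEY lines, and a changed key will likely invalidate existing logins. A minimal idempotent variant is sketched below; ensure_secret is a hypothetical helper name, and the /dev/urandom fallback is an assumption for systems without openssl.

```shell
# ensure_secret [env_file]: append WEBUI_SECRET_KEY to the env file only if
# it is not already defined there, so reruns keep the existing key.
ensure_secret() {
  local env_file="${1:-.env}" key
  if [ -f "$env_file" ] && grep -q '^WEBUI_SECRET_KEY=' "$env_file"; then
    return 0
  fi
  if command -v openssl >/dev/null 2>&1; then
    key=$(openssl rand -hex 32)
  else
    # Fallback when openssl is missing: 32 random bytes as 64 hex characters.
    key=$(head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n')
  fi
  printf 'WEBUI_SECRET_KEY=%s\n' "$key" >> "$env_file"
  chmod 600 "$env_file"   # the key is a credential; keep it owner-readable only
}

# Usage, from the stack directory:
#   ensure_secret .env && docker compose up -d
```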

Pulling Models and First Chat

Pull Your First Model

The easiest way: use OpenWebUI’s built-in model manager at http://your-server-ip:3000 → Settings → Models → Pull a model.

Or via CLI:

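# Pull a compact model good for CPU inference (~2 GB)
docker exec -it ollama ollama pull llama3.2:3b

# Pull a strong general-purpose 8B model (about 5 GB, needs GPU for speed)
docker exec -it ollama ollama pull llama3.1:8b

# Llama 3.3 ships only as a 70B model (roughly 40+ GB at 4-bit); pull it only with enough RAM/VRAM
# docker exec -it ollama ollama pull llama3.3:latest

# Pull a coding specialist (9 GB)
docker exec -it ollama ollama pull qwen2.5-coder:14b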

# List installed models
docker exec -it ollama ollama list

Model Picks for Your Hardware

| Hardware | Recommended Models | Speed (tokens/sec) |
| --- | --- | --- |
| Intel N100 (CPU only) | llama3.2:3b, phi3:mini, gemma2:2b | 5–10 t/s |
| i5-12400 (CPU only) | llama3.1:8b, mistral:7b, qwen2.5:7b | 8–15 t/s |
| RTX 3060 12 GB | llama3.1:8b, mistral-nemo:12b, qwen2.5:14b | 40–80 t/s |
| RTX 4060 Ti 16 GB | llama3.1:8b, qwen2.5:14b, codestral:22b | 50–90 t/s |
| Dual RTX 3060 (24 GB) | mixtral:8x7b, command-r:35b | 25–45 t/s |

First Chat via OpenWebUI

  1. Open http://your-server-ip:3000
  2. Create an admin account (first user becomes admin)
  3. Select your model from the dropdown (top-left)
  4. Type: “Explain VLANs like I’m setting up my first homelab network”
  5. Watch the tokens stream back

OpenWebUI also supports:

  • RAG (Retrieval-Augmented Generation): Upload PDFs, markdown files, or code repos and chat with them
  • Multiple models simultaneously: Compare answers side-by-side
  • Web search integration: Configure a search API (SearXNG, Google) for live data
  • Voice input/output: Using the browser Web Speech API


Performance Optimization

1. Keep Models in Memory

The environment variable OLLAMA_KEEP_ALIVE=24h keeps the last-used model loaded in RAM/VRAM for 24 hours. Without it, Ollama unloads models after 5 minutes of inactivity, causing a 3–10 second cold-start delay on the next request.

2. Adjust Context Window

Default context is 2048 tokens. Increase it for longer conversations, keeping in mind that a larger context also uses more RAM/VRAM:

docker exec -it ollama ollama run llama3.1:8b
# Inside the REPL:
/set parameter num_ctx 8192
/save llama3.1-8k

Or via the API:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Hello",
  "options": { "num_ctx": 8192 }
}'
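
If you call the API from scripts, a small wrapper keeps the context size explicit on every request. The sketch below only builds the JSON body; generate_payload is a hypothetical helper name, "stream": false is added so the reply arrives as a single JSON object, and the prompt is inserted as-is, so it must not contain double quotes or backslashes.

```shell
# generate_payload <model> <prompt> <num_ctx>: print a JSON body for /api/generate.
# No JSON escaping is done, so keep quotes and backslashes out of the prompt.
generate_payload() {
  local model="$1" prompt="$2" num_ctx="$3"
  case "$num_ctx" in
    ''|*[!0-9]*) echo "num_ctx must be a positive integer" >&2; return 1 ;;
  esac
  printf '{"model":"%s","prompt":"%s","stream":false,"options":{"num_ctx":%s}}\n' \
    "$model" "$prompt" "$num_ctx"
}

generate_payload "mistral:7b" "Hello" 8192
# prints: {"model":"mistral:7b","prompt":"Hello","stream":false,"options":{"num_ctx":8192}}

# Usage against a running stack:
#   curl -s http://localhost:11434/api/generate -d "$(generate_payload mistral:7b Hello 8192)"
```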

3. Quantization Level

When pulling models, you can specify quantization through the tag. Exact tag names vary per model, so check the model’s Tags page on ollama.com first:

# Smaller, faster, slight quality loss
docker exec -it ollama ollama pull llama3.2:3b-instruct-q4_K_M

# Larger, slower, best quality
docker exec -it ollama ollama pull llama3.2:3b-instruct-q8_0

Q4_K_M is the sweet spot for most homelab use: 4-bit K-quantization in its medium variant, fitting 7B models in ~4.5 GB VRAM.

4. Concurrent Requests

Ollama can handle parallel requests. Set OLLAMA_NUM_PARALLEL=2 (or more with a powerful GPU) in the Compose file. This lets two family members chat simultaneously.

5. Disk I/O

Place Ollama data on an NVMe drive if possible. Model loading reads 4–15 GB from disk. On a SATA SSD, a 7B model loads in ~5 seconds; on NVMe, ~1 second.


Security Considerations

Do Not Expose to the Public Internet

Ollama’s API has no built-in authentication, and OpenWebUI’s account login alone is not a sufficient barrier for a service exposed to the Internet. Both are designed for LAN-only access. If you need remote access:

  1. Use Tailscale or WireGuard VPN to connect to your home network
  2. Put OpenWebUI behind a reverse proxy (NGINX Proxy Manager, Traefik, or Caddy) with HTTPS and HTTP Basic Auth
  3. Never port-forward port 3000 or 11434 directly

OpenWebUI Admin Account

The first user to register in OpenWebUI becomes the administrator. If you expose it to even your LAN, set a strong password immediately.

Model Provenance

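Only pull models from Ollama’s official library (ollama.com/library) or Hugging Face mirrors you trust (via ollama pull). Untrusted model files are a risk: a malformed GGUF can target bugs in the loader, and nothing guarantees the weights behave as advertised. Ollama’s library is curated, which lowers that risk but does not remove it.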

Resource Limits

In Docker Compose, you can cap CPU and memory:

services:
  ollama:
    # ... other config ...
    deploy:
      resources:
        limits:
          cpus: '4'
          memory: 16G

This prevents a runaway LLM inference from starving other containers.


Troubleshooting

“Error: unable to load model - CUDA error: out of memory”

The model is too large for your GPU VRAM. Solutions:

  • Pull a smaller quantization: choose a lower-bit tag such as q4_K_M
  • Use a smaller model: drop from 13B to 7B
  • Offload only part of the model to the GPU: set the num_gpu parameter (layers placed on the GPU, remainder on CPU), for example /set parameter num_gpu 12 in the REPL or "options": { "num_gpu": 12 } in an API request

OpenWebUI shows “Ollama connection refused”

  • Verify the Ollama container is running: docker compose ps
  • Check that OLLAMA_HOST=0.0.0.0 is set (default binds to 127.0.0.1 only)
  • From inside the OpenWebUI container, test: curl http://ollama:11434/api/tags
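
The checks above can be wrapped in a small polling helper, which is useful right after docker compose up -d when Ollama may still be starting. This is a sketch under assumptions: wait_for_ollama is a hypothetical name, it runs on the Docker host against the published port 11434, and it needs curl installed.

```shell
# wait_for_ollama [base_url] [timeout_seconds]: poll /api/tags until it answers.
wait_for_ollama() {
  local url="${1:-http://localhost:11434}" timeout="${2:-30}" waited=0
  until curl -fsS --max-time 2 "$url/api/tags" >/dev/null 2>&1; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "Ollama did not answer at $url within ${timeout}s" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  echo "Ollama is up at $url"
}

# Usage:
#   docker compose up -d && wait_for_ollama http://localhost:11434 60
```

If the helper times out while docker compose ps shows the container running, the problem is more likely port publishing or a firewall than Ollama itself.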

Extremely slow on CPU (1–3 tokens/sec)

  • CPU-only 7B models are inherently slow on consumer hardware
  • Switch to a 3B model (e.g., llama3.2:3b) for usable 5–10 t/s on CPU
  • Consider a used GPU: a GTX 1070 8 GB costs ~$80 on eBay and handles 7B models at 20+ t/s

LXC container won’t start after enabling features

Ensure nesting=1 and keyctl=1 are both enabled. If the issue persists, convert to a VM — Docker-in-LXC is convenient but fragile.


Monitoring and Maintenance

Check Ollama Logs

docker compose logs -f ollama

Update Models

Ollama models don’t auto-update. Periodically check:

docker exec -it ollama ollama list
docker exec -it ollama ollama pull llama3.1:8b  # Re-pull an installed model; downloads only if a newer version exists

Update the Stack

cd ~/ollama-stack
docker compose pull       # Pull latest images
docker compose up -d      # Recreate with new images
docker image prune -f     # Clean old images

Disk Usage

docker exec -it ollama du -sh /root/.ollama/models/
# Delete unused models:
docker exec -it ollama ollama rm phi3:mini
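
To see at a glance how much space the installed models claim, the sizes from ollama list can be summed. The sketch below assumes the SIZE column is printed as a number followed by GB or MB in decimal units; total_model_gb is a hypothetical helper name, and the parsing will need adjusting if your Ollama version formats the table differently.

```shell
# total_model_gb: read `ollama list` output on stdin and print the combined
# size in GB. Skips the header row and looks for "<number> GB" or "<number> MB".
total_model_gb() {
  awk 'NR > 1 {
         for (i = 2; i < NF; i++) {
           if ($(i + 1) == "GB") { total += $i; break }
           if ($(i + 1) == "MB") { total += $i / 1000; break }
         }
       }
       END { printf "%.1f\n", total }'
}

# Usage (no -t flag, so the output can be piped cleanly):
#   docker exec ollama ollama list | total_model_gb
```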

Beyond the Basics

Add SearXNG for Web Search

OpenWebUI supports web search via SearXNG. Add to your Compose file:

  searxng:
    image: searxng/searxng:latest
    container_name: searxng
    restart: unless-stopped
    ports:
      - "8080:8080"
    volumes:
      - searxng_data:/etc/searxng
    environment:
      - SEARXNG_BASE_URL=http://searxng:8080

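Then in OpenWebUI → Settings → Web Search → set the SearXNG query URL to http://searxng:8080/search?q=<query>. SearXNG must also have the json output format enabled in its settings.yml, and searxng_data has to be declared under the top-level volumes: key alongside ollama_data and openwebui_data.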

Add Stable Diffusion for Image Generation

  automatic1111:
    image: ghcr.io/ai-dock/stable-diffusion-webui:latest
    container_name: sd-webui
    restart: unless-stopped
    ports:
      - "7860:7860"
    volumes:
      - sd_data:/opt/stable-diffusion-webui
    # Add GPU config as with Ollama

OpenWebUI can connect to it for image generation within chats.

Automate Model Pushes with Ansible

If you manage multiple Proxmox nodes, use Ansible to ensure Ollama is deployed everywhere with consistent models. See our Automating Your Homelab with Ansible guide.


Conclusion

You now have a fully private, self-hosted AI assistant running on your Proxmox homelab. No monthly fees, no data leaving your network, and full control over which models you run.

What You Achieved

  • ✅ Ollama serving open-weight LLMs on your own hardware
  • ✅ OpenWebUI providing a polished chat interface accessible from any device
  • ✅ Optional GPU acceleration for fast inference
  • ✅ A reproducible Docker Compose stack

What models are you running in your homelab? Drop a comment below!
