LLM-Based Media Server: Requirements

Overview

A self-hosted video media server that uses LLMs to automatically index, tag, classify, and generate metadata/descriptions/subtitles for a personal video library (movies, TV shows, home videos). Inspired by Plex/Jellyfin but with deep LLM integration. Deployable via Docker or bare-metal.


1. Functional Requirements

1.1 Media Ingestion & Library Management

  • Watch configured directories for new media files
  • Support formats: MKV, MP4, AVI, MOV, WEBM, TS
  • Fingerprint media files to detect duplicates
  • Organize library by: Movies / TV Shows / Home Videos
  • Store library state in a local database (SQLite or another embedded store)
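
The format filter and duplicate detection above could be sketched as follows. This is a minimal illustration, not a chosen design: the function names are hypothetical, and FNV-1a stands in for whatever fingerprint the project settles on (a real implementation would likely hash a sampled portion of the file with a stronger hash rather than all bytes with a non-cryptographic one).

```rust
use std::path::Path;

/// Container formats listed in the ingestion requirements.
const SUPPORTED_EXTENSIONS: [&str; 6] = ["mkv", "mp4", "avi", "mov", "webm", "ts"];

/// Returns true if the path has a supported extension (case-insensitive).
/// Hypothetical helper, not part of the original spec.
pub fn is_supported_media(path: &Path) -> bool {
    path.extension()
        .and_then(|ext| ext.to_str())
        .map(|ext| SUPPORTED_EXTENSIONS.iter().any(|s| ext.eq_ignore_ascii_case(s)))
        .unwrap_or(false)
}

/// 64-bit FNV-1a hash over the given bytes, used here as a stand-in fingerprint.
/// Two files with the same fingerprint are treated as duplicate candidates.
pub fn fingerprint(bytes: &[u8]) -> u64 {
    const OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
    const PRIME: u64 = 0x0000_0100_0000_01b3;
    let mut hash = OFFSET_BASIS;
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(PRIME);
    }
    hash
}
```

The watcher would call is_supported_media on each new path and skip ingestion when fingerprint matches a value already stored in the library database.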

1.2 LLM-Powered Auto-Classification

  • Identify content type (movie, TV episode, home video) from filename + video analysis
  • Match movies/TV shows against known titles (local heuristics + LLM reasoning)
  • Extract season/episode numbers for TV shows
  • Classify content genre, mood, rating (family-safe, etc.)
  • Confidence scoring for all LLM-generated tags; flag low-confidence tags for manual review
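
As one possible shape for the filename heuristics, the sketch below extracts season/episode numbers from the common SxxEyy marker and applies a review threshold to a confidence score. The function names and the threshold parameter are assumptions for illustration; other naming schemes (for example 1x05) are not handled and would fall through to LLM disambiguation.

```rust
/// Scans a filename for an `SxxEyy` marker (case-insensitive) and returns
/// (season, episode) for the first match. Hypothetical helper.
pub fn parse_season_episode(filename: &str) -> Option<(u32, u32)> {
    let bytes = filename.as_bytes();
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i].eq_ignore_ascii_case(&b's') {
            if let Some((season, next)) = read_number(bytes, i + 1) {
                if next < bytes.len() && bytes[next].eq_ignore_ascii_case(&b'e') {
                    if let Some((episode, _)) = read_number(bytes, next + 1) {
                        return Some((season, episode));
                    }
                }
            }
        }
        i += 1;
    }
    None
}

/// Reads up to four consecutive ASCII digits starting at `start`.
/// Returns the parsed value and the index just past the digits.
fn read_number(bytes: &[u8], start: usize) -> Option<(u32, usize)> {
    let mut end = start;
    let mut value: u32 = 0;
    while end < bytes.len() && bytes[end].is_ascii_digit() && end - start < 4 {
        value = value * 10 + u32::from(bytes[end] - b'0');
        end += 1;
    }
    if end > start {
        Some((value, end))
    } else {
        None
    }
}

/// Tags scoring below the threshold are flagged for manual review.
pub fn needs_review(confidence: f32, threshold: f32) -> bool {
    confidence < threshold
}
```

A filename with no marker returns None, which is the signal to treat the file as a movie or home video candidate and hand it to the LLM step.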

1.3 Metadata Generation

Metadata priority order: TMDB/IMDB first → LLM as fallback/supplement.

Primary: External Data Sources

  • Fetch structured metadata from TMDB (movies, TV shows): title, year, director, cast, genre, runtime, language, poster, backdrop
  • Fetch ratings and IDs from IMDB (via TMDB's IMDB ID field)
  • Fetch TV episode details from TMDB: season/episode synopsis, air date, guest cast
  • Cache API responses locally to avoid redundant external calls
  • Store source attribution (e.g., metadata_source: tmdb) per item

Fallback/Supplement: LLM-Generated

LLM metadata is used only when the external source returns no match or partial data:

  • Generate description/summary for unmatched titles (home videos, obscure content)
  • Fill missing fields that TMDB/IMDB did not return
  • Tag and describe home videos (no external source exists for these)
  • Mark all LLM-generated fields with an llm_generated: true flag for transparency
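
The priority order (external source first, LLM only for gaps) can be expressed per field. The sketch below is illustrative only: Source, Field, and resolve_field are hypothetical names, and the source tag doubles as the llm_generated flag described above.

```rust
/// Where a metadata value came from.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Source {
    Tmdb,
    Llm,
}

/// A single metadata value plus its provenance.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Field {
    pub value: String,
    pub source: Source,
}

impl Field {
    /// Mirrors the `llm_generated: true` flag from the requirements.
    pub fn llm_generated(&self) -> bool {
        self.source == Source::Llm
    }
}

/// Treats missing and whitespace-only values the same way.
fn non_empty(v: Option<&str>) -> Option<&str> {
    v.map(str::trim).filter(|s| !s.is_empty())
}

/// Prefers the TMDB value; uses the LLM value only when TMDB returned nothing.
pub fn resolve_field(tmdb: Option<&str>, llm: Option<&str>) -> Option<Field> {
    if let Some(v) = non_empty(tmdb) {
        return Some(Field { value: v.to_string(), source: Source::Tmdb });
    }
    non_empty(llm).map(|v| Field { value: v.to_string(), source: Source::Llm })
}
```

For home videos the TMDB argument is always None, so every populated field ends up tagged as LLM-generated.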

General

  • Embed/store final metadata in sidecar files (NFO/JSON) alongside originals
  • Thumbnail/poster: use TMDB poster if available, else extract keyframe via ffmpeg
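
For the ffmpeg fallback, one way to grab a single frame is to seek to an offset and write exactly one video frame. The sketch below only builds the command and does not run it. The function name and the seek offset are assumptions; the flags used (-y, -ss, -i, -frames:v, -q:v) are standard ffmpeg options, but note that this grabs the frame at the requested timestamp rather than strictly selecting a keyframe.

```rust
use std::path::Path;
use std::process::Command;

/// Builds an ffmpeg invocation that writes one frame of `input`,
/// taken `seek_seconds` into the file, to `output`.
/// Hypothetical helper; the caller decides when to spawn it.
pub fn keyframe_command(input: &Path, output: &Path, seek_seconds: u32) -> Command {
    let mut cmd = Command::new("ffmpeg");
    cmd.arg("-y") // overwrite an existing thumbnail
        .arg("-ss")
        .arg(seek_seconds.to_string()) // seek before decoding (placed before -i)
        .arg("-i")
        .arg(input)
        .args(["-frames:v", "1"]) // emit a single video frame
        .args(["-q:v", "2"]) // high JPEG quality
        .arg(output);
    cmd
}
```

Calling status() or output() on the returned Command would run ffmpeg; keeping construction separate makes the argument list easy to inspect and test.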

1.4 LLM Provider Abstraction

  • Support local LLMs via Ollama (llama.cpp-compatible)
  • Support Claude API (Anthropic) for cloud inference
  • Support OpenAI-compatible APIs
  • Provider selection per task type (e.g., local for tagging, cloud for summaries)
  • Graceful fallback: cloud → local if cloud unavailable
  • Token/cost tracking for cloud providers
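
A minimal sketch of the provider abstraction, assuming a single trait with per-task routing and a local fallback. All names here (Task, LlmProvider, Router, StubProvider) are hypothetical, the call is synchronous for brevity, and token/cost tracking is left out.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Task {
    Tagging,
    Summary,
}

#[derive(Debug, PartialEq, Eq)]
pub enum LlmError {
    Unavailable,
}

/// Common interface over Ollama, Claude, and OpenAI-compatible backends.
pub trait LlmProvider {
    fn name(&self) -> &str;
    fn complete(&self, prompt: &str) -> Result<String, LlmError>;
}

pub struct Router {
    preferred: HashMap<Task, Box<dyn LlmProvider>>,
    local_fallback: Box<dyn LlmProvider>,
}

impl Router {
    pub fn new(local_fallback: Box<dyn LlmProvider>) -> Self {
        Router { preferred: HashMap::new(), local_fallback }
    }

    /// Assigns a provider to a task type (e.g. cloud for summaries).
    pub fn route(&mut self, task: Task, provider: Box<dyn LlmProvider>) {
        self.preferred.insert(task, provider);
    }

    /// Tries the task's preferred provider, then falls back to the local one.
    /// Returns (provider name, completion text).
    pub fn complete(&self, task: Task, prompt: &str) -> Result<(String, String), LlmError> {
        if let Some(p) = self.preferred.get(&task) {
            if let Ok(text) = p.complete(prompt) {
                return Ok((p.name().to_string(), text));
            }
        }
        let text = self.local_fallback.complete(prompt)?;
        Ok((self.local_fallback.name().to_string(), text))
    }
}

/// Test double standing in for a real HTTP-backed provider.
pub struct StubProvider {
    pub name: &'static str,
    pub available: bool,
}

impl LlmProvider for StubProvider {
    fn name(&self) -> &str {
        self.name
    }
    fn complete(&self, prompt: &str) -> Result<String, LlmError> {
        if self.available {
            Ok(format!("[{}] {}", self.name, prompt))
        } else {
            Err(LlmError::Unavailable)
        }
    }
}
```

Returning the provider name alongside the text gives the caller what it needs to log cloud calls and attribute cost per provider.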

1.5 Streaming & Playback

  • HTTP streaming with range request support
  • HLS adaptive bitrate transcoding (via ffmpeg)
  • Direct play for supported client formats
  • Basic web UI for browsing and playback
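
Range request support comes down to resolving the Range header against the file length. The sketch below handles the three single-range forms (first-last, first-, and -suffix) and rejects multi-range requests for simplicity. ByteRange and parse_range are hypothetical names, and a production handler would also need to emit the matching 206 and 416 responses.

```rust
/// Inclusive byte range resolved against a file of known length.
#[derive(Debug, PartialEq, Eq)]
pub struct ByteRange {
    pub start: u64,
    pub end: u64,
}

/// Parses a single-range `Range` header value such as "bytes=0-499".
/// Returns None for malformed, multi-range, or unsatisfiable requests.
pub fn parse_range(header: &str, file_len: u64) -> Option<ByteRange> {
    let spec = header.trim().strip_prefix("bytes=")?;
    if file_len == 0 || spec.contains(',') {
        return None;
    }
    let (first, last) = spec.split_once('-')?;
    let (first, last) = (first.trim(), last.trim());

    let range = if first.is_empty() {
        // Suffix form "bytes=-N": the final N bytes of the file.
        let n: u64 = last.parse().ok()?;
        if n == 0 {
            return None;
        }
        ByteRange { start: file_len.saturating_sub(n), end: file_len - 1 }
    } else {
        let start: u64 = first.parse().ok()?;
        // Open-ended form "bytes=N-" runs to the end; explicit ends are clamped.
        let end = if last.is_empty() {
            file_len - 1
        } else {
            last.parse::<u64>().ok()?.min(file_len - 1)
        };
        ByteRange { start, end }
    };

    if range.start > range.end || range.start >= file_len {
        None
    } else {
        Some(range)
    }
}
```

For a 1000-byte file, "bytes=500-" resolves to bytes 500 through 999, which is the form most players send when seeking.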

1.6 Search & Discovery

  • Full-text search across titles, descriptions, tags
  • Natural language search ("action movies from the 90s", "videos of kids at the beach")
  • Filter by: genre, year, rating, tags, classification confidence
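
One way to connect natural language search to the filters above is to have the LLM translate a query into a structured filter, which is then applied to the index. The sketch below covers only the filter side; Item and Filter are hypothetical types, and the translation step (for example "action movies from the 90s" becoming genre = action, years 1990 to 1999) is assumed to happen elsewhere.

```rust
/// Minimal view of a library item for filtering purposes.
pub struct Item {
    pub title: String,
    pub year: u16,
    pub genres: Vec<String>,
    pub confidence: f32,
}

/// Structured filter; unset fields match everything.
#[derive(Default)]
pub struct Filter {
    pub genre: Option<String>,
    pub year_from: Option<u16>,
    pub year_to: Option<u16>,
    pub min_confidence: Option<f32>,
}

impl Filter {
    /// Returns true when the item satisfies every constraint that is set.
    pub fn matches(&self, item: &Item) -> bool {
        let genre_ok = self
            .genre
            .as_ref()
            .map_or(true, |g| item.genres.iter().any(|x| x.eq_ignore_ascii_case(g)));
        let from_ok = self.year_from.map_or(true, |y| item.year >= y);
        let to_ok = self.year_to.map_or(true, |y| item.year <= y);
        let conf_ok = self.min_confidence.map_or(true, |c| item.confidence >= c);
        genre_ok && from_ok && to_ok && conf_ok
    }
}
```

Keeping the filter as plain data means the same structure can serve both the REST query parameters and the LLM-translated queries.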

1.7 API

  • REST API for all library operations
  • Webhook/event system for processing status updates
  • API key authentication
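
For API key authentication, a simple check might compare the presented key against the configured one without short-circuiting on the first mismatched byte. This is a best-effort sketch with hypothetical function names; the header that carries the key is not specified in this document, and a hardened implementation would likely rely on a vetted constant-time comparison crate.

```rust
/// Compares two byte strings without exiting early on the first difference.
/// Length is still revealed, which is usually acceptable for fixed-size keys.
pub fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y;
    }
    diff == 0
}

/// Returns true only when a key was presented and it matches the configured key.
/// An empty configured key never authorizes anything.
pub fn is_authorized(header_value: Option<&str>, expected_key: &str) -> bool {
    if expected_key.is_empty() {
        return false;
    }
    match header_value {
        Some(v) => constant_time_eq(v.trim().as_bytes(), expected_key.as_bytes()),
        None => false,
    }
}
```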

2. Non-Functional Requirements

2.1 Performance

  • Processing pipeline must not block streaming; run as background workers
  • Stream start latency (request to first byte): < 2s for direct play; < 5s for transcoded streams
  • Support concurrent streams: minimum 2 simultaneous (hardware-dependent)
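
To keep processing off the streaming path, jobs can be pushed onto a queue and drained by a small pool of background workers. The sketch below uses std threads and channels so it stays dependency-free; the tech stack in this document names Tokio, so treat this as an illustration of the pattern rather than the intended runtime. run_jobs and the "indexed:" result format are hypothetical.

```rust
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;

/// Processes each path on one of `worker_count` background threads and
/// returns the sorted results once the queue is drained.
pub fn run_jobs(paths: Vec<String>, worker_count: usize) -> Vec<String> {
    let (job_tx, job_rx) = mpsc::channel::<String>();
    let (done_tx, done_rx) = mpsc::channel::<String>();
    let job_rx = Arc::new(Mutex::new(job_rx));

    let mut handles = Vec::new();
    for _ in 0..worker_count.max(1) {
        let job_rx = Arc::clone(&job_rx);
        let done_tx = done_tx.clone();
        handles.push(thread::spawn(move || loop {
            // Hold the lock only while receiving, not while processing.
            let job = job_rx.lock().unwrap().recv();
            match job {
                Ok(path) => {
                    // Placeholder for real work (probe, classify, fetch metadata).
                    done_tx.send(format!("indexed:{}", path)).unwrap();
                }
                Err(_) => break, // queue closed: no more jobs
            }
        }));
    }
    drop(done_tx); // only worker clones remain

    for p in paths {
        job_tx.send(p).unwrap();
    }
    drop(job_tx); // closing the queue lets workers exit

    for h in handles {
        h.join().unwrap();
    }
    let mut results: Vec<String> = done_rx.iter().collect();
    results.sort();
    results
}
```

In the server, the sending half of the job queue would live with the ingest pipeline while the HTTP handlers never touch it, so a slow LLM call cannot stall a stream.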

2.2 Storage

  • Metadata and index stored locally (no mandatory cloud dependency)
  • Sidecar files (.nfo, .json) stored alongside media
  • SQLite for metadata DB (upgradeable to PostgreSQL via config)

2.3 Privacy

  • All LLM inference can run fully local (no data leaves the machine)
  • Cloud LLM calls are opt-in and clearly logged
  • No telemetry by default

2.4 Reliability

  • Crash recovery: resume interrupted processing jobs on restart
  • Idempotent processing: re-indexing a file does not duplicate metadata
  • Graceful degradation: server remains operational if LLM provider is unavailable
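
Crash recovery and idempotent processing can both hang off a job table keyed by file fingerprint. The sketch below keeps that table in memory purely for illustration; in the server it would presumably be a SQLite table. JobState, JobStore, and the method names are hypothetical.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JobState {
    Pending,
    Running,
    Done,
}

/// In-memory stand-in for a persistent job table keyed by fingerprint.
#[derive(Default)]
pub struct JobStore {
    jobs: HashMap<u64, JobState>,
}

impl JobStore {
    /// Enqueues a file by fingerprint. Re-submitting a known file is a no-op,
    /// which keeps re-indexing idempotent. Returns true if the job is new.
    pub fn enqueue(&mut self, fingerprint: u64) -> bool {
        if self.jobs.contains_key(&fingerprint) {
            return false;
        }
        self.jobs.insert(fingerprint, JobState::Pending);
        true
    }

    /// Updates the state of a known job; unknown fingerprints are ignored.
    pub fn set_state(&mut self, fingerprint: u64, state: JobState) {
        if let Some(s) = self.jobs.get_mut(&fingerprint) {
            *s = state;
        }
    }

    /// Called at startup: jobs left Running by a crash go back to Pending.
    /// Returns how many jobs were reset.
    pub fn recover(&mut self) -> usize {
        let mut reset = 0;
        for s in self.jobs.values_mut() {
            if *s == JobState::Running {
                *s = JobState::Pending;
                reset += 1;
            }
        }
        reset
    }

    pub fn state(&self, fingerprint: u64) -> Option<JobState> {
        self.jobs.get(&fingerprint).copied()
    }
}
```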

2.5 Portability

  • Docker image with bundled ffmpeg
  • Single binary for bare-metal deployment (Rust, matching the backend language chosen in section 4)
  • Config via TOML file + environment variable overrides
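
The override rule (environment variable wins over the TOML value) might look like the sketch below. The MEDIASERVER_ prefix, the dotted key paths, and both function names are assumptions made for illustration; the environment lookup is passed in as a closure so the logic can be exercised without touching the real process environment.

```rust
/// Maps a dotted config path to an environment variable name,
/// e.g. "llm.ollama_url" becomes "MEDIASERVER_LLM_OLLAMA_URL".
/// The prefix is a hypothetical choice, not defined by the spec.
pub fn env_key(config_path: &str) -> String {
    let mut key = String::from("MEDIASERVER_");
    for c in config_path.chars() {
        key.push(if c.is_ascii_alphanumeric() { c.to_ascii_uppercase() } else { '_' });
    }
    key
}

/// Resolves a setting: the environment value wins, then the TOML file value.
pub fn resolve<F>(config_path: &str, file_value: Option<&str>, env: F) -> Option<String>
where
    F: Fn(&str) -> Option<String>,
{
    env(&env_key(config_path)).or_else(|| file_value.map(String::from))
}
```

In the server, the closure would typically be |k| std::env::var(k).ok(), with file_value coming from the parsed TOML document.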

3. Architecture Overview

┌─────────────────────────────────────────────────────────┐
│                     Media Server Core                    │
│                                                         │
│  ┌──────────┐  ┌──────────────┐  ┌──────────────────┐  │
│  │  Watcher │→ │  Ingest      │→ │  Processing      │  │
│  │ (inotify)│  │  Pipeline    │  │  Queue (workers) │  │
│  └──────────┘  └──────────────┘  └────────┬─────────┘  │
│                                           │             │
│              ┌────────────────────────────┤             │
│              ▼            ▼               ▼             │
│            ┌──────────────┐  ┌──────────────┐          │
│            │ Metadata     │  │Classification│          │
│            │ Worker       │  │Worker        │          │
│            └──────┬───────┘  └──────┬───────┘          │
│                   └─────────────────┘                   │
│                          │                              │
│                   ┌──────▼──────┐                       │
│                   │ LLM Router  │                       │
│                   └──┬──────┬───┘                       │
│                      │      │                           │
│              ┌───────┘      └──────────┐                │
│         ┌────▼─────┐           ┌───────▼──────┐         │
│         │  Ollama  │           │  Claude/OAI  │         │
│         │ (local)  │           │  (cloud)     │         │
│         └──────────┘           └──────────────┘         │
│                                                         │
│  ┌──────────────┐  ┌─────────────┐  ┌───────────────┐  │
│  │  SQLite DB   │  │  HTTP API   │  │  Web UI       │  │
│  │  (metadata)  │  │  (REST)     │  │  (player)     │  │
│  └──────────────┘  └─────────────┘  └───────────────┘  │
└─────────────────────────────────────────────────────────┘

4. Tech Stack Decisions

Concern           | Choice                      | Rationale
------------------+-----------------------------+------------------------------------------------------------------------------
Backend language  | Rust                        | Memory safety, zero-cost abstractions, ideal for media processing throughput
Web framework     | Axum                        | Async, tower-compatible, ergonomic REST routing
Async runtime     | Tokio                       | Industry-standard async runtime for Rust
Database          | SQLite (sqlx)               | Zero-config, embeddable, async support via sqlx
Media processing  | ffmpeg (subprocess)         | Industry standard, broad format support
LLM (local)       | Ollama REST API             | Simple HTTP interface, model management built-in
LLM (cloud)       | Anthropic + OpenAI HTTP API | Via reqwest, provider abstraction layer
Containerization  | Docker + Compose            | Multi-service: server + Ollama + optional GPU
Config format     | TOML                        | Human-friendly, serde-compatible (toml crate)
Frontend          | TBD                         | Pure backend for now; API-first design

Frontend deferred: The server exposes a clean REST API. Frontend tech will be decided after core backend is stable.


5. MVP Scope (Phase 1)

Goal: Working library scanner + LLM tagging + REST API + streaming (pure backend)

  • Directory watcher + file ingestion
  • Movie/TV classification (filename heuristics + LLM disambiguation)
  • Metadata fetch from TMDB API (primary); LLM fills unmatched/missing fields only
  • Thumbnail extraction
  • SQLite metadata store (sqlx async)
  • REST API: list library, get item, trigger re-scan
  • Direct-play HTTP streaming
  • Ollama integration (local LLM)
  • Docker Compose setup
  • Frontend: TBD (API-first, no UI in MVP)

6. Phase 2 (Post-MVP)

  • HLS transcoding
  • Claude API / OpenAI API integration + provider router
  • Natural language search
  • Enrichment from additional external metadata sources (TVDB, Trakt)
  • Multi-user support with watch history

7. Phase 3 (Future)

  • Mobile-friendly UI / PWA
  • GPU-accelerated transcoding
  • Home video scene detection + auto-chapter marking
  • Face recognition for home video tagging
  • Collections and playlists
  • Client apps (Jellyfin protocol compatibility)

8. Open Questions

  • Should the server expose a Jellyfin-compatible API to leverage existing clients (Infuse, Swiftfin)?
  • GPU passthrough in Docker for local LLM acceleration — required or optional?
  • Should home video tagging use vision models (LLaVA/Claude vision) for frame analysis?
  • TMDB/TVDB integration (resolved): TMDB is the primary metadata source; the LLM fills gaps only
  • Multi-user: single-user MVP acceptable, or needed from day one?