Video to Screenplay Conversion: Project Index

Overview

Feed it a video and it works toward a screenplay: detect scenes, pull keyframes, track and recognize characters by face, transcribe the dialogue, describe each scene visually, attribute lines to characters, and format the result. The running test subject throughout is an old Murder, She Wrote episode, a fitting choice given that the goal is to recover a script from finished footage.

Background

The interesting constraint here is the hardware. Rather than throw everything at one machine, the work is split by compute profile across two nodes that I actually have on the bench: a Raspberry Pi 5 with a Hailo-8L NPU as an edge node, and a desktop with an RTX 3090 for the model-heavy stages. The design reasoning is explicit: every model picked has to fit inside the 3090's 24 GB of VRAM.

How It Works

The split is clean: cheap, real-time CV goes to the Pi; the VRAM-bound work goes to the 3090; they coordinate over Flask REST.

# work split by compute profile
[ Raspberry Pi 5 + Hailo-8L ]        [ Windows PC + RTX 3090 ]
  scene detection (PySceneDetect)  ->   Whisper transcription
  face detection / tracking        ->   keyframe captioning (BLIP)
  keyframe extraction (OpenCV)     ->   screenplay assembly
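
To make the handoff concrete, here is a minimal sketch of what the Pi-side request for one scene might look like. Everything specific in it is an assumption: the /scenes route, the host address, and all field names are hypothetical and not taken from the shipped scripts. The request is built with the standard library and deliberately never sent, so it runs without either node or a Flask server.

```python
import json
import urllib.request
from dataclasses import dataclass, asdict


@dataclass
class SceneJob:
    """One scene's worth of Pi-side results (all field names are hypothetical)."""
    scene_index: int
    start_sec: float
    end_sec: float
    keyframe_path: str
    face_ids: list


def build_request(job: SceneJob, base_url: str) -> urllib.request.Request:
    """Build (but do not send) a JSON POST aimed at the GPU node."""
    body = json.dumps(asdict(job)).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/scenes",  # hypothetical route on the 3090 box
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Illustrative values only; the address is a placeholder LAN host.
job = SceneJob(3, 41.2, 58.9, "keyframes/scene_003.jpg", ["CHARACTER_1", "CHARACTER_2"])
req = build_request(job, "http://192.168.1.50:5000")
print(req.get_method(), req.full_url)
print(json.loads(req.data))
```

Sending only timestamps, paths, and IDs keeps the Pi's payload small; the heavy inputs (audio for Whisper, keyframes for BLIP) would need their own transfer path, which this sketch does not cover.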

Character handling is the clever part. There's no face database to start: characters get auto-IDs and are matched across frames by face-encoding distance. Names are then inferred from the dialogue: the system scans transcript segments near a character's on-screen appearances for capitalized words and direct address, and picks the most frequent. Dialogue is attributed by visual co-presence: whoever's face is visible within a couple of seconds of a spoken line is tagged as the speaker, and if nobody is on screen the line falls back to NARRATOR.
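
The three heuristics above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the shipped code: the function names, the "CHARACTER_n" ID format, the stop-word list, and the tie-breaking rule (nearest sighting wins) are all hypothetical. The 0.6 cutoff is the face_recognition library's default tolerance; whether the project uses that value is an assumption. The three-number encodings are toy stand-ins for real 128-dimensional face encodings.

```python
import math
import re
from collections import Counter

MATCH_THRESHOLD = 0.6  # assumed cutoff; matches face_recognition's default tolerance
WINDOW_SEC = 2.0       # "a couple of seconds"; the exact value is an assumption
STOPWORDS = {"The", "What", "Where", "Well", "Yes", "No", "Oh"}  # hypothetical list


def assign_id(encoding, known):
    """Return the ID of the nearest known face, or mint a new auto-ID."""
    best_id, best_dist = None, MATCH_THRESHOLD
    for cid, ref in known.items():
        dist = math.dist(encoding, ref)  # Euclidean distance between encodings
        if dist < best_dist:
            best_id, best_dist = cid, dist
    if best_id is None:
        best_id = f"CHARACTER_{len(known) + 1}"
        known[best_id] = encoding
    return best_id


def infer_name(appearances, segments):
    """Most frequent capitalized word spoken near any of a character's appearances."""
    counts = Counter()
    for seg in segments:
        if any(abs(seg["start"] - t) <= WINDOW_SEC for t in appearances):
            words = re.findall(r"\b[A-Z][a-z]+\b", seg["text"])
            counts.update(w for w in words if w not in STOPWORDS)
    return counts.most_common(1)[0][0] if counts else None


def attribute(segment, sightings):
    """Tag a line with the nearest face seen within the window, else NARRATOR."""
    in_window = [
        (abs(t - segment["start"]), cid)
        for t, cid in sightings
        if segment["start"] - WINDOW_SEC <= t <= segment["end"] + WINDOW_SEC
    ]
    return min(in_window)[1] if in_window else "NARRATOR"


# Toy data for illustration only.
known = {}
a = assign_id([0.10, 0.20, 0.30], known)
b = assign_id([0.12, 0.21, 0.29], known)  # close to the first face: reuses its ID
c = assign_id([0.90, 0.80, 0.70], known)  # far from it: gets a new auto-ID

segments = [
    {"start": 10.0, "end": 12.0, "text": "Jessica, did you see the letter?"},
    {"start": 13.0, "end": 15.0, "text": "I did, and it worries me."},
    {"start": 40.0, "end": 42.0, "text": "Meanwhile, across town."},
]
sightings = [(10.5, "CHARACTER_1"), (13.5, "CHARACTER_1")]

print(a, b, c)
print(infer_name([10.5, 13.5], segments))
print([attribute(seg, sightings) for seg in segments])
```

The sketch also makes the heuristic's weak spot easy to see: a name spoken in direct address usually belongs to the listener, so whoever is on screen when "Jessica" is said may not be Jessica. Frequency counting over a whole episode is what has to smooth that out.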

Two honest caveats. The Hailo NPU acceleration and the Ollama LLM screenplay formatting are part of the designed architecture and are fully fleshed out in the notes, but in the shipped code face recognition runs on the Pi's CPU and the final screenplay is assembled with plain string formatting rather than an LLM call. The real-vs-aspirational line matters here.
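
For a sense of what "plain string formatting" amounts to, here is a minimal sketch of a string-only scene formatter. It is not the shipped formatter: the function name, the "SCENE n" heading, the indent widths, and the sample caption and lines are all assumptions made up for illustration.

```python
SPEAKER_INDENT = " " * 20   # assumed layout widths, not taken from the project
DIALOGUE_INDENT = " " * 10


def format_scene(index, caption, dialogue):
    """Assemble one scene from a caption and (speaker, line) pairs using only strings."""
    parts = [f"SCENE {index}", "", caption, ""]
    for speaker, line in dialogue:
        parts.append(SPEAKER_INDENT + speaker.upper())
        parts.append(DIALOGUE_INDENT + line)
        parts.append("")  # blank line between speeches
    return "\n".join(parts)


# Sample inputs are invented; in the pipeline they would come from BLIP and Whisper.
text = format_scene(
    1,
    "A woman sits at a typewriter in a sunlit kitchen.",
    [("Jessica", "Did you see the letter?"), ("NARRATOR", "The phone rings.")],
)
print(text)
```

A formatter like this is deterministic and needs no GPU, which is presumably why it is what shipped; the trade-off is that it cannot rephrase captions into action lines the way an LLM pass could.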

Current Status

Archived as a working prototype. The core path runs end to end (scene detection, faces, Whisper transcription, BLIP captions, text screenplay) across the two machines, but the two headline accelerations are stubbed in the shipped scripts.

Pi↔3090 orchestration over Flask REST is real and working.
Hailo-8L acceleration is aspirational: the shipped Pi code uses CPU face_recognition instead.
Ollama LLM formatting (Mistral/Llama) lives in the notes; the shipped formatter concatenates strings.
No public deployment; LAN-only.