Kiwi
Team led by NUS Master's AI Engineers who ship computer vision systems (DINOv2, YOLOv8, TensorRT) and autonomous LLM agent platforms.
Project Description
Kiwi — Autonomous Multi-Agent Video Creation Studio
Kiwi is an end-to-end autonomous multi-agent system that transforms either text or audio input into a fully produced video: script, storyboard, visuals, narration, and a final MP4 output. Many people have meaningful ideas but lack the skills or tools to create videos. Kiwi removes this barrier by enabling anyone to turn imagination into polished content through a flexible, conversational workflow.
Kiwi supports two creation paths:
- Users may choose to generate the video immediately from their input, or
- They may enter a guided conversation, where the system asks targeted follow-up questions to clarify tone, pacing, visual style, emotional intent, and scene details.
This dual-mode design lets users be as hands-off or as expressive as they prefer, while ensuring the final video accurately reflects their vision.
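The choice between the two paths can be pictured as a small routing step. The sketch below is illustrative only: `CreativeRequest`, `missing_details`, and `follow_up_questions` are hypothetical names, and the detail fields simply mirror the topics listed above (tone, pacing, visual style, emotional intent, scene details). In Kiwi itself the questions come from the model, not from a fixed template.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class CreativeRequest:
    """Hypothetical container for what the user has told the system so far."""
    prompt: str
    tone: Optional[str] = None
    pacing: Optional[str] = None
    visual_style: Optional[str] = None
    emotional_intent: Optional[str] = None
    scene_details: Optional[str] = None


def missing_details(request: CreativeRequest) -> List[str]:
    """Return the names of the detail fields that are still unset."""
    return [f.name for f in fields(request) if getattr(request, f.name) is None]


def follow_up_questions(request: CreativeRequest, mode: str) -> List[str]:
    """Instant mode asks nothing; guided mode asks about each missing detail."""
    if mode not in ("instant", "guided"):
        raise ValueError(f"unknown mode: {mode!r}")
    if mode == "instant":
        return []
    return [
        f"What {name.replace('_', ' ')} would you like for this video?"
        for name in missing_details(request)
    ]


if __name__ == "__main__":
    request = CreativeRequest(prompt="A fox learns to surf", tone="playful")
    for question in follow_up_questions(request, "guided"):
        print(question)
```

In guided mode, an empty question list would signal that the request is complete and production can begin.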
How Kiwi Works (End-to-End)
- The user provides an initial text prompt or audio input through a Next.js web interface. Audio is transcribed automatically.
- The user chooses between:
  - Instant Generation — the system proceeds directly to production, or
  - Guided Discussion — the DirectorOrchestrator asks follow-up questions to refine unclear or missing details.
- Once inputs are confirmed, the DirectorOrchestrator constructs a detailed Creative Brief using prompt-chaining.
- Parallel Phase 1
  - StoryLoaderAgent → script
  - StoryboardAgent → shot plan
  These run in parallel since both depend solely on the Creative Brief.
- Parallel Phase 2
  - VoiceActorAgent → narration from the script using ElevenLabs
  - FilmCrewAgent → video scenes from the storyboard using Veo 3
  Each starts as soon as its own prerequisite is ready: narration needs only the script, and video scenes need only the storyboard, so neither branch waits on the other. This is what makes the parallelism dependency-aware.
- MoviePy merges narration and visuals into a complete MP4.
- Clerk authentication manages user-specific sessions, ensuring private and isolated video generations.
The entire pipeline runs autonomously end-to-end while offering users full control over how much creative guidance they provide.
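The dependency-aware scheduling in the two parallel phases can be sketched with `asyncio`. This is a minimal illustration, not Kiwi's actual code: `run_stage`, `voice_branch`, `visual_branch`, and `produce` are hypothetical names, and each agent call is replaced by a short sleep. The point is that chaining each branch separately lets narration begin the moment the script exists, even if the storyboard is still in progress.

```python
import asyncio
from typing import Dict, List


async def run_stage(name: str, source: str, delay: float, log: List[str]) -> str:
    """Stand-in for one agent call; the sleep simulates model latency."""
    log.append(f"{name}:start")
    await asyncio.sleep(delay)
    log.append(f"{name}:done")
    return f"{name}({source})"


async def voice_branch(brief: str, log: List[str]) -> str:
    # Script first, then narration, which depends only on the script.
    script = await run_stage("script", brief, 0.01, log)
    return await run_stage("narration", script, 0.01, log)


async def visual_branch(brief: str, log: List[str]) -> str:
    # Storyboard first, then scenes, which depend only on the storyboard.
    storyboard = await run_stage("storyboard", brief, 0.05, log)
    return await run_stage("scenes", storyboard, 0.01, log)


async def produce(brief: str, log: List[str]) -> Dict[str, str]:
    # Both branches run concurrently; neither waits on the other's stages.
    narration, scenes = await asyncio.gather(
        voice_branch(brief, log), visual_branch(brief, log)
    )
    return {"narration": narration, "scenes": scenes}


if __name__ == "__main__":
    events: List[str] = []
    print(asyncio.run(produce("brief", events)))
    print(events)
```

A strict two-phase barrier (finish all of Phase 1, then start Phase 2) would be simpler to write, but would hold narration back until the slower storyboard step completes.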
Working Prototype
Kiwi is fully functional: it handles audio or text input, supports both instant and interactive modes, refines creative direction through conversation when needed, executes multi-agent workflows in parallel, and outputs a high-quality downloadable MP4. The system operates reliably across the full creation process.
Technical Complexity & Integration
Kiwi integrates advanced multimodal and agentic technologies:
- Gemini 3 Pro for reasoning, follow-up questioning, and agent coordination
- Veo 3 for high-quality video generation
- ElevenLabs for voice synthesis
- MoviePy for audio–video composition
- Next.js for the browser-based web interface
- Clerk for authentication
- CodeRabbit for automated PR review
The system demonstrates sophisticated orchestration, multimodal handling, dynamic routing, dependency-driven scheduling, and a seamless browser-to-cloud pipeline.
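One part of this orchestration, the prompt-chaining used to build the Creative Brief, can be illustrated as follows. The sketch assumes a generic `llm` callable that maps a prompt string to a response string. The step templates and the `build_creative_brief` function are hypothetical and do not reflect the prompts Kiwi actually sends.

```python
from typing import Callable, Sequence

# Hypothetical chain: each template receives the previous step's output.
CHAIN_STEPS = (
    "Summarize the core idea of this video request in one sentence:\n{previous}",
    "Given this summary, propose tone, pacing, and visual style:\n{previous}",
    "Expand the following into a Creative Brief with scene details:\n{previous}",
)


def build_creative_brief(
    user_input: str,
    llm: Callable[[str], str],
    steps: Sequence[str] = CHAIN_STEPS,
) -> str:
    """Run the steps in order, feeding each response into the next prompt."""
    previous = user_input
    for template in steps:
        previous = llm(template.format(previous=previous))
    return previous


if __name__ == "__main__":
    # Fake model for demonstration: reports the size of the prompt it received.
    def fake_llm(prompt: str) -> str:
        return f"[response to {len(prompt)} chars]"

    print(build_creative_brief("A fox learns to surf", fake_llm))
```

Passing the model in as a plain callable keeps the chain testable without network access; a real client would wrap its API call in a function with the same shape.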
Innovation & Creativity
Kiwi reframes video creation as a flexible, conversational filmmaking process. Users may generate videos instantly or collaborate with the system through guided refinement, similar to interacting with a human director. A coordinated team of AI agents then produces every creative component, offering a new paradigm for accessible and expressive storytelling.
Real-World Impact
Kiwi democratizes storytelling by allowing anyone—creators, educators, marketers, families—to produce compelling videos without technical skills. Support for both instant and guided creation reduces friction, saves time, and empowers users to articulate their ideas effectively while producing results that match their intent.
Theme Alignment
Kiwi embodies the hackathon’s focus on agentic AI: a conversational, multimodal system that understands intent, clarifies requirements, autonomously decomposes tasks, coordinates specialized agents, and generates complete videos from a single input. It tightly integrates partner technologies into a cohesive, production-ready workflow.
Technologies Used
- Gemini 3 Pro — reasoning, clarification dialogue, task orchestration
- Veo 3 — video generation
- ElevenLabs — voice synthesis
- MoviePy — audio–video merging
- Next.js — user interface
- Clerk — authentication
- CodeRabbit — automated PR review
Prior Work
This project was built from scratch.