Kiwi - ElevenLabs Worldwide Hackathon
AI Tinkerers - Singapore
Hackathon Showcase Best CodeRabbit Winner

Kiwi

Team led by NUS Master's AI Engineers who ship computer vision systems (DINOv2, YOLOv8, TensorRT) and autonomous LLM agent platforms.

3 members

Kiwi — Autonomous Multi-Agent Video Creation Studio

Kiwi is an end-to-end autonomous multi-agent system that transforms either text or audio input into a fully produced video: script, storyboard, visuals, narration, and a final MP4 output. Many people have meaningful ideas but lack the skills or tools to create videos. Kiwi removes this barrier by enabling anyone to turn imagination into polished content through a flexible, conversational workflow.

Kiwi supports two creation paths:

  1. Users may choose to generate the video immediately from their input, or
  2. They may enter a guided conversation, where the system asks targeted follow-up questions to clarify tone, pacing, visual style, emotional intent, and scene details.

This dual-mode design lets users be as hands-off or as expressive as they prefer, while ensuring the final video accurately reflects their vision.
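
As a rough illustration of how the two paths could branch, here is a minimal sketch. It assumes a hypothetical `CreativeInputs` record whose optional fields mirror the details listed above (tone, pacing, visual style, emotional intent, scene details). The class, the function names, and the question wording are all illustrative assumptions, not taken from Kiwi's code.

```python
from dataclasses import dataclass, fields
from typing import Optional


# Hypothetical record of the details a guided conversation would clarify.
@dataclass
class CreativeInputs:
    idea: str
    tone: Optional[str] = None
    pacing: Optional[str] = None
    visual_style: Optional[str] = None
    emotional_intent: Optional[str] = None
    scene_details: Optional[str] = None


def missing_details(inputs: CreativeInputs) -> list[str]:
    """Return the names of optional details the user has not supplied yet."""
    return [
        f.name
        for f in fields(inputs)
        if f.name != "idea" and not getattr(inputs, f.name)
    ]


def next_step(inputs: CreativeInputs, guided: bool) -> str:
    """Decide whether to ask a follow-up question or start production."""
    gaps = missing_details(inputs)
    if guided and gaps:
        # Guided mode: ask about the first unanswered detail.
        return f"ask: What {gaps[0].replace('_', ' ')} do you have in mind?"
    # Instant mode, or nothing left to clarify.
    return "generate"
```

In this sketch, instant mode always returns `"generate"`, while guided mode keeps asking until every optional field is filled in.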


How Kiwi Works (End-to-End)

  1. The user provides an initial text prompt or audio input through a Next.js web interface. Audio is transcribed automatically.
  2. The user chooses between:
     • Instant Generation — the system proceeds directly to production, or
     • Guided Discussion — the DirectorOrchestrator asks follow-up questions to refine unclear or missing details.
  3. Once inputs are confirmed, the DirectorOrchestrator constructs a detailed Creative Brief using prompt-chaining.
  4. Parallel Phase 1
     • StoryLoaderAgent → script
     • StoryboardAgent → shot plan
     These run in parallel since both depend solely on the Creative Brief.
  5. Parallel Phase 2
     • VoiceActorAgent → narration from the script using ElevenLabs
     • FilmCrewAgent → video scenes from the storyboard using Veo 3
     Each starts immediately once its prerequisite is ready, enabling dependency-aware parallelism.
  6. MoviePy merges narration and visuals into a complete MP4.
  7. Clerk authentication manages user-specific sessions, ensuring private and isolated video generations.
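
The Creative Brief step above relies on prompt-chaining, where each prompt embeds the output of the previous one. The sketch below shows the general pattern only. The `llm` parameter is a stand-in for any text-generation call, and the three-stage split and prompt wording are assumptions rather than Kiwi's actual prompts.

```python
from typing import Callable

# Stand-in type for a text-generation call (prompt in, text out).
# The real model client is not shown in the source.
LLM = Callable[[str], str]


def build_creative_brief(user_input: str, llm: LLM) -> dict[str, str]:
    """Chain three prompts; each step's output feeds the next prompt."""
    summary = llm(f"Summarize the core idea of this video request:\n{user_input}")
    style = llm(f"Propose tone, pacing, and visual style for this idea:\n{summary}")
    brief = llm(
        "Write a creative brief for a short video.\n"
        f"Idea: {summary}\nStyle: {style}"
    )
    # Keep the intermediate outputs so later stages can inspect them.
    return {"summary": summary, "style": style, "brief": brief}
```

Passing the model call in as a parameter keeps the chain testable with a fake function before a real client is wired in.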

The entire pipeline runs autonomously end-to-end while offering users full control over how much creative guidance they provide.
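
The two parallel phases described above can be sketched with `asyncio`. The four coroutines are hypothetical stand-ins for the named agents, with a short sleep in place of each model call. The branch structure is one possible way to get dependency-aware parallelism and is not necessarily how Kiwi schedules its agents.

```python
import asyncio


# Hypothetical stand-ins for the four agents; each sleep simulates a model call.
async def story_loader(brief: str) -> str:
    await asyncio.sleep(0.01)
    return f"script({brief})"


async def storyboard(brief: str) -> str:
    await asyncio.sleep(0.02)
    return f"shots({brief})"


async def voice_actor(script: str) -> str:
    await asyncio.sleep(0.01)
    return f"narration({script})"


async def film_crew(shots: str) -> str:
    await asyncio.sleep(0.01)
    return f"scenes({shots})"


async def narration_branch(brief: str) -> str:
    # Narration waits only for the script, not for the storyboard.
    return await voice_actor(await story_loader(brief))


async def visuals_branch(brief: str) -> str:
    # Scene generation waits only for the storyboard, not for the script.
    return await film_crew(await storyboard(brief))


async def produce(brief: str) -> tuple[str, str]:
    # Both branches run concurrently; gather returns results in argument order.
    narration, scenes = await asyncio.gather(
        narration_branch(brief), visuals_branch(brief)
    )
    return narration, scenes
```

Calling `asyncio.run(produce("brief"))` returns `('narration(script(brief))', 'scenes(shots(brief))')`. Because each branch chains its own prerequisite, narration can begin as soon as the script is ready even if the storyboard is still being generated.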


Working Prototype

Kiwi is fully functional: it handles audio or text input, supports both instant and interactive modes, refines creative direction through conversation when needed, executes multi-agent workflows in parallel, and outputs a high-quality downloadable MP4. The system operates reliably across the full creation process.


Technical Complexity & Integration

Kiwi integrates advanced multimodal and agentic technologies:

  • Gemini Pro 3 for reasoning, follow-up questioning, and agent coordination
  • Veo 3 for high-quality video generation
  • ElevenLabs for voice synthesis
  • MoviePy for audio–video composition
  • Next.js for the browser-based user interface
  • Clerk for authentication
  • CodeRabbit for automated PR review

The system demonstrates sophisticated orchestration, multimodal handling, dynamic routing, dependency-driven scheduling, and a seamless browser-to-cloud pipeline.


Innovation & Creativity

Kiwi reframes video creation as a flexible, conversational filmmaking process. Users may generate videos instantly or collaborate with the system through guided refinement, similar to interacting with a human director. A coordinated team of AI agents then produces every creative component, offering a new paradigm for accessible and expressive storytelling.


Real-World Impact

Kiwi democratizes storytelling by allowing anyone—creators, educators, marketers, families—to produce compelling videos without technical skills. Support for both instant and guided creation reduces friction, saves time, and empowers users to articulate their ideas effectively while producing results that match their intent.


Theme Alignment

Kiwi embodies the hackathon’s focus on agentic AI: a conversational, multimodal system that understands intent, clarifies requirements, autonomously decomposes tasks, coordinates specialized agents, and generates complete videos from a single input. It tightly integrates partner technologies into a cohesive, production-ready workflow.


Technologies Used

  • Gemini Pro 3 — reasoning, clarification dialogue, task orchestration
  • Veo 3 — video generation
  • ElevenLabs — voice synthesis
  • MoviePy — audio–video merging
  • Next.js — user interface
  • Clerk — authentication
  • CodeRabbit — automated PR review

This project was built from scratch.

Clerk CodeRabbit ElevenLabs Google DeepMind

GitHub source code
